This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Create offloading entries when using the new driver
ClosedPublic

Authored by jhuber6 on Apr 10 2022, 1:15 PM.

Details

Summary

The changes made in D123460 generalized the code generation for OpenMP's
offloading entries. We can use the same scheme to register globals for
CUDA code. This patch adds the code generation to create these
offloading entries when compiling using the new offloading driver mode.
The offloading entries are simple structs that contain the information
necessary to register the global. The struct used is as follows:

Type struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it a function).
  int32_t flags;
  int32_t reserved;
};

Currently CUDA handles RDC code generation by deferring the registration
of globals in the current TU to a callback function containing the
modules ID. Later all the module IDs will be used to register all of the
globals at once. Rather than mimic this, offloading entries allow us to
mimic the way OpenMP registers globals. That is, we create a simple
global struct for each device global to be registered. These are placed
at a special section cuda_offloading_entires. Because this section is
a valid C-identifier, the linker will profide a __start and __stop
pointer that we can use to iterate and register all globals at runtime.

the registration requires a flag variable to indicate which registration
function to use. I have assigned the flags somewhat arbitrarily, but
these use the following values.

Kernel: 0
Variable: 0
Managed: 1
Surface: 2
Texture: 3

Depends on D120272

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 10 2022, 1:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2022, 1:15 PM
jhuber6 requested review of this revision.Apr 10 2022, 1:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2022, 1:15 PM

Is OpenMP runtime able to find these entries without registering them through some API functions? If so, do you have a pointer to the code doing that?

most CUDA/HIP programs assume -fno-gpu-rdc mode, which have multiple sections containing these entries merged by linker, with gaps between them. How do runtime identify such gaps and skip them?

Is OpenMP runtime able to find these entries without registering them through some API functions? If so, do you have a pointer to the code doing that?

Yes, the linker will define __start/__stop symbols for any sections found with a section name that is a valid C-identifier. If you compile the following file with any OpenMP offloading code you should be able to print out all the symbol names that will be registered when the runtime is initialized.

#include <stdint.h>
#include <stdio.h>
struct __tgt_offload_entry {
  void *addr;  // Pointer to the offload entry info.
               // (function or global)
  char *name;  // Name of the function or global.
  size_t size; // Size of the entry info (0 if it a function).
  int32_t flags;
  int32_t reserved;
};

extern struct __tgt_offload_entry __start_omp_offloading_entries;
extern struct __tgt_offload_entry __stop_omp_offloading_entries;

__attribute__((constructor)) void print() {
  struct __tgt_offload_entry *iter = &__start_omp_offloading_entries;
  for (; iter != &__stop_omp_offloading_entries; ++iter)
    printf("%s\n", iter->name);
}

And then compile like

$ clang input.c -fopenmp -fopenmp-targets=nvptx64 -c
$ clang print.c -c
$ clang input.o print.o -fopenmp -fopenmp-targets=nvptx64
$ ./a.out 
x
__omp_offloading_fd02_605785f3_main_l8

most CUDA/HIP programs assume -fno-gpu-rdc mode, which have multiple sections containing these entries merged by linker, with gaps between them. How do runtime identify such gaps and skip them?

I'm making the executive decision to always enable fgpu-rdc when using this new driver in the future. The above is handled by the linker so there shouldn't be any gaps.

tra added a comment.Apr 12 2022, 12:33 PM

I've mentioned in D123441 that it would be useful to have a list of GPU-side symbols needed by the host and this offload info is pretty close to what we need. The only remaining feature is being able to extract them by external tool, so we could pass them to the GPU-side linker. Perhaps we could just generate a GPU-side stub file which would only have an array of needed GPU-side references, compile and add it to the GPU-side linker as yet another input which would ensure we do link in the exact set of GPU objects from the static libraries.

I've mentioned in D123441 that it would be useful to have a list of GPU-side symbols needed by the host and this offload info is pretty close to what we need. The only remaining feature is being able to extract them by external tool, so we could pass them to the GPU-side linker. Perhaps we could just generate a GPU-side stub file which would only have an array of needed GPU-side references, compile and add it to the GPU-side linker as yet another input which would ensure we do link in the exact set of GPU objects from the static libraries.

These will probably only be valid once the final executable is linked. Since the structure contains a pointers to other symbols they'll only have non-null values after the final linking. After linking for the host you should be able to just use something like objdump -s -j cuda_offloading_entries to get all of them. For my use-case I only need to be able to iterate these symbols when the program is run. If we want to use this for something else it would be good to keep them synced up to avoid duplicating error. Also the patches say "CUDA" but the vast majority will also apply to HIP without much change.

HIP is considering a unified device binary embedding scheme with OpenMP. However, some large MI frameworks are compiled with -fno-gpu-rdc. If compiling with -fgpu-rdc, the linking time will significantly increase since the post-linking optimizations take much longer time with the large linked IR. Therefore, it would be desirable if the new OpenMP device binary embedding scheme supports -fno-gpu-rdc mode.

That said, I think this new scheme may work for -fno-gpu-rdc, probably with some minor changes.

For -fno-gpu-rdc, each TU has its own device binary, so the device binaries in the final image would be per GPU and per TU. That seems not a big problem since they can be post-fixed with a unique ID for each TU.

Different offload entries may have the same name in different TU's, therefore an offload entry may not be uniquely identified by its name. To uniquely identify an offload entry, it needs its name and the pointer to its belonging device binary. Therefore, it would be desirable to have one extra field 'owner':

Type struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it a function).
  int32_t flags;
  void  *owner; // pointer to the device binary containing this offload-entry
  int32_t reserved;
};

It may be possible to use the reserved field for that purpose. However, it is not sure if reserved will be used for some other purpose later.

Another choice is to let addr point to a struct which contains owner info. However, that would introduce another level of indirection.

HIP is considering a unified device binary embedding scheme with OpenMP. However, some large MI frameworks are compiled with -fno-gpu-rdc. If compiling with -fgpu-rdc, the linking time will significantly increase since the post-linking optimizations take much longer time with the large linked IR. Therefore, it would be desirable if the new OpenMP device binary embedding scheme supports -fno-gpu-rdc mode.

This work should be very close to that, the new driver allows us to link everything together so OpenMP can call HIP / CUDA functions and vice-versa. I have done some preliminary tests with registering CUDA device variables with OpenMP, the only change required is to store these offloading sections at omp_offloading_entries and the OpenMP runtime will pick them up and try to register them. This method allows us to compile HIP / CUDA with OpenMP but since we're going to be registering two different images they'll have unique state. For full interoperability we'd need some way for make either HIP / CUDA or OpenMP "borrow" the other one's registered image so they can share the state.

That said, I think this new scheme may work for -fno-gpu-rdc, probably with some minor changes.

My understanding is that non-RDC builds do all the registration per-TU. Since that's the case then we should just be able to link them as we do now and they won't emit any device code that needs to be linked. So individual files could specify no-rdc and then they wouldn't be touched by the device linker run later.

For -fno-gpu-rdc, each TU has its own device binary, so the device binaries in the final image would be per GPU and per TU. That seems not a big problem since they can be post-fixed with a unique ID for each TU.

Different offload entries may have the same name in different TU's, therefore an offload entry may not be uniquely identified by its name. To uniquely identify an offload entry, it needs its name and the pointer to its belonging device binary. Therefore, it would be desirable to have one extra field 'owner':

Type struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it a function).
  int32_t flags;
  void  *owner; // pointer to the device binary containing this offload-entry
  int32_t reserved;
};

It may be possible to use the reserved field for that purpose. However, it is not sure if reserved will be used for some other purpose later.

For OpenMP we use an exec_mode global to control some kernel execution, there's a possibility we'd want to put it in the reserved field instead. We could add more fields to this, but it would break the ABI. We could work around that but it would be some additional complexity.

Another choice is to let addr point to a struct which contains owner info. However, that would introduce another level of indirection.

Yeah, I think for arbitrary extensions that would be the easiest way without breaking the ABI. We could use the reserved field to indicate if we have some "extension" there.

I think we're working through some similar stuff. I haven't worked much with HIP but I think there would be some benefit to bringing this all under the new driver I've been working on for OpenMP. Let me know if you want to collaborate on something for getting this to work with HIP.

jhuber6 updated this revision to Diff 426150.Apr 29 2022, 1:16 PM

Fixed missing info flag for --offload-new-driver.

jhuber6 updated this revision to Diff 426404.May 2 2022, 7:03 AM

Fix test.

tra added a comment.May 6 2022, 10:48 AM

Type struct __tgt_offload_entry {

void    *addr;      // Pointer to the offload entry info.
                    // (function or global)
char    *name;      // Name of the function or global.
size_t  size;       // Size of the entry info (0 if it a function).
int32_t flags;
int32_t reserved;

};

One thing you need to consider is that this introduces a new ABI.
This structure may change over time and we will need to be able to deal with libraries compiled with potentially different version of clang which may use a different format for the entries.
I think we may need some sort of version stamp.
We could use the section name for this purpose and rename it when we change the struct format, but that would be a bit more fragile as it's easier to forget to update the name if/when the struct format changes.
Also, format mismatch would looks like offload section is missing, which would need special handling when we diagnose the problem to distinguish incompatible offload table from the missing offload table.

Type struct __tgt_offload_entry {

void    *addr;      // Pointer to the offload entry info.
                    // (function or global)
char    *name;      // Name of the function or global.
size_t  size;       // Size of the entry info (0 if it a function).
int32_t flags;
int32_t reserved;

};

One thing you need to consider is that this introduces a new ABI.
This structure may change over time and we will need to be able to deal with libraries compiled with potentially different version of clang which may use a different format for the entries.
I think we may need some sort of version stamp.
We could use the section name for this purpose and rename it when we change the struct format, but that would be a bit more fragile as it's easier to forget to update the name if/when the struct format changes.
Also, format mismatch would looks like offload section is missing, which would need special handling when we diagnose the problem to distinguish incompatible offload table from the missing offload table.

It's a little tough, I chose that format because it's exactly the same as we use with OpenMP with a few different flags. I wish whoever initially designed the struct made the `reserved` field 64-bits so it could conceivably hold a pointer to some additional information, but that ship has sailed. I originally chose to have this match the OpenMP struct because it will heavily simplify things if every language uses this same method for registering their globals. I would like to change it, but I'm not sure how well it would be received considering backwards compatibility. I'm not sure what the best path forward is on that front.

tra added a comment.May 6 2022, 11:33 AM

It's a little tough, I chose that format because it's exactly the same as we use with OpenMP with a few different flags. I wish whoever initially designed the struct made the `reserved` field 64-bits so it could conceivably hold a pointer to some additional information, but that ship has sailed. I originally chose to have this match the OpenMP struct because it will heavily simplify things if every language uses this same method for registering their globals. I would like to change it, but I'm not sure how well it would be received considering backwards compatibility. I'm not sure what the best path forward is on that front.

If it's exiting format that's already in use, then sticking with it is fine. We'll deal with this if/when we'll need to change it.

clang/lib/CodeGen/CGCUDARuntime.h
58–70

I'm a bit puzzled by this arrangement. Are those actually flags (i.e. can be set independently) or are they enumerating specific offload kinds (i.e. only one of these values is intended to be set)?

I think we want the latter. If that's the case I'd propose to enumerate kernel and data together, so each kind gets a distinct value and is easy to tell when one needs to examine the offload table manually. Right now both kernels and global vars set the flags to 0.

jhuber6 added inline comments.May 6 2022, 11:38 AM
clang/lib/CodeGen/CGCUDARuntime.h
58–70

It probably should just be an enumeration. I was tentatively keeping them somewhat separate because OpenMP uses different values for these flags, but I think keeping this completely compatible is an impossible proposition. If we need them to use the same flag we should be able to configure that at some point. I will change it to just be a standard enum (I don't handle anything but kernels and regular globals in the linker wrapper right now anyway)

jhuber6 updated this revision to Diff 427701.May 6 2022, 11:54 AM

Changing enum values from a bitfield to simple enumeration.

jhuber6 edited the summary of this revision. (Show Details)May 6 2022, 11:59 AM
tra added inline comments.May 6 2022, 12:15 PM
clang/lib/CodeGen/CGCUDARuntime.h
56–70

We can also fold both enums into one, as we still have the ambiguity of what flags=0 means.

jhuber6 added inline comments.May 6 2022, 1:02 PM
clang/lib/CodeGen/CGCUDARuntime.h
56–70

They're selected based on the size, if the size is zero it uses the kernel flags, otherwise it uses the variable flags. That's how it's done for OpenMP. I figured keeping the enums separate makes that more clear.

tra added inline comments.May 6 2022, 2:55 PM
clang/lib/CodeGen/CGCUDARuntime.h
56–70

They're selected based on the size, if the size is zero it uses the kernel flags, otherwise it uses the variable flags.

Why use two different enums, when one would do? It does not buy us anything other than unnecessary additional complexity.

jhuber6 added inline comments.May 6 2022, 3:02 PM
clang/lib/CodeGen/CGCUDARuntime.h
56–70

I mostly copied this from OpenMP, I can merge it into one.

jhuber6 updated this revision to Diff 427766.May 6 2022, 3:03 PM

removing enum

jhuber6 marked 3 inline comments as done.May 6 2022, 3:06 PM
tra added inline comments.May 6 2022, 3:42 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

We're still using the same numeric value for two different kinds of entities.
Considering that it's the third round we're making around this point, I'm starting to suspect that I may be missing something.
Is there a particular reason kernels and global unmanaged variables have to have the same 'kind'?

It's possible that I didn't do a good job explaining my enthusiastic nitpicking here.
My suggestion to have unified enum for all entities we register is based on a principle of separation of responsibilities. If we want to know what kind of entry we're dealing with, checking the 'kind' field should be sufficient. The 'size' field should only indicate the size of the entity. Having to consider both kind and size to determine what you're dealing with just muddies things and should not be done unless there's a good reason for that. E.g. it might be OK if we were short on flag bits.

jhuber6 added inline comments.May 6 2022, 3:50 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

Ah, I see the point you're making now. This is yet another thing that OpenMP did that I just copied. I wouldn't have implemented it this way but I figured it would be simpler to keep them similar. I mostly did it this way because I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals. We could change it for CUDA in the future and I could make that change here if it's valuable. Ideally I would like to rewrite how we do all this registration with the structs but breaking the ABI makes it complicated...

tra added inline comments.May 6 2022, 4:13 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals.

OK. So, there is something that requires this magic. If that's something we must have, then it must be mentioned in the comments around the enum.

Do you know where I should find the code which needs this? I'm curious what's going on there.
I wonder if it just checks for "flags==0" and refuses to deal with unknown flags.
To think of it, we probably want to put the enum into a common header which defines the __tgt_offload_entry.We would not want OpenMP itself to start using the same bits for something else.

jhuber6 added inline comments.May 6 2022, 4:37 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

Sorry, I should be more specific. The OpenMP offloading runtime currently uses a size of zero to indicate a kernel function and the flags have a different meaning if it's a kernel. For OpenMP, 0 is a kernel, 1 and 2 are device ctors / dtors. I'm not sure why they chose this over just another flag but it's the current standard. You can see it used like this here https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/src/omptarget.cpp#L147. I'm not sure if there's a good way to wrangle these together now that I think about it, considering OpenMP already uses 0x1 to represent link OpenMP variables so this already collides. But treating the flags different on the size is at least consistent with what OpenMP does. It makes it a little hard to define one enum for it since we use it two different ways, I'm not a fan of it but it's what the current ABI uses.

tra added inline comments.May 9 2022, 12:57 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

I see.

Using size=0 as the coda/data flag which changes interpretation of the flags sort of makes sense. In that case two different types for the flags field would be appropriate, with an appropriate comment describing that size==0 determines which one is in effect.

jhuber6 added inline comments.May 9 2022, 1:00 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

Personally I'm find with it landing like this, and if we wanted to improve this later it would probably just go in some greater ABI break for offloading entries. There might be a good reason to change them all at once when we start focusing more on complete interoperability of offloading languages.

tra added inline comments.May 9 2022, 1:11 PM
clang/lib/CodeGen/CGCUDARuntime.h
58

I'm fine with that. Just add a comment describing how OffloadGlobalEntry is used for both code and data and that the size is used to distinguish them.

jhuber6 updated this revision to Diff 428186.May 9 2022, 1:14 PM

Updating comments to desceibe usage of flags with zero size.

jhuber6 updated this revision to Diff 428188.May 9 2022, 1:18 PM

Clang format

Is this with D123810 and D123812 good to land? It would be nice to be able to test this upstream.

tra accepted this revision.May 10 2022, 3:47 PM
This revision is now accepted and ready to land.May 10 2022, 3:47 PM
This revision was landed with ongoing or failed builds.May 11 2022, 4:30 AM
This revision was automatically updated to reflect the committed changes.