Page MenuHomePhabricator

[OpenMP][ClangLinkerWrapper] Extending linker wrapper to embed metadata for multi-arch fat binaries
Needs ReviewPublic

Authored by saiislam on Apr 27 2022, 6:07 AM.

Details

Summary

This patch adds "__tgt_image_info" field for each of the images
embedded in a multi-arch image. Required changes in libomptarget
are also shown.

The information in "tgt_image_info" struct is provided in the clang-linker-wrapper
as a call to
tgt_register_image_info for each image in the library
of images also created by the clang-linker-wrapper.
tgt_register_image_info is called for each image BEFORE the single
call to
tgt_register_lib so that image information is available
before they are loaded. clang-linker-wrapper gets this image information
from command line arguments provided by the clang driver when it creates
the call to the clang-linker-wrapper command.
This architecture allows the binary image (pointed to by ImageStart and
ImageEnd in
tgt_device_image) to remain architecture indenendent.
That is, the architecture independent part of the libomptarget runtime
does not need to peer inside the image to determine if it is loadable
even though in most cases the image is an elf object.
There is one tgt_image_info for each tgt_device_image. For backward
compabibility, no changes are allowed to either tgt_device_image or
tgt_bin_desc. The absense of __tgt_image_info is the indication that
the runtime is being used on a binary created by an old version of
the compiler.

Diff Detail

Event Timeline

saiislam created this revision.Apr 27 2022, 6:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2022, 6:07 AM
Herald added a subscriber: guansong. · View Herald Transcript
saiislam requested review of this revision.Apr 27 2022, 6:07 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 27 2022, 6:07 AM
jdoerfert added inline comments.Apr 27 2022, 7:25 AM
openmp/libomptarget/include/omptarget.h
197

I doubt we want/need to number and count them. This seems fragile and unhelpful. "initial allocation" is not expensive, image numbers should be implicit.

jhuber6 added inline comments.Apr 27 2022, 8:06 AM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1100–1103

I'm doing something similar in D123810, I just used the existing DeviceFile because I needed the Arch and Kind fields to dispatch the appropriate wrapping job for CUDA / HIP.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
285

I'm wondering if it would be better to create a new __tgt_bin_desc and call a new __tgt_register_lib with it here so we don't need multiple calls here. Inside that new runtime function we could just widen or shrink the existing structs as needed. That way each device image would have this metadata associated with it and the target plugin can handle it as-needed.

need a test for the generated registration code

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
120–126

I am wondering whether we should add a few more fields to make it more generic for all offloading languages and platforms:

char* target_triple;
char* offloading_kind; // openmp, hip, etc
char* file_type; // elf, spirv, bitcode, etc

need a test for the generated registration code

Yes, I will add tests.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1100–1103

Seems simpler. I will pull that change here.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
120–126

Good idea. Though I am not sure whether file_type info is being propagated in by the linker-wrapper or not. I will check.

285

Last time multiple vendors objected to changing __tgt_bin_desc and __tgt_device_image structs. The reason was backward compatibility of multiple downstream runtimes.

jhuber6 added inline comments.Apr 28 2022, 4:46 AM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
285

If you keep the old __tgt_register_lib function, old applications will use that. Then inside libomptarget you can change __tgt_register_lib function to create a new __tgt_device_image from the old format. It would have some time penalty as we'd need to allocate some new structs, but it might be easier to deal with if each image contained its own metadata for the plugin to use. I was imagining it would look something like

struct __tgt_device_image_v2 {
  void* 	ImageStart;
  void* 	ImageEnd;
  __tgt_offload_entry* 	EntriesBegin;
  __tgt_offload_entry* 	EntriesEnd;
  __tgt_image_into *  ImageInfo;
};

Not sure if this is the best approach, but it should let us keep backwards compatibility and change the structs.

yaxunl added inline comments.Apr 28 2022, 7:52 AM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
120–126

the file_type is for the file embedded by OffloadWrapper, therefore OffloadWrapper knows it.

Currently it is fatbin or cubin for CUDA and code object (elf) for HIP.

In the future, it may be spirv or bitcode to allow JIT compilation of spirv or bitcode in runtime so that one binary for all GPU's. OffloadWrapper will decide what is the final file type to embed based on type of embedded binaries in input files, target triple, and compile arguments.

yaxunl added inline comments.Apr 28 2022, 9:11 AM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
120–126

the file_type is for the file embedded by OffloadWrapper, therefore OffloadWrapper knows it.

Currently it is fatbin or cubin for CUDA and code object (elf) for HIP.

In the future, it may be spirv or bitcode to allow JIT compilation of spirv or bitcode in runtime so that one binary for all GPU's. OffloadWrapper will decide what is the final file type to embed based on type of embedded binaries in input files, target triple, and compile arguments.

Sorry I mean ClangLinkerWrapper. ClangLinkerWrapper should somehow pass the file type info to OffloadWrapper.

jhuber6 added a comment.EditedMay 4 2022, 8:00 AM

I'm suggesting instead we define a new __tgt_device_image and a new __tgt_register_lib function to support this. This new __tgt_device_image will simply contain a pointer to an optional information struct.

struct __tgt_device_image_v2 {
  void* 	ImageStart;
  void* 	ImageEnd;
  __tgt_offload_entry* 	EntriesBegin;
  __tgt_offload_entry* 	EntriesEnd;
  __tgt_image_into*  ImageInfo;
};

This new struct breaks the ABI with the old __tgt_device_image because these are put into an array and we change the size, but we should be able to provide backwards compatibility by copying from the old format to the new format and creating a new array. We can detect the new vs. old ABI by expecting that existing applications will call the __tgt_register_image function. We will create a new __tgt_register_image_v2 function for example that all new programs will call. In libomptarget we then change __tgt_register_image to do the necessary translation.

struct __tgt_bin_desc {                                                                                                                                                                  
  int32_t NumDeviceImages;           // Number of device types supported                                                                                                                   
  __tgt_device_image *DeviceImages;  // Array of device images (1 per dev. type)                                                        
  __tgt_offload_entry *HostEntriesBegin; // Begin of table with all host entries                                                           
  __tgt_offload_entry *HostEntriesEnd;   // End of table (non inclusive)                                                                                                           
};

EXTERN void __tgt_register_lib(__tgt_bin_desc *desc) {
  __tgt_device_image_v2 *new_image = alloc_new_version(desc);
  desc->DeviceImages = new_Image;
  __tgt_register_lib_v2(desc);
}

EXTERN void __tgt_unregister_lib(__tgt_bin_desc *desc) {
  __tgt_unregister_lib_v2(desc);
  dealloc(desc->DeviceImages);
}

Now the rest of libomptarget solely uses the new format, and we check if information is available by seeing that the ImageInfo field is non-null.

I'm suggesting instead we define a new __tgt_device_image and a new __tgt_register_lib function to support this. This new __tgt_device_image will simply contain a pointer to an optional information struct.

struct __tgt_device_image_v2 {
  void* 	ImageStart;
  void* 	ImageEnd;
  __tgt_offload_entry* 	EntriesBegin;
  __tgt_offload_entry* 	EntriesEnd;
  __tgt_image_into*  ImageInfo;
};

This new struct breaks the ABI with the old __tgt_device_image because these are put into an array and we change the size, but we should be able to provide backwards compatibility by copying from the old format to the new format and creating a new array. We can detect the new vs. old ABI by expecting that existing applications will call the __tgt_register_image function. We will create a new __tgt_register_image_v2 function for example that all new programs will call. In libomptarget we then change __tgt_register_image to do the necessary translation.

struct __tgt_bin_desc {                                                                                                                                                                  
  int32_t NumDeviceImages;           // Number of device types supported                                                                                                                   
  __tgt_device_image *DeviceImages;  // Array of device images (1 per dev. type)                                                        
  __tgt_offload_entry *HostEntriesBegin; // Begin of table with all host entries                                                           
  __tgt_offload_entry *HostEntriesEnd;   // End of table (non inclusive)                                                                                                           
};

EXTERN void __tgt_register_lib(__tgt_bin_desc *desc) {
  __tgt_device_image_v2 *new_image = alloc_new_version(desc);
  desc->DeviceImages = new_Image;
  __tgt_register_lib_v2(desc);
}

EXTERN void __tgt_unregister_lib(__tgt_bin_desc *desc) {
  __tgt_unregister_lib_v2(desc);
  dealloc(desc->DeviceImages);
}

Now the rest of libomptarget solely uses the new format, and we check if information is available by seeing that the ImageInfo field is non-null.

Thanks for the input. I am going to try it.

saiislam updated this revision to Diff 431975.May 25 2022, 7:06 AM

Changed the embedding scheme to add ImageInfo field in __tgt_device_image.

Where is the code to change the target registration? We need a new initialization runtime call to use and change the old one to reallocate the __tgt_device_image array. Did you work around that some other way?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
613–614

Does this just pass the architecture to the AMD link? Is this related? If not move it to a separate patch and I'll review it.

1069

Is this necessary?

1152

This should be a StringRef, the data is stable. When you create the struct just use this and it will add the null terminator for you.

Constant *ConstantStr = ConstantDataArray::getString(M.getContext(), Str);
1164

An empty StringRef should already create a null string when you create a ConstantDataArray, should be fine to just push it back.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
62–63

The struct should be contained within the image itself so we can query it during registration, why do we need to know the number?

64–65
187–190

You should be able to pass the name to the constructor, if it's internal LLVM will give you a new name.

209–243

I would just assert that the metadata vector's size matches the number of images, we're always going to generate these now so it makes sense to generate them in pairs. Then just use llvm::zip or just an iterator to go through both of these at the same time.

210–222

Why do we need all these strings? Should just be generating a constant initializer.

224–227

Why does this need a custom section? We should just use it like this, not sure why we need these to be anything but some internal struct.

for (auto *Image : BinDesc->Images) {
  if (Image->Info)
   // Use Info
}
235–236

Why do we need this when we should already know which image we're looking at?

openmp/libomptarget/include/omptarget.h
122

This comment is extremely long. It should be sufficient to just say what it does, 3 lines max. Specify that it may be null in the device image struct.

145–146

I still feel like we don't need these, correct me if I'm wrong.

161
173–180

Shouldn't the plugin handle this? We should be able to get the architecture in the plugin and select the info struct that matches it, if it exists. Then if the plugin finds one that matches we indicate that it's compatible like we do now. Correct me if I'm wrong.

openmp/libomptarget/src/rtl.cpp
443

Shouldn't we be able to have the plugin handle its own architecture detection, then return from R.is_valid_binary? That makes more sense to me than introducing a new tool to libomptarget, since we can already query the architecture from the CUDA runtime for example.

saiislam updated this revision to Diff 432352.Thu, May 26, 12:08 PM
saiislam marked 8 inline comments as done.

Addressed some simple review changes. Will update remaining in the next iteration.

Thanks for the detailed review. I will update rest of the patch soon.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
613–614

Yes, it passes the architecture to AMD link.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
224–227

Custom section is required so that it survives binary stripping done by various OSes. Offload-arch tool will look into this section of the binary to print list of supported architectures.

jhuber6 added inline comments.Thu, May 26, 1:47 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
224–227

I see, I was also planning on making a tool that will extract these similar to cuobjdump, we could merge it into that as well.

Also I probably should've discussed this earlier, but another potential solution is to use the binary format that we use to embed the object files for the images as well. This is more similar to how CUDA does it. Making it backwards compatible would probably require just assuming it's an image with no information if it doesn't include the magic bytes. This might be better if we want a unified tool to give information on embedded device formats.

saiislam updated this revision to Diff 433401.Wed, Jun 1, 7:47 AM

Added the multi-entry logic in libomptarget. Yet to move the image compatibility testing to plugin.