Page MenuHomePhabricator

[Clang][OpenMP Offload] Add new tool for wrapping offload device binaries
ClosedPublic

Authored by sdmitriev on Fri, Sep 27, 3:26 PM.

Details

Summary

This patch removes the remaining part of the OpenMP offload linker scripts which was used for inserting device binaries into the output linked binary. Device binaries are now inserted into the host binary with a help of the wrapper bit-code file which contains device binaries as data. Wrapper bit-code file is dynamically created by the clang driver with a help of new tool clang-offload-wrapper which takes device binaries as input and produces bit-code file with required contents. Wrapper bit-code is then compiled to an object and resulting object is appended to the host linking by the clang driver.

This is the second part of the patch for eliminating OpenMP linker script (please see https://reviews.llvm.org/D64943).

Diff Detail

Event Timeline

sdmitriev created this revision.Fri, Sep 27, 3:26 PM
Hahnfeld set the repository for this revision to rG LLVM Github Monorepo.Sat, Sep 28, 3:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSat, Sep 28, 3:04 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
ABataev added inline comments.Wed, Oct 2, 11:54 AM
clang/lib/Driver/Driver.cpp
2286–2290

Could you fix these comments is a separate patch?

This comment was removed by JonChesterfield.
sdmitriev added inline comments.Wed, Oct 2, 1:03 PM
clang/lib/Driver/Driver.cpp
2286–2290

Sure. I will prepare a separate patch for the comment change.

JonChesterfield added a comment.EditedWed, Oct 2, 2:25 PM

I think this patch is a behaviour change. Currently, the target binary is embedded in the host binary at link time. With this change, the contents of the binary are embedded in bitcode which is subsequently fed into the link. If indeed so, that seems strictly better - code in the host that cares about the size of the bitcode now has it available at opt time, instead of at link time. The target specific nastiness objcopy would introduce is neatly sidestepped.

This change takes N binaries (that I think need to be for different triples, or the loop doesn't work) and puts them in separate section-annotated bitcode arrays. Equivalent behaviour would result from calling the tool once per binary and passing the N results onward, e.g. to llvm-link.

The functionality of 'take a binary and embed it in bitcode as a const array' is likely to be useful outside of openmp. I've done similar things in the past in non-portable fashion. Aside from the section and symbol names, I don't think there's anything specific to openmp in the tool.

How would you feel about simplifying the tool to work on one file at a time, with an interface that takes the host target (could default to whatever is running the tool) and a string for section name, which generates some bitcode containing that file as a const array plus start/end symbols derived from the section name? The change would involve deleting the multiple file handling and renaming OffloadTargets to SectionName or similar.

clang-offload-wrapper than becomes binary-to-bitcode-embedder (or better, names are hard), with the intent that projects outside of the openmp target offload compiler could use it.

edit: Or keep the multiple file handling if you prefer, preferably raising an error if there are duplicates in the requested section names

clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
85

I don't think this works for multiple binaries with the same target triple. They'll all be put in the same section and there will be duplicate symbols for start/end.

sdmitriev updated this revision to Diff 222917.Wed, Oct 2, 2:54 PM
sdmitriev marked 2 inline comments as done.

Addressed some comments and rebased patch.

I think this patch is a behaviour change. Currently, the target binary is embedded in the host binary at link time. With this change, the contents of the binary are embedded in bitcode which is subsequently fed into the link. If indeed so, that seems strictly better - code in the host that cares about the size of the bitcode now has it available at opt time, instead of at link time. The target specific nastiness objcopy would introduce is neatly sidestepped.

This change takes N binaries (that I think need to be for different triples, or the loop doesn't work) and puts them in separate section-annotated bitcode arrays. Equivalent behaviour would result from calling the tool once per binary and passing the N results onward, e.g. to llvm-link.

The functionality of 'take a binary and embed it in bitcode as a const array' is likely to be useful outside of openmp. I've done similar things in the past in non-portable fashion. Aside from the section and symbol names, I don't think there's anything specific to openmp in the tool.

How would you feel about simplifying the tool to work on one file at a time, with an interface that takes the host target (could default to whatever is running the tool) and a string for section name, which generates some bitcode containing that file as a const array plus start/end symbols derived from the section name? The change would involve deleting the multiple file handling and renaming OffloadTargets to SectionName or similar.

clang-offload-wrapper than becomes binary-to-bitcode-embedder (or better, names are hard), with the intent that projects outside of the openmp target offload compiler could use it.

edit: Or keep the multiple file handling if you prefer, preferably raising an error if there are duplicates in the requested section names

The tool indeed does not have anything specific to OpenMP at this step, but that will change in the 3rd part of the D64943 where I am planning to move offload registration code generation from clang to the wrapper tool. So it will have OpenMP specifics in future, though I do not see any problems with enabling it for other offloading models. We can always change driver to pass an additional information that represent 'offload kind' to the wrapper tool (can for example be done in a similar way how it is passed to the bundler tool), and wrapper will customize output bit-code depending on the offloading model if there would be a need for that.

Regarding the multiple vs single file handling. Wrapping each device binary independently would still be possible with multi-file wrapping support, but it will just increase startup time without adding any benefits in return (once we move offload registration code to the wrapper). So, I think that for OpenMP it does not make sense to do it (I cannot say anything about the other offloading models though).

Anyway, I suggest so start with something that eliminates OpenMP linker script. We can always customize/improve tools in future once there would be a need for that.

clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
85

Adding the same target triple to the list of OpenMP targets more than once is not supported, so such use case isn't viable:

bash-4.2$ clang -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu,x86_64-pc-linux-gnu test.c
clang-10: warning: The OpenMP offloading target 'x86_64-pc-linux-gnu' is similar to target 'x86_64-pc-linux-gnu' already specified - will be ignored. [-Wopenmp-target]
bash-4.2$

But in any case I am going to remove the code which passes offload target triples to the wrapper tool in the last part of D64943 because they will not be needed for creating wrapper bit-code. As you know start/end symbols are referenced from the offload registration code only, so, moving offload registration code to the wrapper bit-code eliminates the need to create global start/end symbols with predefined names derived from the triple.

The tool indeed does not have anything specific to OpenMP at this step, but that will change...

That makes sense to me, thanks.

I think we're going to have some trouble adapting this to our build as there's already a standalone tool that runs at link time. Overall dropping the linker script is probably worth the integration headache.

clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
85

That's true. It seems a shame that we can embed at most one device binary per architecture into the host, but that's an existing limitation.

sdmitriev updated this revision to Diff 223335.Fri, Oct 4, 4:35 PM

Rebased patch and changed clang-offload-wrapper CMakeLists.txt to use add_clang_tool() rather than add_clang_executable() with a custom install rule.

JonChesterfield accepted this revision.Wed, Oct 9, 7:10 AM

The direction is good and I believe all the feedback from D64943 has already been incorporated. LGTM, thanks.

This revision is now accepted and ready to land.Wed, Oct 9, 7:10 AM
This revision was automatically updated to reflect the committed changes.

Out of interest (or ignorance :) ), why is this a separate binary instead of just part of the normal clang driver? C, C++, Objective-C, and assembly all can do with a single driver, yet the offload stuff now has both clang-offload-wrapper and clang-offload-bundler. Why isn't just clang enough?

Out of interest (or ignorance :) ), why is this a separate binary instead of just part of the normal clang driver? C, C++, Objective-C, and assembly all can do with a single driver, yet the offload stuff now has both clang-offload-wrapper and clang-offload-bundler. Why isn't just clang enough?

Well, theoretically both bunder and wrapper functionality can be implemented directly in the clang driver, and so technically these tools can be eliminated. But it is just a matter of splitting functionality into well-defined logical pieces:))