This is an archive of the discontinued LLVM Phabricator instance.

[Offloading] Initial support for registering offloading entries on COFF targets
Needs ReviewPublic

Authored by jhuber6 on Nov 4 2022, 3:50 PM.

Details

Summary

The new driver registers all the offloading entries by first storing a
structure containing the necessary information into a special section
and then iterating the section at runtime. This is done in ELF targets
using the linker defined __start and __stop sections. However for
COFF targets these are not provided. This is instead done by generating
sections as described here.

This patch adds the initial support required to offloadon COFF targets
by implementing this for the new driver. We use the .<kind>$Ox section
for COFF now.

NOTE: I have not tested the runtime functionality of patch as I do not have a Windows machine set up yet.

Diff Detail

Event Timeline

jhuber6 created this revision.Nov 4 2022, 3:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2022, 3:50 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
jhuber6 requested review of this revision.Nov 4 2022, 3:50 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptNov 4 2022, 3:50 PM

This looks reasonable to me, overall. I didn't quite try to follow all the wall-of-text changes in the tests though, but overall fine. Just a couple comments and one case where I didn't find where code went in the refactoring.

clang/test/CodeGenCUDA/offloading-entries.cu
92

Is this identical to the one above? Should the lines be shared with --check-prefix=COMMON,COFF etc? (The number of lines is rather small here so it's maybe not strictly necessary, but I saw that done in the other testcase.)

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
145

FWIW, this comment doesn't feel entirely accurate: Regardless of the length of the section name, all sections with names of the form name$suffix will get merged into the same section name (sorted by the suffix). Then if name is 8 chars or less, the name is kept in the section table (so that it can easily be looked up at runtime), while if it is longer, the full name is kept in the string table (which is not mapped at runtime).

Also as an extra side note; we added an exception into lld for .eh_frame - this is 9 chars, but libunwind wants to locate the section at runtime. So for that case, lld truncates it to .eh_fram. (This behaviour is lld specific, to appease libunwind - binutils doesn't do that, and libgcc locates that section differently.)

337

I don't quite see where the corresponding GlobalVariable for this case is created after the refactoring?

Thanks for the feedback.

Another significant portion of getting this workflow to work for Windows / COFF is parsing the linker arguments. I should be able to look at lld-link and add necessarily aliases to what ld.lld takes I assume? E.g. we use values like -o and -L in the linker wrapper to set the output and find libraries.

clang/test/CodeGenCUDA/offloading-entries.cu
92

Yeah it might be valid to collapse it further, this test is mostly just copy-pasted directly from the output so we should probably try to keep it common.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
145

I see, I'm not that familiar with the inner workings of the COFF linking process. All that matters for this use-case is whether or not we can get a pointer to the array. In that case we shouldn't need to worry about the eight character limit right?

337

The CUDA / HIP cases did this separately. This patch merged it into a common method getELFEntriesArray. Functionally this just changed the order in the output slightly. The dummy variable is only necessary for the ELF linkers to generate the begin / end section. For COFF we make the $a and $z variables which perform a similar role.

Another significant portion of getting this workflow to work for Windows / COFF is parsing the linker arguments. I should be able to look at lld-link and add necessarily aliases to what ld.lld takes I assume? E.g. we use values like -o and -L in the linker wrapper to set the output and find libraries.

Sorry, I'm not quite up to speed with exactly what is being done linker-wise here - can you give a more detailed overview? Keep in mind that there's two separate interfaces to lld for COFF; when used in mingw mode, it invokes the ld.lld frontend, but with a -m option which directs lld to the mingw frontend, which parses ld.lld style options and rewrites them to lld-link style options and invokes that interface. And when Clang is operating in msvc/clang-cl mode, lld-link is invoked (or called directly by the build system).

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
145

If you locate the contents at runtime by using specific symbols that point to the start and end of the data, then yes, you don't need to worry about keeping it below the 8 char limit.

The 8 char limit is relevant if you enumerate and iterate over the sections of a DLL/EXE at runtime, and try to locate the section dynamically that way.

337

Ah, now I see - this is the second half of what's being merged into the cuda/hip call below in createRegisterGlobalsFunction.

Sorry, I'm not quite up to speed with exactly what is being done linker-wise here - can you give a more detailed overview? Keep in mind that there's two separate interfaces to lld for COFF; when used in mingw mode, it invokes the ld.lld frontend, but with a -m option which directs lld to the mingw frontend, which parses ld.lld style options and rewrites them to lld-link style options and invokes that interface. And when Clang is operating in msvc/clang-cl mode, lld-link is invoked (or called directly by the build system).

Sure, there's a bit of documentation for what's going on here, but I may need to update it a bit.

Basically, for offloading languages (CUDA, HIP, OpenMP, etc) we compile the source code twice, once for the host and once for the target device. We embed the device relocatable object inside the host so we follow a standard compilation pipeline. This linker-wrapper then fishes those relocatable objects out and performs the device-linking phase. The linked output is then put into a global along with some runtime calls to register the image and kernels. That new file gets passed to the wrapped linker job and we get a final executable.

My concern is that the linker wrapper keys off of certain arguments to the linker to do its job since it's invoked something like clang-linker-wrapper <linker-args>. I understand these are fundamentally different for lld-link so I was wondering if this approach in general would work there.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
145

Good to know, I may change the section names to be more verbose then, something like cuda.entries$OE.

rnk resigned from this revision.Nov 18 2022, 12:00 PM

Sorry, I'm not quite up to speed with exactly what is being done linker-wise here - can you give a more detailed overview? Keep in mind that there's two separate interfaces to lld for COFF; when used in mingw mode, it invokes the ld.lld frontend, but with a -m option which directs lld to the mingw frontend, which parses ld.lld style options and rewrites them to lld-link style options and invokes that interface. And when Clang is operating in msvc/clang-cl mode, lld-link is invoked (or called directly by the build system).

Sure, there's a bit of documentation for what's going on here, but I may need to update it a bit.

Basically, for offloading languages (CUDA, HIP, OpenMP, etc) we compile the source code twice, once for the host and once for the target device. We embed the device relocatable object inside the host so we follow a standard compilation pipeline. This linker-wrapper then fishes those relocatable objects out and performs the device-linking phase. The linked output is then put into a global along with some runtime calls to register the image and kernels. That new file gets passed to the wrapped linker job and we get a final executable.

My concern is that the linker wrapper keys off of certain arguments to the linker to do its job since it's invoked something like clang-linker-wrapper <linker-args>. I understand these are fundamentally different for lld-link so I was wondering if this approach in general would work there.

Right, yes, lld-link uses different arguments than ld.lld. (Also note that for mingw targets, the regular ld.lld interface is used on Windows too, while link.exe or lld-link is used for msvc targets.)

Anyway, I don't have much other comments on this - other than what I commented before, this seems mostly reasonable, but I'm not familiar enough with the whole OpenMP pipeline to give any authoritative review really. Feel free to ask specifically about other details relating to PE/COFF though!