Page MenuHomePhabricator

[Clang][OpenMP offload] Eliminate use of OpenMP linker script
Needs ReviewPublic

Authored by sdmitriev on Jul 18 2019, 1:23 PM.

Details

Summary

OpenMP linker script is known to cause problems for gold and lld linkers on Linux and it will also cause problems for Windows enabling in future. This is a sufficient justification for eliminating use of OpenMP linker script and replacing it with a portable solution. This patch contains prototype changes which implement such solution.

First a brief explanation how OpenMP linker script is being used in existing implementation. OpenMP linker script is dynamically generated by the clang driver and is added to the host link command to fulfil the following tasks

  1. Insert device binaries into the host binary at link time as data (makes host binary fat)
  2. Creates pair of symbols for the start/end address for each device binary
  3. And creates pair of symbols with start/end addresses around the compiler generated offload entry table

All symbols that are created by the linker script are used by the offload registration code that is added by the compiler to each host object as a comdat group. This compiler generated code consists of a pair of data objects (device binary descriptor) that use those symbols as initializers and two functions. One of those functions registers device binary descriptor at OpenMP runtime at program startup and the other unregisters it. BTW, having offload registration code in each host object is not good because it makes host object dependent on a particular list of targets (device binary descriptor depends on the offload targets).

This patch implements an alternative solution for the above tasks. Device binaries are inserted into the host binary with a help of the wrapper bit-code file which contains device binaries as data as well as the offload registration code for registering device binaries in offload runtime (tasks 1 and 2 in the above list). Wrapper bit-code file is dynamically created by the clang driver with a help of new tool clang-offload-wrapper which takes device binaries as input and produces bit-code file with required contents. Wrapper bit-code is then compiled to an object and resulting object is appended to the host linking by the clang driver.

Start/end symbols around the offload entry table (3 in the list above) are added with a help of a new pair of object files omptargetbegin.o/omptargetend.o that define those symbols in appropriate section. These object files are added before/after user's objects and libraries by the clang driver similar to the other begin/end objects like crti.o/crtn.o or crtbegin.o/crtend.o.

This is a full patch with all required changes. I will split is into smaller pieces for upstreaming.

Diff Detail

Event Timeline

sdmitriev created this revision.Jul 18 2019, 1:23 PM
sdmitriev edited the summary of this revision. (Show Details)Jul 18 2019, 1:29 PM
sdmitriev edited the summary of this revision. (Show Details)
ABataev added inline comments.Mon, Aug 5, 4:17 PM
clang/include/clang/Driver/Action.h
74

Do we really need this new kind of job here, can we use bundler instead?

clang/lib/CodeGen/CGOpenMPRuntime.cpp
9752–9753

Do not emit it for the devices and simd only mode. Also, would be good to assert if no devices triples were specified.

clang/lib/CodeGen/CGOpenMPRuntime.h
1466

Ithink, you can drop virtual here and remove overridden version from the CGOpenMPRuntimeSimd. Instead, just check for OpenMP simd mode in the original function and just early exit in this case.

sdmitriev updated this revision to Diff 213635.Tue, Aug 6, 9:12 AM

Addressed some review comments.

sdmitriev marked 4 inline comments as done.Tue, Aug 6, 9:45 AM
sdmitriev added inline comments.
clang/include/clang/Driver/Action.h
74

Well, I can probably try to reuse bundling action for wrapping, but I think it will just complicate the logic. Wrapping logically differs from bundling and wrapping is done by a different tool, so I think it is natural to add a distinct action class for it.

clang/lib/CodeGen/CGOpenMPRuntime.cpp
9752–9753

Offload entries are actually emitted both for host and target compilations. I have added a check for OpenMP simd mode to createOffloadingBinaryDescriptorRegistration().

clang/lib/CodeGen/CGOpenMPRuntime.h
1466

Ok. Without virtual I do not see much reasons for adding new function which just calls createOffloadEntriesAndInfoMetadata(), so instead I have just made createOffloadEntriesAndInfoMetadata() public and added a check for OpenMP simd mode to this function.

ABataev added inline comments.Tue, Aug 6, 11:36 AM
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
70

Why not ArrayRef?

73–76

Not sure that this is the best solution, we may end up with incorrect size_t type in some cases.

74

Use real type here, not auto

134

Add comment for the true constant with the name of parameter.

137

Same here

144

Comments?

145

Use real type, not auto

149

Seems to me the code is not formatted

197

Use real type, not auto

210

llvm::None instead of {}

217

Remove extra braces here, they are not needed.

223

Extra braces

sdmitriev updated this revision to Diff 213775.Tue, Aug 6, 6:29 PM
sdmitriev marked 2 inline comments as done.
sdmitriev marked 14 inline comments as done.Tue, Aug 6, 6:39 PM
sdmitriev added inline comments.
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
70

Changed to MemoryBufferRef which in some sense is similar to ArrayRef.

73–76

I have slightly revised this code to avoid unexpected results.

149

Right. Fixed.

210

Switched to a different constructor which does not have argument types.

vzakhari added inline comments.
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
227

FYI, llvm.global_dtors does not work on Windows. The symbol will be placed into .CRT$XC[A-Z] portion of .CRT in TargetLoweringObjectFileImpl.cpp, so, basically, tgt_unregister_lib will be called right after tgt_register_lib. We can either use a trick from ASAN, i.e. put llvm.global_dtors into .CRT$XT[A-Z] (I am not sure how solid this solution is) or call atexit() inside tgt_register_lib to register tgt_unregister_lib terminator.

sdmitriev marked 3 inline comments as done.Wed, Aug 7, 11:36 AM
sdmitriev added inline comments.
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
227

It works as expected on Linux, so I guess this is just a bug in lowering code for Windows that need to be fixed.

ABataev added inline comments.Wed, Aug 7, 11:41 AM
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
227

Still, better to call atexit(), this is common solution to call a global destructor/deinitializer

vzakhari added inline comments.Wed, Aug 7, 11:42 AM
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
227

I agree. One other thing: if __tgt_register_lib is never called do we want to call __tgt_unregister_lib? It is possible with global_ctors/global_dtors, but not possible with atexit/cxa_atexit.

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
73
  1. Markt it const.
  2. This still is not the best solution, since size_t not necessarily has the pointer size. I don't know if there is a better solution. @hfinkel? If this is the best, why not just to use getIntPtrType(C)?
136

Use \\\ style for comments here

172

Use ArrayRef instead of const SmallVectorImpl &

331

Use std:string instead of auto

332

Also, better to specify type expplicitly.

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

Is there a bug # ?

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

Is there a bug # ?

@vzakhari?

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

Is there a bug # ?

@vzakhari?

I do not have bug #, but the issue was introduced with the following commit:
commit f803b23879d9e1d9415ec1875713534dcc203df5
Author: Reid Kleckner <rnk@google.com>
Date: Fri Sep 7 23:07:55 2018 +0000

[COFF] Implement llvm.global_ctors priorities for MSVC COFF targets

Summary:
MSVC and LLD sort sections ASCII-betically, so we need to use section
names that sort between .CRT$XCA (the start) and .CRT$XCU (the default
priority).

In the general case, use .CRT$XCT12345 as the section name, and let the
linker sort the zero-padded digits.

Users with low priorities typically want to initialize as early as
possible, so use .CRT$XCA00199 for prioties less than 200. This number
is arbitrary.

Implements PR38552.

Reviewers: majnemer, mstorsjo

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D51820

llvm-svn: 341727

The destructors are still in .CRT$XT for default priority (65535) now, but for non-default priority they will go into .CRT$XC. I will upload a fixing patch with a LIT test shortly.

This clang-offload-wrapper commit will work properly, if we use default priority for the destructors.

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

Is there a bug # ?

@vzakhari?

I do not have bug #, but the issue was introduced with the following commit:
commit f803b23879d9e1d9415ec1875713534dcc203df5
Author: Reid Kleckner <rnk@google.com>
Date: Fri Sep 7 23:07:55 2018 +0000

[COFF] Implement llvm.global_ctors priorities for MSVC COFF targets
 
Summary:
MSVC and LLD sort sections ASCII-betically, so we need to use section
names that sort between .CRT$XCA (the start) and .CRT$XCU (the default
priority).
 
In the general case, use .CRT$XCT12345 as the section name, and let the
linker sort the zero-padded digits.
 
Users with low priorities typically want to initialize as early as
possible, so use .CRT$XCA00199 for prioties less than 200. This number
is arbitrary.
 
Implements PR38552.
 
Reviewers: majnemer, mstorsjo
 
Subscribers: hiraditya, llvm-commits
 
Differential Revision: https://reviews.llvm.org/D51820
 
llvm-svn: 341727

The destructors are still in .CRT$XT for default priority (65535) now, but for non-default priority they will go into .CRT$XC. I will upload a fixing patch with a LIT test shortly.

This clang-offload-wrapper commit will work properly, if we use default priority for the destructors.

'IMHO' if there is a problem with lowering of LLVM IR constructs for some
particular targets, that problem must be resolved instead of adding workarounds.

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

Is there a bug # ?

@vzakhari?

I do not have bug #, but the issue was introduced with the following commit:
commit f803b23879d9e1d9415ec1875713534dcc203df5
Author: Reid Kleckner <rnk@google.com>
Date: Fri Sep 7 23:07:55 2018 +0000

[COFF] Implement llvm.global_ctors priorities for MSVC COFF targets
 
Summary:
MSVC and LLD sort sections ASCII-betically, so we need to use section
names that sort between .CRT$XCA (the start) and .CRT$XCU (the default
priority).
 
In the general case, use .CRT$XCT12345 as the section name, and let the
linker sort the zero-padded digits.
 
Users with low priorities typically want to initialize as early as
possible, so use .CRT$XCA00199 for prioties less than 200. This number
is arbitrary.
 
Implements PR38552.
 
Reviewers: majnemer, mstorsjo
 
Subscribers: hiraditya, llvm-commits
 
Differential Revision: https://reviews.llvm.org/D51820
 
llvm-svn: 341727

The destructors are still in .CRT$XT for default priority (65535) now, but for non-default priority they will go into .CRT$XC. I will upload a fixing patch with a LIT test shortly.

This clang-offload-wrapper commit will work properly, if we use default priority for the destructors.

'IMHO' if there is a problem with lowering of LLVM IR constructs for some
particular targets, that problem must be resolved instead of adding workarounds.

I completely agree with you! I am testing the patch for destructors.

sdmitriev updated this revision to Diff 215728.Fri, Aug 16, 6:05 PM
sdmitriev marked 4 inline comments as done.

Addressed review comments.

sdmitriev marked 2 inline comments as done.Fri, Aug 16, 6:09 PM
sdmitriev added inline comments.
clang/tools/clang-offload-wrapper/ClangOffloadWrapper.cpp
73

It cannot be const because of Type::getIntXXTy(LLVMContext &C) calls.

172

Done. And I have also changed MemoryBufferRef => ArrayRef (as you earlier suggested).

sdmitriev marked an inline comment as done.Fri, Aug 16, 6:11 PM

As I understand ‘atexit’ solution would be target dependent (‘__cxa_atexit’ on Linux and ‘atexit’ on Windows) whereas @llvm.global_ctors/dtors variables offer similar and platform neutral functionality (http://llvm.org/docs/LangRef.html#the-llvm-global-ctors-global-variable). Why do you think that ‘atexit’ is a better choice?

Because it does not work on Windows, better to have portable solution, if possible.

@vzakhari has already committed his fix for llvm.global_dtors (https://reviews.llvm.org/D66373), so I assume use of llvm.global_dtors in this patch would no longer cause problems on Windows.