This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain
Needs Review · Public

Authored by gtbercea on May 25 2018, 2:33 PM.

Details

Summary

So far, the clang-offload-bundler has been the default tool for bundling together the various file types produced by the different OpenMP offloading toolchains supported by Clang. It does a great job for file types such as .bc, .ll, .ii, and .ast. It is also used for bundling object files, which are a special case: here the object files contain sections meant to be executed on devices other than the host (as is the case for the OpenMP NVPTX toolchain). The bundling of object files prevents:

  • STATIC LINKING: These bundled object files can be part of static libraries, which means that the object file requires an unbundling step. If an object file in a static library requires "unbundling", then we need to know the whereabouts of that library and of the files it contains before the actual link step, which makes it impossible to do static linking using the "-L/path/to/lib/folder -labc" flags (see the sketch after this list).
  • INTEROPERABILITY WITH OTHER COMPILERS: These bundled object files can end up being passed between Clang and other compilers, which may lead to incompatibilities: passing a bundled file from Clang to another compiler leaves that compiler unable to unbundle it, and passing an unbundled object file to Clang leaves Clang unaware that it does not need to be unbundled.
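To make the static-linking limitation concrete, here is a hypothetical build that fails with today's bundling scheme; the file and library names are made up for illustration:

  # Each object produced today is a clang-offload-bundler bundle.
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c abc.c -o abc.o
  ar rcs libabc.a abc.o

  # At link time the driver only sees -L. -labc; it cannot unbundle the
  # archive members before the device link, so the device code inside
  # abc.o never reaches NVLINK.
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.c -L. -labc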

Goal:
Disable the use of the clang-offload-bundler for bundling/unbundling object files which contain OpenMP NVPTX device offloaded code. This applies to the case where the following set of flags is passed to Clang:
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
When the above condition is not met the compiler works as it does today by invoking the clang-offload-bundler for bundling/unbundling object files (at the cost of static linking and interoperability).
The clang-offload-bundler usage on files other than object files is not affected by this patch.

Extensibility
Although this patch disables bundling/unbundling of object files via the clang-offload-bundler for the OpenMP NVPTX device offloading toolchain ONLY, this functionality can be extended to other platforms/systems where:

  • the device toolchain can produce a host-compatible object AND
  • partial linking of host objects is supported.

Current situation in trunk
In the current trunk the OpenMP device offloading toolchain performs the following steps depending on the input. Note: the clang-offload-bundler calls are part of the host toolchain but are shown here for clarity.

  • SCENARIO 1 (-c -o)
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.cpp -c -o input.o
RELEVANT COMPILATION STEPS: PTXAS --------[.cubin]-------> clang-offload-bundler --bundle
  • SCENARIO 2 (input object file):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.o 
RELEVANT  COMPILATION STEPS: clang-offload-bundler --unbundle --------[.cubin]-------> NVLINK
  • SCENARIO 3 (static linking):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -L/path/to/lib -lstatic input.cpp
RELEVANT  COMPILATION STEPS: PTXAS --------[.cubin]-------> NVLINK [STATIC LINKING FLAGS ARE IGNORED]
  • SCENARIO 4 (C/C++ compilation):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.cpp
RELEVANT  COMPILATION STEPS: PTXAS --------[.cubin]-------> NVLINK

In the current trunk, the object on which the device toolchain operates is always a pure device object, i.e. a cubin. This only works as long as NVLINK can operate on the cubins directly; when these cubins are part of a static library or are bundled, NVLINK no longer detects them.

The solution:
The solution to this problem involves several changes:

A. Make the device object file detectable by NVLINK in all situations (even when it is part of a static library).
To do this we need to add two steps to the OpenMP NVPTX device offloading toolchain for the case where an object file is created:

SCENARIO 1 changes to:

PTXAS --------[.cubin AND .s]-------> FATBINARY --------[.c]-------> CLANG++ --------[.o]-------> clang-offload-bundler --bundle
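Roughly, the added device-side steps correspond to invocations of the following shape; the GPU architecture, file names, and exact flags are illustrative rather than the literal job lines emitted by the driver:

  # 1. Assemble the PTX into a relocatable cubin (as before).
  ptxas -c --gpu-name sm_35 -o input.cubin input.s
  # 2. Wrap the cubin and the PTX into a fatbin embedded in a small C file.
  fatbinary -64 --create input.fatbin \
      --image=profile=sm_35,file=input.cubin \
      --image=profile=compute_35,file=input.s \
      --embedded-fatbin=input.fatbin.c
  # 3. Compile the wrapper into a host-compatible object.
  clang++ -c input.fatbin.c -o input-dev.o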

B. Since the new object we create in SCENARIO 1 is a host object, we no longer need a custom "bundling" scheme. [FIXES INTEROPERABILITY]

SCENARIO 1 changes to:

PTXAS --------[.cubin AND .s]-------> FATBINARY --------[.c]-------> CLANG++ --------[.o]-------> ld -r

!!!IMPORTANT!!!: ld -r is a host step shown here for completeness; it replaces the clang-offload-bundler call.
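A minimal sketch of that replacement step, assuming (hypothetically) that the host object is input-host.o and the device wrapper object produced by the steps above is input-dev.o:

  # Partially link the host object and the device wrapper object into the
  # single input.o requested by the user with -c -o input.o.
  ld -r input-host.o input-dev.o -o input.o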

C. With changes A & B we don't need to perform unbundling because the object files can now be passed directly to the OpenMP NVPTX device offloading toolchain. NVLINK detects the device part in each file automatically (no need for special unbundling step here).

SCENARIO 2 changes to:

--------[input.o]-------> NVLINK

D. Enable static linking by passing the input flags directly to the existing NVLINK step. NVLINK can now detect device objects even when they are packed in a static library (since they were created using FATBINARY + CLANG++). [FIXES STATIC LINKING]

SCENARIO 3 changes to:

PTXAS --------[.cubin]-------> NVLINK -L/path/to/lib -lstatic
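For illustration, the static-library case then reduces to something of the following shape; the library name, output name, and GPU architecture are again illustrative:

  # libstatic.a now contains objects whose device parts NVLINK can find.
  ar rcs libstatic.a abc.o
  # The user's -L/-l flags are forwarded to the existing NVLINK invocation.
  nvlink --arch=sm_35 -L/path/to/lib -lstatic input.cubin -o openmp-device-link.out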

Note that SCENARIO 4 remains unchanged.

This patch implements changes A, B, C and D in one go.

Diff Detail

Event Timeline

gtbercea created this revision.May 25 2018, 2:33 PM
gtbercea updated this revision to Diff 148677.May 25 2018, 3:04 PM

@gtbercea, thanks for the patch.

INTEROPERABILITY WITH OTHER COMPILERS: These bundled object files can end up being passed between Clang and other compilers, which may lead to incompatibilities: passing a bundled file from Clang to another compiler leaves that compiler unable to unbundle it, and passing an unbundled object file to Clang leaves Clang unaware that it does not need to be unbundled.

I don't understand the interoperability issue here. The Clang convention for creating the offloading records in the host binary is specific to clang and was discussed, defined, and documented. We do not expect other compilers to understand it, just as we don't expect clang to understand the conventions of other compilers. In terms of the binary itself, it is an ELF file that can be understood by other compilers/tools, albeit the linking would probably fail if the device section addresses are not defined.

I think that what you mean here is that by forcing nvcc as the host linker when one of the targets is an NVPTX target, you shift the (un)bundling part to nvlink, and you will be compatible with the host compilers supported by nvcc.

I understand that at the moment there is no immediate need for combining multiple targets, and there is an immediate need to support archives with offloading code. Therefore, changing the host linker depending on the offload target seems reasonable, as that is what would help most users.

I just want to note that the separation of the toolchains in the driver and the support for multiple toolchains in the same binary were part of the reason we converged to the current design. The idea was to have a generic driver with all the details relative to bundling formats handled by a separate tool, the bundler. Of course the requirements can be reviewed at any time and priorities can change. However, I think it would be cleaner to have the nvlink compatible binary generated by the bundler in the current design. Just my two cents.

sfantao added inline comments.May 29 2018, 8:36 AM
include/clang/Driver/Compilation.h
125

Use CamelCase for class local variables.

314

Why is this a property of the compilation and not of a set of actions referring to a given target? That would allow one to combine in the same compilation targets requiring the bundler and targets that wouldn't.

lib/Driver/Compilation.cpp
287

Given the logic you have below, you are assuming this is never set to false. It would be wise to add an assertion here in case you end up with some toolchains skipping and others not. If that is simply not supported, a diagnostic should be added instead.

The convention is that local variables use CamelCase.

lib/Driver/Driver.cpp
3213

CamelCase

3229

Can you just implement this check in the definition of Compilation::canSkipClangOffloadBundler and get rid of setSkipOffloadBundler? All the required information is already in Compilation under C.getInputArgs().

3248

In the current implementation there is no distinction between what is meant for Windows and what is meant for Linux. This check would only work on Linux, and the test below would fail for bots running Windows.

Also, I think it makes more sense to have this check part of the Toolchain instead of doing it in the driver. The Toolchain definition knows the names of the third-party executables, the driver doesn't.

lib/Driver/ToolChains/Clang.cpp
6109

I believe this check should be done when the toolchain is created with all the required diagnostics. What happens if the linker does not support partial linking?

lib/Driver/ToolChains/Cuda.cpp
551

What prevents all of this from being done in the bundler? If I understand it correctly, if the bundler implements this wrapping, all the checks for libraries wouldn't be required, and only two changes would be needed in the driver:

  • generate a fatbin instead of a cubin. This is straightforward to do by changing the device assembling job. In terms of loading the kernels through the device API, fatbin and cubin should be equivalent, except that fatbin also stores the PTX, enabling JIT compilation for newer GPUs.
  • Use the NVIDIA linker as the host linker.

This last requirement could be problematic if we get two targets attempting to use different (incompatible) linkers. If we get this kind of incompatibility, we should emit an appropriate diagnostic.

test/Driver/openmp-offload.c
497

We need a test for the static linking. The host linker has to be nvcc in that case, right?

Just to clarify one thing in my last comment:

When I say that we didn't aim at having clang compatible with other compilers, I mean the OpenMP offloading descriptors, where all the variables and offloading entry points are. Of course we want to allow the resulting binaries to be compatible with linkers taking inputs of other compilers, so that you can have, e.g., OpenMP and CUDA supported in the same executable, even though working independently.

gtbercea added inline comments.May 29 2018, 9:01 AM
test/Driver/openmp-offload.c
497

The host linker is "ld". The "bundling" step is replaced (in the case of OpenMP NVPTX device offloading only) by a call to "ld -r" to partially link the two object files: the object file produced by the HOST toolchain and the object file produced by the OpenMP NVPTX device offloading toolchain (because we want to produce a single output).

gtbercea added inline comments.May 29 2018, 9:02 AM
test/Driver/openmp-offload.c
497

nvcc is not called at all in this patch.

I am not quite familiar with the Clang driver setup, so I will add Greg for more comments. But I have hacked on the latest YKT tree to support a simple AMDGCN path the same way as NVPTX. The latest patch is here:

https://github.com/ROCm-Developer-Tools/hcc2-clang/commit/8c1cce0d39717c9e40ea70aea91e280673de756e

It is not upstreamed, but we can compile the same binary for both nvptx and amdgcn cards, as we designed.

If I understand this correctly, we are now switching to nvlink as the default host linker whenever nvptx is involved. I am concerned that this may cause trouble for integration with other platforms. Maybe this path should be under a special option even for nvptx?

gtbercea added inline comments.May 29 2018, 11:08 AM
lib/Driver/ToolChains/Cuda.cpp
551

What prevents it is the fact that the bundler is called AFTER the HOST and DEVICE object files have been produced. The creation of the fatbin (FATBINARY + CLANG++) needs to happen within the NVPTX toolchain.

Just to clarify one thing in my last comment:

When I say that we didn't aim at having clang compatible with other compilers, I mean the OpenMP offloading descriptors, where all the variables and offloading entry points are. Of course we want to allow the resulting binaries to be compatible with linkers taking inputs of other compilers, so that you can have, e.g., OpenMP and CUDA supported in the same executable, even though working independently.

Today you will have trouble linking against a Clang object file in another compiler that doesn't know anything about the clang-offload-bundler.

tra added a comment.May 29 2018, 11:37 AM

"Interoperability with other compilers" is probably a statement that's a bit too strong. At best it's kind of compatible with CUDA tools and I don't think it's feasible for other compilers. I.e. it will be useless for AMD GPUs and whatever compiler they use.

In general it sounds like you're going back to what regular CUDA compilation pipeline does:

  • [clang] C++->.ptx
  • [ptxas] .ptx -> .cubin
  • [fatbin] .cubin -> .fatbin
  • [clang] C++ + .fatbin -> host .o
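For reference, those are the same sub-steps that a single clang CUDA compilation drives internally; a minimal sketch (file name and GPU architecture are illustrative):

  # One driver invocation runs the device-side ptxas/fatbinary steps and then
  # embeds the resulting fatbin into the host object (via -fcuda-include-gpubinary
  # at the cc1 level).
  clang++ -x cuda --cuda-gpu-arch=sm_35 -c axpy.cu -o axpy.o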

On one hand I can see how being able to treat GPU-side binaries as any other host files is convenient. On the other hand, this convenience comes with the price of targeting only NVPTX. This seems contrary to OpenMP's goal of supporting many different kinds of accelerators. I'm not sure what's the consensus in the OpenMP community these days, but I vaguely recall that generic bundling/unbundling was explicitly chosen over vendor-specific encapsulation in host .o when the bundling was implemented. If the underlying reasons have changed since then it would be great to hear more details about that.

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

In D47394#1115086, @tra wrote:

On one hand I can see how being able to treat GPU-side binaries as any other host files is convenient. On the other hand, this convenience comes with the price of targeting only NVPTX. This seems contrary to OpenMP's goal of supporting many different kinds of accelerators. I'm not sure what's the consensus in the OpenMP community these days, but I vaguely recall that generic bundling/unbundling was explicitly chosen over vendor-specific encapsulation in host .o when the bundling was implemented. If the underlying reasons have changed since then it would be great to hear more details about that.

I second this statement: static linking might come in handy for all targets, and Clang should try to avoid vendor-specific solutions as much as possible.

In a discussion off-list I proposed adding constructor functions to all object files and handle them like shared libraries are already handled today (ie register separately and let the runtime figure out how to relocate symbols in different translation units). I don't have an implementation of that approach so I can't claim that it works and doesn't have a huge performance impact (which we don't want either), but it should be agnostic of the offloading target so it may be worth investigating.

In a discussion off-list I proposed adding constructor functions to all object files and handle them like shared libraries are already handled today (ie register separately and let the runtime figure out how to relocate symbols in different translation units). I don't have an implementation of that approach so I can't claim that it works and doesn't have a huge performance impact (which we don't want either), but it should be agnostic of the offloading target so it may be worth investigating.

I don't understand how this would work. Doing something like that would require reimplementing the GPU-code linker, which requires knowing proprietary information about the GPU binary format. I wouldn't know how to resolve all the relocations in the device code. In my view, the solution would only work (or at least be more easily implemented) if we don't have relocatable code.

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

Cool, I agree this is worth investigating.

lib/Driver/ToolChains/Cuda.cpp
551

Why does it have to happen in the NVPTX toolchain? You are making the NVPTX toolchain generate an ELF object from another toolchain, right? What I'm suggesting is to do the stuff that mixes two (or more) toolchains in the bundler. Your inputs are still a fatbin and a host file.

test/Driver/openmp-offload.c
497

Ok, so how do you link device code? I.e., if you have two compilation units that depend on each other (some definition in one unit is used in the other), where are they linked together? Something has to understand the two files resulting from your "ld -r" step; my understanding is that that something is nvcc, which calls nvlink behind the scenes, right? So, nvcc will do the unbundling+linking bit, right?

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

I tried this example (https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/). It worked with NVCC but not with clang++. I can produce the main.o particle.o and v.o objects as relocatable (-fcuda-rdc) but the final step fails with a missing reference error.
This leads me to believe that embedding the CUDA fatbin code in the host object comes with limitations. If I were to change the OpenMP NVPTX toolchain to do the same then I would run into similar problems.
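For context, a hypothetical reconstruction of that experiment follows; these are not necessarily the exact commands that were used, and the GPU architecture is illustrative:

  # nvcc: relocatable device code, with an implicit device-link step -- works.
  nvcc -arch=sm_35 -rdc=true main.cu particle.cu v.cu -o app

  # clang: relocatable device objects can be produced with -fcuda-rdc, but no
  # device-link step is performed, so the final host link fails with an
  # undefined __cudaRegisterLinkedBinary_* reference.
  clang++ --cuda-gpu-arch=sm_35 -fcuda-rdc -c main.cu particle.cu v.cu
  clang++ main.o particle.o v.o -o app -L/usr/local/cuda/lib64 -lcudart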

On the other hand, when the example is ported to use OpenMP declare target regions (instead of device), it all compiles, links, and runs correctly.

In general, I feel that if we go the way you propose then the solution is truly confined to NVPTX. If we instead implement a scheme like the one in this patch then we give other toolchains a chance to perhaps fill the nvlink "gap" and eventually be able to handle offloading in a similar manner and support static linking.

tra added a comment.May 31 2018, 3:23 PM

I tried this example (https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/). It worked with NVCC but not with clang++. I can produce the main.o particle.o and v.o objects as relocatable (-fcuda-rdc) but the final step fails with a missing reference error.

It's not clear what exactly you mean by the "final step" and what exactly was the error. Could you give me more details?

This leads me to believe that embedding the CUDA fatbin code in the host object comes with limitations. If I were to change the OpenMP NVPTX toolchain to do the same then I would run into similar problems.

It's a two-part problem.

In the end, we need to place GPU-side binary (whether it's an object or an executable) in a way that CUDA tools can recognize. You should end up with pretty much the same set of bits. If clang currently does not do that well enough, we should fix it.

The second part is what we do about the GPU-side object files. NVCC has some under-the-hood magic that invokes nvlink. If we invoke clang for the final linking phase, it has no idea that some of the .o files may have GPU code in them that needs extra steps before we can pass everything to the linker to produce the host executable. IMO the linking of GPU-side objects should be done outside of clang, i.e. one could do it with an extra build rule which would invoke nvcc --device-link ... to link all GPU-side objects into a GPU executable, still wrapped in a host .o, which can then be linked into the host executable.
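Such a build rule might look roughly like this; the object and output names are hypothetical:

  # Device-link the GPU code carried by a.o and b.o; the result is a host .o
  # wrapping the linked GPU executable.
  nvcc --device-link a.o b.o -o gpu-linked.o
  # The host linker then consumes the original objects plus the device-link stub.
  clang++ a.o b.o gpu-linked.o -o app -L/usr/local/cuda/lib64 -lcudart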

On the other hand, when the example is ported to use OpenMP declare target regions (instead of device), it all compiles, links, and runs correctly.

In general, I feel that if we go the way you propose then the solution is truly confined to NVPTX. If we instead implement a scheme like the one in this patch then we give other toolchains a chance to perhaps fill the nvlink "gap" and eventually be able to handle offloading in a similar manner and support static linking.

I'm not sure how "fatbin + clang -fcuda-include-gpubinary" is any more confining to NVPTX than "fatbin + clang + ld -r" -- either way you rely on an NVIDIA-specific tool. If at some point you find it too confining, changing either of those will require pretty much the same amount of work.

gtbercea added a comment.EditedMay 31 2018, 5:17 PM

The error is related to lack of device linking, just like you explained two paragraphs down. This is the error I get:

main.o: In function `__cuda_module_ctor':
main.cu:(.text+0x674): undefined reference to `__cudaRegisterLinkedBinary__nv_c5b75865'

You nailed the problem on the head: the device linking step is the tricky bit.

The OpenMP NVPTX device offloading toolchain has the advantage that it already calls NVLINK (upstreamed a long time ago). This patch doesn't change that. This patch "fixes" (for lack of a better word) the way in which objects are created on the device side. By adding the FATBINARY + CLANG++ steps, I ensure that the existing call to NVLINK will be able to "detect" the device part of object files and archived object files (static libraries). This is not the case in today's compiler, where NVLINK is not able to do so when passed a static library.

In general, for offloading toolchains, I don't see the reliance on vendor-specific tools as a problem if and only if the calls to vendor-specific tools remain confined to a device-specific toolchain. This patch respects this condition. All the calls to CUDA tools in this patch are part of the OpenMP NVPTX device offloading toolchain (which is an NVPTX device-specific toolchain).

The only host-side change is the call to "ld -r", which replaces a call to the clang-offload-bundler tool.

The error is related to lack of device linking, just like you explained two paragraphs down. This is the error I get:

main.o: In function `__cuda_module_ctor':
main.cu:(.text+0x674): undefined reference to `__cudaRegisterLinkedBinary__nv_c5b75865'

That's because I didn't implement linking for relocatable device code in CUDA; you have to use nvcc for that. Please see https://clang.llvm.org/docs/ReleaseNotes.html#cuda-support-in-clang and the original patch D42922.

The OpenMP NVPTX device offloading toolchain has the advantage that it already calls NVLINK (upstreamed a long time ago). This patch doesn't change that. This patch "fixes" (for lack of a better word) the way in which objects are created on the device side. By adding the FATBINARY + CLANG++ steps, I ensure that the existing call to NVLINK will be able to "detect" the device part of object files and archived object files (static libraries). This is not the case in today's compiler, where NVLINK is not able to do so when passed a static library.

In general, for offloading toolchains, I don't see the reliance on vendor-specific tools as a problem if and only if the calls to vendor-specific tools remain confined to a device-specific toolchain. This patch respects this condition. All the calls to CUDA tools in this patch are part of the OpenMP NVPTX device offloading toolchain (which is an NVPTX device-specific toolchain).

I disagree in this context because this patch currently means that static archives will only work with NVPTX and there is no clear path how to "fix" things for other offloading targets. I'll try to work on my proposal over the next few days (sorry, very busy week...), maybe I can put together a prototype of my idea.

kkwli0 added a subscriber: kkwli0.Jun 1 2018, 4:56 AM

I disagree in this context because this patch currently means that static archives will only work with NVPTX and there is no clear path how to "fix" things for other offloading targets. I'll try to work on my proposal over the next few days (sorry, very busy week...), maybe I can put together a prototype of my idea.

Other toolchains can also have static linking if they:

  1. ditch the clang-offload-bundler for generating/consuming object files.
  2. implement a link step in the device toolchain which can identify the vendor-specific object file inside the host object file (this is how the so-called "bundling" should have been done in the first place, not using a custom tool which limits the functionality of the compiler). Identifying toolchain-specific objects/binaries is a task that belongs within the device-specific toolchain and SHOULD NOT be factored out, because you can't treat objects that are different by definition in the same way. Ignoring their differences leads to those objects not being linkable. On top of that, factoring it out introduces custom object formats which only CLANG understands AND it introduces compilation steps that impede static linking.

I'm surprised you now disagree with this technique, when I first introduced you to this in an e-mail off list you agreed with it.

So in this patch, the only new CUDA tool that is called is FATBINARY, which is invoked on the device-specific side of the toolchain, so you can't possibly object to that. The CUDA toolchain already calls FATBINARY, so it's not a novelty. That step is essential to making device-side objects identifiable by NVLINK (which we already call).

The only step you might object to is the partial linking step which, as I explained in my original post, I envisage will be improved over time as more toolchains support this scheme. I think this is a true solution to the problem. What you are proposing is a workaround that doesn't tackle the problem head-on.

I'm surprised you now disagree with this technique, when I first introduced you to this in an e-mail off list you agreed with it.

My words were that I agree this is the best solution for NVPTX. In the same reply I asked how your proposal is supposed to work for other offloading targets, which is now clear to require additional work, maybe even completely novel tools.
So now I disagree that it is the right solution for Clang because I think my proposal will cover all offloading targets. Please give me a bit of time so that I can see if it works.

guraypp added a subscriber: guraypp.Jun 1 2018, 7:27 AM

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

You are absolutely right, thanks for the link. Maybe we should also document somewhere that we don't support that either for OpenMP offloading to NVPTX?

I think this basically renders my approach useless as I meant to compile each device object file for offloading targets directly to a shared library. Those could have been put together at runtime by just loading (and registering) them in the right order. That way we would have been able to keep clang-offload-bundler in its current target-agnostic form and wouldn't have needed to appease proprietary tools such as nvlink.

With that knowledge I see no other way than what this patch proposes. (I still don't particularly like it because it requires each toolchain to implement their own magic.) Sorry for the delay and my disagreement based on wrong assumptions that I wasn't able to verify as soon as I'd have liked to.

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

You are absolutely right, thanks for the link. Maybe we should also document somewhere that we don't support that either for OpenMP offloading to NVPTX?

I think this basically renders my approach useless as I meant to compile each device object file for offloading targets directly to a shared library. Those could have been put together at runtime by just loading (and registering) them in the right order. That way we would have been able to keep clang-offload-bundler in its current target-agnostic form and wouldn't have needed to appease proprietary tools such as nvlink.

With that knowledge I see no other way than what this patch proposes. (I still don't particularly like it because it requires each toolchain to implement their own magic.) Sorry for the delay and my disagreement based on wrong assumptions that I wasn't able to verify as soon as I'd have liked to.

No problem at all.

I will update the description of the patch with more information. I had some very useful e-mail exchanges with @sfantao which I will try to work into the description of the patch.

gtbercea edited the summary of this revision. (Show Details)Jun 4 2018, 10:02 AM
gtbercea edited the summary of this revision. (Show Details)Jun 4 2018, 10:05 AM
gtbercea edited the summary of this revision. (Show Details)
tra added a comment.Jun 5 2018, 3:48 PM

With the updated patch description + the discussion I'm OK with the approach from the general "how do we compile/use CUDA" point of view. I'll leave the question of whether the approach works for OpenMP to someone more familiar with it.

While I'm not completely convinced that [fatbin]->.c->[clang]->.o (with device code only)->[ld -r] -> host.o (host+device code) is ideal (things could be done with a smaller number of tool invocations), it should help to deal with -rdc compilation until we get a chance to improve support for it in Clang. We may revisit and change this portion of the pipeline when clang can incorporate -rdc GPU binaries in a way compatible with CUDA tools.

tra removed a reviewer: tra.Jun 5 2018, 3:48 PM
tra added a subscriber: tra.
In D47394#1123044, @tra wrote:

While I'm not completely convinced that [fatbin]->.c->[clang]->.o (with device code only)->[ld -r] -> host.o (host+device code) is ideal (things could be done with a smaller number of tool invocations), it should help to deal with -rdc compilation until we get a chance to improve support for it in Clang. We may revisit and change this portion of the pipeline when clang can incorporate -rdc GPU binaries in a way compatible with CUDA tools.

I think this should work with current trunk; Clang puts the GPU binary into a section called __nv_relfatbin when also passing -fcuda-rdc (see D42922).
What will probably result in problems are the registration functions as shown above by @gtbercea (undefined references...). But as we don't need them for OpenMP (we have our own registration machinery) it might be worth implementing something like -fno-cuda-registration. Maybe then clang -cc1 <host> -fcuda-include-gpubinary <device> -fcuda-rdc -fno-cuda-registration can be used to embed the device object, replacing the dance ending in ld -r?

@tra Thank you for your comments and help with the patch.

yaxunl added a subscriber: yaxunl.Jun 11 2018, 11:58 AM
gtbercea marked 3 inline comments as done.Jun 11 2018, 12:52 PM
gtbercea added inline comments.
include/clang/Driver/Compilation.h
314

This was a way to pass this information to the OpenMP NVPTX device toolchain.

Both the Driver and the OpenMP NVPTX toolchain need to agree on the usage of the new scheme (proposed in this patch) or the old scheme (the one that is in the compiler today).

lib/Driver/Compilation.cpp
287

The checks I added in the Driver will set this flag to true if all toolchains Clang offloads to support skipping the bundler/unbundler for object files. Currently only the NVPTX toolchain can skip the bundler/unbundler for object files, so the code path in this patch will be taken only for:

-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

lib/Driver/Driver.cpp
3229

The driver needs to have the result of this check available. The flag is passed to the step which adds host-device dependencies. If the bundler can be skipped then the unbundling action is not required.

I guess this could be implemented in Compilation. Even so, I would like it to happen only once, as it does here, and not every time someone queries the "can I skip the bundler" flag.

I wanted this check to happen only once, hence why I put it on the driver side. The result of this check needs to be available in Driver.cpp and in Cuda.cpp (see usage in this patch). Compilation keeps track of the flag because skipping the bundler is all or nothing: you can skip the bundler/unbundler for object files if and only if all toolchains you are offloading to can skip it.

3248

Currently this is only meant to work with ld, because we know for sure that it is a linker which supports partial linking. If other linkers also support it then they can be added here. Not finding ld will lead to the old scheme being used.

You are correct the test needs to be fixed.

lib/Driver/ToolChains/Clang.cpp
6109

The linker should always support partial linking at this point, hence the assert. Whether the linker supports partial linking is determined at a much earlier step, so by this point we should already have that information. If some corner case that I may have missed arises, this assert is triggered; but that would be really unexpected, since this action is only chosen when the linker supports partial linking.

lib/Driver/ToolChains/Cuda.cpp
551

I think my latest update to the patch description should clarify this part here.

gtbercea marked 2 inline comments as done.Jun 11 2018, 3:02 PM
gtbercea updated this revision to Diff 150947.Jun 12 2018, 8:22 AM

Added separate test.

gtbercea marked 2 inline comments as done.Jun 12 2018, 8:24 AM

Hi Doru,

Thanks for updating the patch. I've a few comments below.

include/clang/Driver/Compilation.h
314

I understand, but the way I see it is that it is the toolchain that skips the bundler, not the compilation. I understand that, as of this patch, you skip only if there is a single nvptx target. If you have more than one target, as some tests do, some toolchains will still need the bundler. So, we are making what happens with the nvptx target dependent on other toolchains. Is this an intended effect of this patch?

lib/Driver/Compilation.cpp
287

Ok, if that is the case, just add an assertion here.

lib/Driver/Driver.cpp
3229

Right, in these circumstances "can skip bundler" is the same as "do I have a single toolchain" and "is that toolchain nvptx". This is fairly inexpensive to do, so I don't really see the need to record this state in the driver. It will also be clearer what the conditions for skipping the bundler are.

lib/Driver/ToolChains/Cuda.cpp
511

Why not create fatbins instead of cubins in all cases? For the purposes of OpenMP they are equivalent, i.e. nvlink can interpret them in the same way, no?

532

I'd move this comment to the top of this section so that we know what is going on in the code above.

539

CamelCase

684

So, what if it is not a static library?

test/Driver/openmp-offload-gpu-linux.c
25

clang-offload-bundler should be sufficient here.

gtbercea marked 3 inline comments as done.Jul 31 2018, 6:19 AM

Answers to comments.

include/clang/Driver/Compilation.h
314

The bundler is skipped only for the OpenMP NVPTX toolchain. I'm not sure what you mean by "other toolchain".

lib/Driver/Compilation.cpp
287

If one of the toolchains in the list of toolchains can't skip then none of them skip. If all can skip then they all skip. What assertion would you like me to add?

lib/Driver/Driver.cpp
3229

That is true for now, but if more toolchains get added to the list of toolchains that can skip the bundler, then you want to factor it out and make it happen only once, at a toolchain-independent point in the code. Otherwise you will carry that list of toolchains everywhere in the code where you need to do the check.

Also, if you do this at the toolchain level you will not be able to check whether the other toolchains were able to skip or not. For now ALL toolchains must skip or ALL toolchains don't skip the bundler.

lib/Driver/ToolChains/Cuda.cpp
511

I'm not sure why the comment is attached to this particular line in the code.

But the reason I don't use fatbins everywhere is that I want to leave the previous toolchain intact. So when the bundler is not skipped, we do precisely what we did before.

gtbercea marked an inline comment as done.Jul 31 2018, 6:35 AM
sfantao added inline comments.Jul 31 2018, 12:39 PM
include/clang/Driver/Compilation.h
314

It is skipped for the NVPTX toolchain only if no "other" device toolchains are requested. Say I have a working pipeline that does static linking with nvptx correctly. Then, if on top of that I add another device to -fopenmp-targets, that pipeline will now fail even for nvptx, right?

lib/Driver/Compilation.cpp
287

If SkipOffloadBundler is set to true you don't expect it to be set to false afterwards, right? That should be asserted.

lib/Driver/ToolChains/Cuda.cpp
511

The comment was here, because this is where you generate the command to create the fatbin - no other intended meaning.

Given that a fatbin can be linked with nvlink to get a device cubin, the toolchain won't need to change regardless of whether bundling is used or not; for the bundler, the device images are just bits.

gtbercea added inline comments.Jul 31 2018, 2:04 PM
include/clang/Driver/Compilation.h
314

It's a choice between skipping the bundler and running the current, default mode with the bundler enabled. If targets other than NVPTX are present then we default to using the bundler for all toolchains.
There is no hybrid mode enabled where some targets use the bundler and some don't.

lib/Driver/Compilation.cpp
287

That's correct; I can add that, sure.

gtbercea updated this revision to Diff 159536.Aug 7 2018, 10:08 AM
gtbercea marked 3 inline comments as done.
  • Address comments.
lib/Driver/ToolChains/Cuda.cpp
684

Can it be anything else at this point?

gtbercea marked 2 inline comments as done.Aug 7 2018, 10:21 AM
gtbercea updated this revision to Diff 187979.Feb 22 2019, 1:41 PM
  • Update.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 22 2019, 1:41 PM
Herald added a subscriber: jdoerfert. · View Herald Transcript

Could you sketch for me how this will (potentially) work if we have multiple target vendors? The fatbin solution seems tailored to NVIDIA, but maybe I'm wrong here.

In any case, we need to make progress on this front and if this solution is compatible with other vendors we should get it in asap.

@xtian, @gregrodgers, @ddibyend please take a look or have someone take a look and comment.

lib/Driver/Driver.cpp
3972

unrelated

lib/Driver/ToolChains/Clang.cpp
6117

In "core-LLVM" we usually avoid these braces.

lib/Driver/ToolChains/Cuda.cpp
401

It might not be worth it to save CubinF here; you could instead create it 120 lines later.

547

You cannot hardcode clang++; it could be C code, and we don't want to cause interoperability problems and/or the warnings that will inevitably follow.

661

Could you add a comment here please?

687

By comparing this code with the one after the if (... endwith(".a")), it seems this is treated a bit differently than a static library below. I mention it only because of the comment above.