This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain
Needs Review · Public

Authored by gtbercea on May 25 2018, 2:33 PM.

Details

Summary

So far, the clang-offload-bundler has been the default tool for bundling together the various file types produced by the different OpenMP offloading toolchains supported by Clang. It does a great job for file types such as .bc, .ll, .ii, and .ast. It is also used for bundling object files, which are a special case: here the object files contain sections meant to be executed on devices other than the host (as is the case for the OpenMP NVPTX toolchain). The bundling of object files prevents:

  • STATIC LINKING: These bundled object files can be part of static libraries, which means that the object file requires an unbundling step. If an object file in a static library requires "unbundling", then we need to know the whereabouts of that library and of the files it contains before the actual link step, which makes it impossible to do static linking using the "-L/path/to/lib/folder -labc" flags (see the sketch after this list).
  • INTEROPERABILITY WITH OTHER COMPILERS: These bundled object files can end up being passed between Clang and other compilers, which may lead to incompatibilities: passing a bundled file from Clang to another compiler leaves that compiler unable to unbundle it, and passing an unbundled object file to Clang leaves Clang unaware that it does not need to be unbundled.
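To make the static-linking limitation concrete, here is a hypothetical build that fails with today's bundling scheme; the file and library names are made up for illustration:

  # Each object produced today is a clang-offload-bundler bundle.
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c abc.c -o abc.o
  ar rcs libabc.a abc.o

  # At link time the driver only sees -L. -labc; it cannot unbundle the
  # archive members before the device link, so the device code inside
  # abc.o never reaches NVLINK.
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.c -L. -labc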

Goal:
Disable the use of the clang-offload-bundler for bundling/unbundling object files which contain OpenMP NVPTX device offloaded code. This applies to the case where the following set of flags is passed to Clang:
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
When the above condition is not met the compiler works as it does today by invoking the clang-offload-bundler for bundling/unbundling object files (at the cost of static linking and interoperability).
The clang-offload-bundler usage on files other than object files is not affected by this patch.

Extensibility
Although this patch disables bundling/unbundling of object files via the clang-offload-bundler for the OpenMP NVPTX device offloading toolchain ONLY, this functionality can be extended to other platforms/systems where:

  • the device toolchain can produce a host-compatible object AND
  • partial linking of host objects is supported.

Current situation in trunk
In the current trunk the OpenMP device offloading toolchain performs the following steps depending on the input. Note: the clang-offload-bundler calls are part of the host toolchain but are shown here for clarity.

  • SCENARIO 1 (-c -o)
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.cpp -c -o input.o
RELEVANT COMPILATION STEPS: PTXAS --------[.cubin]-------> clang-offload-bundler --bundle
  • SCENARIO 2 (input object file):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.o 
RELEVANT  COMPILATION STEPS: clang-offload-bundler --unbundle --------[.cubin]-------> NVLINK
  • SCENARIO 3 (static linking):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -L/path/to/lib -lstatic input.cpp
RELEVANT  COMPILATION STEPS: PTXAS --------[.cubin]-------> NVLINK [STATIC LINKING FLAGS ARE IGNORED]
  • SCENARIO 4 (C/C++ compilation):
INPUT TO CLANG: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda input.cpp
RELEVANT  COMPILATION STEPS: PTXAS --------[.cubin]-------> NVLINK

In the current trunk, the object on which the device toolchain operates is always a pure device object, i.e. a cubin. This only works as long as NVLINK can operate on the cubins directly; when these cubins are part of a static library or are bundled, NVLINK no longer detects them.

The solution:
The solution to this problem involves several changes:

A. Make the device object file detectable by NVLINK in all situations (even when it is part of a static library).
To do this we need to add two steps to the OpenMP NVPTX device offloading toolchain for the case where an object file is created:

SCENARIO 1 changes to:

PTXAS --------[.cubin AND .s]-------> FATBINARY --------[.c]-------> CLANG++ --------[.o]-------> clang-offload-bundler --bundle
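Roughly, the added device-side steps correspond to invocations of the following shape; the GPU architecture, file names, and exact flags are illustrative rather than the literal job lines emitted by the driver:

  # 1. Assemble the PTX into a relocatable cubin (as before).
  ptxas -c --gpu-name sm_35 -o input.cubin input.s
  # 2. Wrap the cubin and the PTX into a fatbin embedded in a small C file.
  fatbinary -64 --create input.fatbin \
      --image=profile=sm_35,file=input.cubin \
      --image=profile=compute_35,file=input.s \
      --embedded-fatbin=input.fatbin.c
  # 3. Compile the wrapper into a host-compatible object.
  clang++ -c input.fatbin.c -o input-dev.o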

B. Since the new object we create in SCENARIO 1 is a host object, we no longer need a custom "bundling" scheme. [FIXES INTEROPERABILITY]

SCENARIO 1 changes to:

PTXAS --------[.cubin AND .s]-------> FATBINARY --------[.c]-------> CLANG++ --------[.o]-------> ld -r

!!!IMPORTANT!!!: ld -r is a host step shown here for completeness; it replaces the clang-offload-bundler call.
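A minimal sketch of that replacement step, assuming (hypothetically) that the host object is input-host.o and the device wrapper object produced by the steps above is input-dev.o:

  # Partially link the host object and the device wrapper object into the
  # single input.o requested by the user with -c -o input.o.
  ld -r input-host.o input-dev.o -o input.o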

C. With changes A & B we don't need to perform unbundling because the object files can now be passed directly to the OpenMP NVPTX device offloading toolchain. NVLINK detects the device part in each file automatically (no need for special unbundling step here).

SCENARIO 2 changes to:

--------[input.o]-------> NVLINK

D. Enable static linking by passing the input flags directly to the existing NVLINK step. NVLINK can now detect device objects even when they are packed in a static library (since they were created using FATBINARY + CLANG++). [FIXES STATIC LINKING]

SCENARIO 3 changes to:

PTXAS --------[.cubin]-------> NVLINK -L/path/to/lib -lstatic
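For illustration, the static-library case then reduces to something of the following shape; the library name, output name, and GPU architecture are again illustrative:

  # libstatic.a now contains objects whose device parts NVLINK can find.
  ar rcs libstatic.a abc.o
  # The user's -L/-l flags are forwarded to the existing NVLINK invocation.
  nvlink --arch=sm_35 -L/path/to/lib -lstatic input.cubin -o openmp-device-link.out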

Note that SCENARIO 4 remains unchanged.

This patch implements changes A, B, C and D in one go.

Diff Detail

Event Timeline

gtbercea created this revision.May 25 2018, 2:33 PM
gtbercea updated this revision to Diff 148677.May 25 2018, 3:04 PM

@gtbercea, thanks for the patch.

INTEROPERABILITY WITH OTHER COMPILERS: These bundled object files can end up being passed between Clang and other compilers, which may lead to incompatibilities: passing a bundled file from Clang to another compiler leaves that compiler unable to unbundle it, and passing an unbundled object file to Clang leaves Clang unaware that it does not need to be unbundled.

I don't understand the interoperability issue here. The Clang convention for creating the offloading records in the host binary is specific to clang and was discussed, defined, and documented. We do not expect other compilers to understand it, just as we don't expect clang to understand the conventions of other compilers. In terms of the binary itself, it is an ELF file that can be understood by other compilers/tools, albeit the linking would probably fail if the device section addresses are not defined.

I think that what you mean here is that by forcing nvcc as the host linker when one of the targets is an NVPTX target, you shift the (un)bundling part to nvlink, and you will be compatible with the host compilers supported by nvcc.

I understand that at the moment there is no immediate need for combining multiple targets, and there is an immediate need to support archives with offloading code. Therefore, changing the host linker depending on the offload target seems reasonable, as that is what would help most users.

I just want to note that the separation of the toolchains in the driver and the support for multiple toolchains in the same binary were part of the reason we converged to the current design. The idea was to have a generic driver with all the details relative to bundling formats handled by a separate tool, the bundler. Of course the requirements can be reviewed at any time and priorities can change. However, I think it would be cleaner to have the nvlink compatible binary generated by the bundler in the current design. Just my two cents.

sfantao added inline comments.May 29 2018, 8:36 AM
include/clang/Driver/Compilation.h
125

Use CamelCase for class local variables.

314

Why is this a property of the compilation and not of a set of actions referring to a given target? That would allow one to combine in the same compilation targets requiring the bundler and targets that wouldn't.

lib/Driver/Compilation.cpp
287

Given the logic you have below, you are assuming this is never set to false. It would be wise to add an assertion here in case you end up with some toolchains skipping and others not. If that is simply not supported, a diagnostic should be added instead.

The convention is that local variables use CamelCase.

lib/Driver/Driver.cpp
3213

CamelCase

3229

Can you just implement this check in the definition of Compilation::canSkipClangOffloadBundler and get rid of setSkipOffloadBundler? All the required information is already in Compilation under C.getInputArgs().

3248

In the current implementation there is no distinction between what is meant for Windows and what is meant for Linux. This check would only work on Linux, and the test below would fail for bots running Windows.

Also, I think it makes more sense to have this check part of the Toolchain instead of doing it in the driver. The Toolchain definition knows the names of the third-party executables, the driver doesn't.

lib/Driver/ToolChains/Clang.cpp
6109

I believe this check should be done when the toolchain is created with all the required diagnostics. What happens if the linker does not support partial linking?

lib/Driver/ToolChains/Cuda.cpp
551

What prevents all of this from being done in the bundler? If I understand it correctly, if the bundler implements this wrapping, all the checks for libraries wouldn't be required, and only two changes would be needed in the driver:

  • generate a fatbin instead of a cubin. This is straightforward to do by changing the device assembling job. In terms of loading the kernels through the device API, fatbin and cubin should be equivalent, except that fatbin also stores the PTX, enabling JIT compilation for newer GPUs.
  • Use the NVIDIA linker as the host linker.

This last requirement could be problematic if we get two targets attempting to use different (incompatible) linkers. If we get this kind of incompatibility, we should emit an appropriate diagnostic.

test/Driver/openmp-offload.c
497

We need a test for the static linking. The host linker has to be nvcc in that case, right?

Just to clarify one thing in my last comment:

When I say that we didn't aim at having clang compatible with other compilers, I mean the OpenMP offloading descriptors, where all the variables and offloading entry points are. Of course we want to allow the resulting binaries to be compatible with linkers taking inputs of other compilers, so that you can have, e.g., OpenMP and CUDA supported in the same executable, even though working independently.

gtbercea added inline comments.May 29 2018, 9:01 AM
test/Driver/openmp-offload.c
497

The host linker is "ld". The "bundling" step is replaced (in the case of OpenMP NVPTX device offloading only) by a call to "ld -r" to partially link the two object files: the object file produced by the HOST toolchain and the object file produced by the OpenMP NVPTX device offloading toolchain (because we want to produce a single output).

gtbercea added inline comments.May 29 2018, 9:02 AM
test/Driver/openmp-offload.c
497

nvcc is not called at all in this patch.

I am not quite familiar with the Clang driver setup, so I will add Greg for more comments. But I have hacked on the latest YKT tree to support a simple AMDGCN path the same way as NVPTX. The latest patch is here:

https://github.com/ROCm-Developer-Tools/hcc2-clang/commit/8c1cce0d39717c9e40ea70aea91e280673de756e

It is not upstreamed, but we can compile the same binary for both nvptx and amdgcn cards, as we designed.

If I understand this correctly, we are now switching to nvlink as the default host linker whenever nvptx is involved. I am concerned that this may cause trouble for integration with other platforms. Maybe this path should be under a special option even for nvptx?

gtbercea added inline comments.May 29 2018, 11:08 AM
lib/Driver/ToolChains/Cuda.cpp
551

What prevents it is the fact that the bundler is called AFTER the HOST and DEVICE object files have been produced. The creation of the fatbin (FATBINARY + CLANG++) needs to happen within the NVPTX toolchain.

Just to clarify one thing in my last comment:

When I say that we didn't aim at having clang compatible with other compilers, I mean the OpenMP offloading descriptors, where all the variables and offloading entry points are. Of course we want to allow the resulting binaries to be compatible with linkers taking inputs of other compilers, so that you can have, e.g., OpenMP and CUDA supported in the same executable, even though working independently.

Today you will have trouble linking against a Clang object file in another compiler that doesn't know anything about the clang-offload-bundler.

tra added a comment.May 29 2018, 11:37 AM

"Interoperability with other compilers" is probably a statement that's a bit too strong. At best it's kind of compatible with CUDA tools and I don't think it's feasible for other compilers. I.e. it will be useless for AMD GPUs and whatever compiler they use.

In general it sounds like you're going back to what regular CUDA compilation pipeline does:

  • [clang] C++->.ptx
  • [ptxas] .ptx -> .cubin
  • [fatbin] .cubin -> .fatbin
  • [clang] C++ + .fatbin -> host .o
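For reference, those are the same sub-steps that a single clang CUDA compilation drives internally; a minimal sketch (file name and GPU architecture are illustrative):

  # One driver invocation runs the device-side ptxas/fatbinary steps and then
  # embeds the resulting fatbin into the host object (via -fcuda-include-gpubinary
  # at the cc1 level).
  clang++ -x cuda --cuda-gpu-arch=sm_35 -c axpy.cu -o axpy.o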

On one hand I can see how being able to treat GPU-side binaries as any other host files is convenient. On the other hand, this convenience comes with the price of targeting only NVPTX. This seems contrary to OpenMP's goal of supporting many different kinds of accelerators. I'm not sure what's the consensus in the OpenMP community these days, but I vaguely recall that generic bundling/unbundling was explicitly chosen over vendor-specific encapsulation in host .o when the bundling was implemented. If the underlying reasons have changed since then it would be great to hear more details about that.

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

In D47394#1115086, @tra wrote:

On one hand I can see how being able to treat GPU-side binaries as any other host files is convenient. On the other hand, this convenience comes with the price of targeting only NVPTX. This seems contrary to OpenMP's goal of supporting many different kinds of accelerators. I'm not sure what's the consensus in the OpenMP community these days, but I vaguely recall that generic bundling/unbundling was explicitly chosen over vendor-specific encapsulation in host .o when the bundling was implemented. If the underlying reasons have changed since then it would be great to hear more details about that.

I second this statement: static linking might come in handy for all targets, and Clang should try to avoid vendor-specific solutions as much as possible.

In a discussion off-list I proposed adding constructor functions to all object files and handle them like shared libraries are already handled today (ie register separately and let the runtime figure out how to relocate symbols in different translation units). I don't have an implementation of that approach so I can't claim that it works and doesn't have a huge performance impact (which we don't want either), but it should be agnostic of the offloading target so it may be worth investigating.

In a discussion off-list I proposed adding constructor functions to all object files and handle them like shared libraries are already handled today (ie register separately and let the runtime figure out how to relocate symbols in different translation units). I don't have an implementation of that approach so I can't claim that it works and doesn't have a huge performance impact (which we don't want either), but it should be agnostic of the offloading target so it may be worth investigating.

I don't understand how this would work. Doing something like that would require reimplementing the GPU-code linker, which requires knowing proprietary information about the GPU binary format. I wouldn't know how to resolve all the relocations in the device code. In my view, the solution would only work (or at least be more easily implemented) if we don't have relocatable code.

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

Cool, I agree this is worth investigating.

lib/Driver/ToolChains/Cuda.cpp
551

Why does it have to happen in the NVPTX toolchain? You are making the NVPTX toolchain generate an ELF object from another toolchain, right? What I'm suggesting is to do the stuff that mixes two (or more) toolchains in the bundler. Your inputs are still a fatbin and a host file.

test/Driver/openmp-offload.c
497

Ok, so how do you link device code? I.e., if you have two compilation units that depend on each other (some definition in one unit is used in the other), where are they linked together? Something has to understand the two files resulting from your "ld -r" step; my understanding is that that something is nvcc, which calls nvlink behind the scenes, right? So, nvcc will do the unbundling+linking bit, right?

Assuming we do proceed with back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

I tried this example (https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/). It worked with NVCC but not with clang++. I can produce the main.o particle.o and v.o objects as relocatable (-fcuda-rdc) but the final step fails with a missing reference error.
This leads me to believe that embedding the CUDA fatbin code in the host object comes with limitations. If I were to change the OpenMP NVPTX toolchain to do the same then I would run into similar problems.
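For context, a hypothetical reconstruction of that experiment follows; these are not necessarily the exact commands that were used, and the GPU architecture is illustrative:

  # nvcc: relocatable device code, with an implicit device-link step -- works.
  nvcc -arch=sm_35 -rdc=true main.cu particle.cu v.cu -o app

  # clang: relocatable device objects can be produced with -fcuda-rdc, but no
  # device-link step is performed, so the final host link fails with an
  # undefined __cudaRegisterLinkedBinary_* reference.
  clang++ --cuda-gpu-arch=sm_35 -fcuda-rdc -c main.cu particle.cu v.cu
  clang++ main.o particle.o v.o -o app -L/usr/local/cuda/lib64 -lcudart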

On the other hand, when the example is ported to use OpenMP declare target regions (instead of device), it all compiles, links, and runs correctly.

In general, I feel that if we go the way you propose then the solution is truly confined to NVPTX. If we instead implement a scheme like the one in this patch then we give other toolchains a chance to perhaps fill the nvlink "gap" and eventually be able to handle offloading in a similar manner and support static linking.

tra added a comment.May 31 2018, 3:23 PM

I tried this example (https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/). It worked with NVCC but not with clang++. I can produce the main.o particle.o and v.o objects as relocatable (-fcuda-rdc) but the final step fails with a missing reference error.

It's not clear what exactly you mean by the "final step" and what exactly was the error. Could you give me more details?

This leads me to believe that embedding the CUDA fatbin code in the host object comes with limitations. If I were to change the OpenMP NVPTX toolchain to do the same then I would run into similar problems.

It's a two-part problem.

In the end, we need to place GPU-side binary (whether it's an object or an executable) in a way that CUDA tools can recognize. You should end up with pretty much the same set of bits. If clang currently does not do that well enough, we should fix it.

The second part is what we do about the GPU-side object files. NVCC has some under-the-hood magic that invokes nvlink. If we invoke clang for the final linking phase, it has no idea that some of the .o files may have GPU code in them that needs extra steps before we can pass everything to the linker to produce the host executable. IMO the linking of GPU-side objects should be done outside of clang, i.e. one could do it with an extra build rule which would invoke nvcc --device-link ... to link all GPU-side objects into a GPU executable, still wrapped in a host .o, which can then be linked into the host executable.
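Such a build rule might look roughly like this; the object and output names are hypothetical:

  # Device-link the GPU code carried by a.o and b.o; the result is a host .o
  # wrapping the linked GPU executable.
  nvcc --device-link a.o b.o -o gpu-linked.o
  # The host linker then consumes the original objects plus the device-link stub.
  clang++ a.o b.o gpu-linked.o -o app -L/usr/local/cuda/lib64 -lcudart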

On the other hand, when the example is ported to use OpenMP declare target regions (instead of device), it all compiles, links, and runs correctly.

In general, I feel that if we go the way you propose then the solution is truly confined to NVPTX. If we instead implement a scheme like the one in this patch then we give other toolchains a chance to perhaps fill the nvlink "gap" and eventually be able to handle offloading in a similar manner and support static linking.

I'm not sure how "fatbin + clang -fcuda-include-gpubinary" is any more confining to NVPTX than "fatbin + clang + ld -r" -- either way you rely on an NVIDIA-specific tool. If at some point you find it too confining, changing either of those will require pretty much the same amount of work.

gtbercea added a comment.EditedMay 31 2018, 5:17 PM

The error is related to lack of device linking, just like you explained two paragraphs down. This is the error I get:

main.o: In function `__cuda_module_ctor':
main.cu:(.text+0x674): undefined reference to `__cudaRegisterLinkedBinary__nv_c5b75865'

You nailed the problem on the head: the device linking step is the tricky bit.

The OpenMP NVPTX device offloading toolchain has the advantage that it already calls NVLINK (upstreamed a long time ago). This patch doesn't change that. This patch "fixes" (for lack of a better word) the way in which objects are created on the device side. By adding the FATBINARY + CLANG++ steps, I ensure that the existing call to NVLINK will be able to "detect" the device part of object files and archived object files (static libraries). This is not the case in today's compiler, where NVLINK is not able to do so when passed a static library.

In general, for offloading toolchains, I don't see the reliance on vendor-specific tools as a problem if and only if the calls to vendor-specific tools remain confined to a device-specific toolchain. This patch respects this condition. All the calls to CUDA tools in this patch are part of the OpenMP NVPTX device offloading toolchain (which is an NVPTX device-specific toolchain).

The only host-side change is the call to "ld -r", which replaces a call to the clang-offload-bundler tool.

The error is related to lack of device linking, just like you explained two paragraphs down. This is the error I get:

main.o: In function `__cuda_module_ctor':
main.cu:(.text+0x674): undefined reference to `__cudaRegisterLinkedBinary__nv_c5b75865'

That's because I didn't implement linking for relocatable device code in CUDA; you have to use nvcc for that. Please see https://clang.llvm.org/docs/ReleaseNotes.html#cuda-support-in-clang and the original patch D42922.

The OpenMP NVPTX device offloading toolchain has the advantage that it already calls NVLINK (upstreamed a long time ago). This patch doesn't change that. This patch "fixes" (for lack of a better word) the way in which objects are created on the device side. By adding the FATBINARY + CLANG++ steps, I ensure that the existing call to NVLINK will be able to "detect" the device part of object files and archived object files (static libraries). This is not the case in today's compiler, where NVLINK is not able to do so when passed a static library.

In general, for offloading toolchains, I don't see the reliance on vendor-specific tools as a problem if and only if the calls to vendor-specific tools remain confined to a device-specific toolchain. This patch respects this condition. All the calls to CUDA tools in this patch are part of the OpenMP NVPTX device offloading toolchain (which is an NVPTX device-specific toolchain).

I disagree in this context because this patch currently means that static archives will only work with NVPTX and there is no clear path how to "fix" things for other offloading targets. I'll try to work on my proposal over the next few days (sorry, very busy week...), maybe I can put together a prototype of my idea.

kkwli0 added a subscriber: kkwli0.Jun 1 2018, 4:56 AM

I disagree in this context because this patch currently means that static archives will only work with NVPTX and there is no clear path how to "fix" things for other offloading targets. I'll try to work on my proposal over the next few days (sorry, very busy week...), maybe I can put together a prototype of my idea.

Other toolchains can also have static linking if they:

  1. ditch the clang-offload-bundler for generating/consuming object files.
  2. implement a link step in the device toolchain which can identify the vendor-specific object file inside the host object file (this is how the so-called "bundling" should have been done in the first place, not using a custom tool which limits the functionality of the compiler). Identifying toolchain-specific objects/binaries is a task that belongs within the device-specific toolchain and SHOULD NOT be factored out, because you can't treat objects that are different by definition in the same way. Ignoring their differences leads to those objects not being linkable. On top of that, factoring it out introduces custom object formats which only CLANG understands AND it introduces compilation steps that impede static linking.

I'm surprised you now disagree with this technique, when I first introduced you to this in an e-mail off list you agreed with it.

So in this patch, the only new CUDA tool that is called is FATBINARY, which is invoked on the device-specific side of the toolchain, so you can't possibly object to that. The CUDA toolchain already calls FATBINARY, so it's not a novelty. That step is essential to making device-side objects identifiable by NVLINK (which we already call).

The only step you might object to is the partial linking step which, as I explained in my original post, I envisage will be improved over time as more toolchains support this scheme. I think this is a true solution to the problem. What you are proposing is a workaround that doesn't tackle the problem head-on.

I'm surprised you now disagree with this technique, when I first introduced you to this in an e-mail off list you agreed with it.

My words were that I agree this is the best solution for NVPTX. In the same reply I asked how your proposal is supposed to work for other offloading targets, which is now clear to require additional work, maybe even completely novel tools.
So now I disagree that it is the right solution for Clang because I think my proposal will cover all offloading targets. Please give me a bit of time so that I can see if it works.

guraypp added a subscriber: guraypp.Jun 1 2018, 7:27 AM

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

You are absolutely right, thanks for the link. Maybe we should also document somewhere that we don't support that either for OpenMP offloading to NVPTX?

I think this basically renders my approach useless as I meant to compile each device object file for offloading targets directly to a shared library. Those could have been put together at runtime by just loading (and registering) them in the right order. That way we would have been able to keep clang-offload-bundler in its current target-agnostic form and wouldn't have needed to appease proprietary tools such as nvlink.

With that knowledge I see no other way than what this patch proposes. (I still don't particularly like it because it requires each toolchain to implement their own magic.) Sorry for the delay and my disagreement based on wrong assumptions that I wasn't able to verify as soon as I'd have liked to.

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the Cuda toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

You are absolutely right, thanks for the link. Maybe we should also document somewhere that we don't support that either for OpenMP offloading to NVPTX?

I think this basically renders my approach useless as I meant to compile each device object file for offloading targets directly to a shared library. Those could have been put together at runtime by just loading (and registering) them in the right order. That way we would have been able to keep clang-offload-bundler in its current target-agnostic form and wouldn't have needed to appease proprietary tools such as nvlink.

With that knowledge I see no other way than what this patch proposes. (I still don't particularly like it because it requires each toolchain to implement their own magic.) Sorry for the delay and my disagreement based on wrong assumptions that I wasn't able to verify as soon as I'd have liked to.

No problem at all.

I will update the description of the patch with more information. I had some very useful e-mail exchanges with @sfantao which I will try to work into the description of the patch.

gtbercea edited the summary of this revision. (Show Details)Jun 4 2018, 10:02 AM
gtbercea edited the summary of this revision. (Show Details)Jun 4 2018, 10:05 AM
gtbercea edited the summary of this revision. (Show Details)
tra added a comment.Jun 5 2018, 3:48 PM

With the updated patch description + the discussion I'm OK with the approach from the general "how do we compile/use CUDA" point of view. I'll leave the question of whether the approach works for OpenMP to someone more familiar with it.

While I'm not completely convinced that [fatbin]->.c->[clang]->.o (with device code only)->[ld -r] -> host.o (host+device code) is ideal (things could be done with a smaller number of tool invocations), it should help to deal with -rdc compilation until we get a chance to improve support for it in Clang. We may revisit and change this portion of the pipeline when clang can incorporate -rdc GPU binaries in a way compatible with CUDA tools.

tra removed a reviewer: tra.Jun 5 2018, 3:48 PM
tra added a subscriber: tra.
In D47394#1123044, @tra wrote:

While I'm not completely convinced that [fatbin]->.c->[clang]->.o (with device code only)->[ld -r] -> host.o (host+device code) is ideal (things could be done with a smaller number of tool invocations), it should help to deal with -rdc compilation until we get a chance to improve support for it in Clang. We may revisit and change this portion of the pipeline when clang can incorporate -rdc GPU binaries in a way compatible with CUDA tools.

I think this should work with current trunk; Clang puts the GPU binary into a section called __nv_relfatbin when also passing -fcuda-rdc (see D42922).
What will probably result in problems are the registration functions as shown above by @gtbercea (undefined references...). But as we don't need them for OpenMP (we have our own registration machinery) it might be worth implementing something like -fno-cuda-registration. Maybe then clang -cc1 <host> -fcuda-include-gpubinary <device> -fcuda-rdc -fno-cuda-registration can be used to embed the device object, replacing the dance ending in ld -r?

@tra Thank you for your comments and help with the patch.

yaxunl added a subscriber: yaxunl.Jun 11 2018, 11:58 AM
gtbercea marked 3 inline comments as done.Jun 11 2018, 12:52 PM
gtbercea added inline comments.
include/clang/Driver/Compilation.h
314

This was a way to pass this information to the OpenMP NVPTX device toolchain.

Both the Driver and the OpenMP NVPTX toolchain need to agree on the usage of the new scheme (proposed in this patch) or the old scheme (the one that is in the compiler today).

lib/Driver/Compilation.cpp
287

The checks I added in the Driver will set this flag to true if all toolchains Clang offloads to support skipping the bundler/unbundler for object files. Currently only the NVPTX toolchain can skip the bundler/unbundler for object files, so the code path in this patch will be taken only for:

-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

lib/Driver/Driver.cpp
3229

The driver needs to have the result of this check available. The flag is passed to the step which adds host-device dependencies. If the bundler can be skipped then the unbundling action is not required.

I guess this could be implemented in Compilation. Even so, I would like it to happen only once, as it does here, and not every time someone queries the "can I skip the bundler" flag.

I wanted this check to happen only once, hence why I put it on the driver side. The result of this check needs to be available in Driver.cpp and in Cuda.cpp (see usage in this patch). Compilation keeps track of the flag because skipping the bundler is all or nothing: you can skip the bundler/unbundler for object files if and only if all toolchains you are offloading to can skip it.

3248

Currently this is only meant to work with ld, because we know for sure that it is a linker which supports partial linking. If other linkers also support it then they can be added here. Not finding ld will lead to the old scheme being used.

You are correct the test needs to be fixed.

lib/Driver/ToolChains/Clang.cpp
6109

The linker should always support partial linking at this point, hence the assert. Whether the linker supports partial linking is determined at a much earlier step, so by this point we should already have that information. If some corner case that I may have missed arises, this assert is triggered; but that would be really unexpected, since this action is only chosen when the linker supports partial linking.

lib/Driver/ToolChains/Cuda.cpp
551

I think my latest update to the patch description should clarify this part here.

gtbercea marked 2 inline comments as done.Jun 11 2018, 3:02 PM
gtbercea updated this revision to Diff 150947.Jun 12 2018, 8:22 AM

Added separate test.

gtbercea marked 2 inline comments as done.Jun 12 2018, 8:24 AM

Hi Doru,

Thanks for updating the patch. I've a few comments below.

include/clang/Driver/Compilation.h
314

I understand, but the way I see it is that it is the toolchain that skips the bundler, not the compilation. I understand that, as of this patch, you skip only if there is a single nvptx target. If you have more than one target, as some tests do, some toolchains will still need the bundler. So, we are making what happens with the nvptx target dependent on other toolchains. Is this an intended effect of this patch?

lib/Driver/Compilation.cpp
287

Ok, if that is the case, just add an assertion here.

lib/Driver/Driver.cpp
3229

Right, in these circumstances "can skip bundler" is the same as "do I have a single toolchain" and "is that toolchain nvptx". This is fairly inexpensive to do, so I don't really see the need to record this state in the driver. It will also be clearer what the conditions for skipping the bundler are.

lib/Driver/ToolChains/Cuda.cpp
511

Why not create fatbins instead of cubins in all cases? For the purposes of OpenMP they are equivalent, i.e. nvlink can interpret them in the same way, no?

532

I'd move this comment to the top of this section so that we know what is going on in the code above.

539

CamelCase

684

So, what if it is not a static library?

test/Driver/openmp-offload-gpu-linux.c
25

clang-offload-bundler should be sufficient here.

gtbercea marked 3 inline comments as done.Jul 31 2018, 6:19 AM

Answers to comments.

include/clang/Driver/Compilation.h
314

The bundler is skipped only for the OpenMP NVPTX toolchain. I'm not sure what you mean by "other toolchain".

lib/Driver/Compilation.cpp
287

If one of the toolchains in the list of toolchains can't skip then none of them skip. If all can skip then they all skip. What assertion would you like me to add?

lib/Driver/Driver.cpp
3229

That is true for now, but if more toolchains get added to the list of toolchains that can skip the bundler, then you want to factor it out and make it happen only once, at a toolchain-independent point in the code. Otherwise you will carry that list of toolchains everywhere in the code where you need to do the check.

Also, if you do this at the toolchain level you will not be able to check whether the other toolchains were able to skip or not. For now ALL toolchains must skip or ALL toolchains don't skip the bundler.

lib/Driver/ToolChains/Cuda.cpp
511

I'm not sure why the comment is attached to this particular line in the code.

But the reason I don't use fatbins everywhere is that I want to leave the previous toolchain intact. So when the bundler is not skipped, we do precisely what we did before.

gtbercea marked an inline comment as done.Jul 31 2018, 6:35 AM
sfantao added inline comments.Jul 31 2018, 12:39 PM
include/clang/Driver/Compilation.h
314

It is skipped for the NVPTX toolchain only if no "other" device toolchains are requested. Say I have a working pipeline that does static linking with nvptx correctly. Then, if on top of that I add another device to -fopenmp-targets, that pipeline will now fail even for nvptx, right?

lib/Driver/Compilation.cpp
287

If SkipOffloadBundler is set to true you don't expect it to be set to false afterwards, right? That should be asserted.

lib/Driver/ToolChains/Cuda.cpp
511

The comment was here, because this is where you generate the command to create the fatbin - no other intended meaning.

Given that a fatbin can be linked with nvlink to get a device cubin, the toolchain won't need to change regardless of whether bundling is used or not; for the bundler, the device images are just bits.

gtbercea added inline comments.Jul 31 2018, 2:04 PM
include/clang/Driver/Compilation.h
314

It's a choice between skipping the bundler and running the current, default mode with the bundler enabled. If targets other than NVPTX are present then we default to using the bundler for all toolchains.
There is no hybrid mode enabled where some targets use the bundler and some don't.

lib/Driver/Compilation.cpp
287

That's correct; I can add that, sure.

gtbercea updated this revision to Diff 159536.Aug 7 2018, 10:08 AM
gtbercea marked 3 inline comments as done.
  • Address comments.
lib/Driver/ToolChains/Cuda.cpp
684

Can it be anything else at this point?

gtbercea marked 2 inline comments as done.Aug 7 2018, 10:21 AM
gtbercea updated this revision to Diff 187979.Feb 22 2019, 1:41 PM
  • Update.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 22 2019, 1:41 PM
Herald added a subscriber: jdoerfert. · View Herald Transcript

Could you sketch for me how this will (potentially) work if we have multiple target vendors? The fatbin solution seems tailored to NVIDIA, but maybe I'm wrong here.

In any case, we need to make progress on this front and if this solution is compatible with other vendors we should get it in asap.

@xtian, @gregrodgers, @ddibyend please take a look or have someone take a look and comment.

lib/Driver/Driver.cpp
3972

unrelated

lib/Driver/ToolChains/Clang.cpp
6117

In "core-LLVM" we usually avoid these braces.

lib/Driver/ToolChains/Cuda.cpp
401

It might not be worth it to save CubinF here; you could instead create it 120 lines later.

547

You cannot hardcode clang++; it could be C code, and we don't want to cause interoperability problems and/or the warnings that will inevitably follow.

661

Could you add a comment here please?

687

By comparing this code with the one after the if (... endwith(".a")), it seems this is treated a bit differently than a static library below. I mention it only because of the comment above.