This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP] Driver support for OpenMP offloading
AbandonedPublic

Authored by sfantao on May 20 2015, 10:42 AM.

Details

Summary

With a full implementation of OpenMP 3.1 already available upstream, we aim at continuing that work and adding support for OpenMP 4.0 as well. One important component introduced by OpenMP 4.0 is offloading, which enables the execution of a given structured block to be transferred to a device other than the host.

An implementation of the OpenMP offloading infrastructure in clang is proposed in http://goo.gl/L1rnKJ. This document is already a second iteration that includes contributions from several vendors and members of the LLVM community. It was published in http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084304.html for discussion by the community, and so far we haven't had any major concerns about the design.

Unlike other OpenMP components, offloading requires support from the compiler driver, given that for the same source file several (host and target) objects will be generated using potentially different toolchains. At the same time, the compiler needs a mechanism to relate variables on the host with the ones generated for the target, so communication between toolchains is required. The way this relation is supported by the driver also has implications for code generation.

This patch proposes an implementation of the driver support for offloading. The following summarizes the main changes this patch introduces:

a) clang can be invoked with -fopenmp=libiomp5 -omptargets=triple1,…,tripleN, where triple1 through tripleN are the target triples the user wants to be able to offload to.
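For illustration, an invocation following this scheme might look like the line below (the triples and file names are just examples):

clang -fopenmp=libiomp5 -omptargets=nvptx64-nvidia-cuda,x86_64-unknown-linux-gnu foo.c -o foo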

b) The driver detects whether the offloading triples are valid and whether the corresponding toolchain is prepared to offload. This patch only enables offloading for Linux toolchains.

c) Each target compile phase takes the host IR (the result of the host compile phase) as a second input. This enables host code generation to specify, in the form of metadata, the variables that should be emitted for the target, and that metadata can be read by the target frontend.

d) Given that the same host IR result is used by the different toolchains, the driver keeps a cache of results in order to avoid emitting the job that generates a given result twice.

e) Offloading leverages the argument translation functionality in order to convert host arguments into target arguments. This is currently used to make sure a shared library is always produced by the target toolchain - a library that can be loaded by the OpenMP runtime library.

f) The target shared libraries are embedded into the host binary by using a linker script produced by the driver and passed to the host linker.

g) The driver passes the frontend an offloading option that specifies whether the frontend is producing code for a target. This is required because code generation for the target and the host has to be different.

h) A full path to the original source file is passed to the frontend so it can be used to produce unique IDs that are the same for the host and targets.

Thanks!
Samuel

Diff Detail

Event Timeline

sfantao updated this revision to Diff 26158.May 20 2015, 10:42 AM
sfantao retitled this revision from to [OPENMP] Driver support for OpenMP offloading.
sfantao updated this object.
sfantao edited the test plan for this revision. (Show Details)
sfantao added a subscriber: Unknown Object (MLST).
rjmccall edited edge metadata.May 21 2015, 9:55 AM

Hmm. Using the host IR as an implicit line of communication is an interesting approach. Can you expand on what kind of information needs to flow from host to target there, or at least link to a place in the previous discussion?

sfantao added a comment.EditedMay 21 2015, 2:20 PM

Hi John

Thanks for looking into this patch!

Sure, let me expand on the host-target communication. Just a little bit of context before I do that:

During code generation, the target frontend has to decide whether a given declaration or target region has to be emitted or not. Without any information communicated from the host frontend, this decision becomes complicated for cases like:

  • #pragma omp target regions in static functions or class members;
  • static declarations delimited by #pragma omp declare target regions that end up not being used;
  • #pragma omp target in template functions.

In order for the target frontend to correctly identify all the declarations that need to be emitted, it would have to somehow emulate the actions done by the host frontend, which would make code generation messy in places that are not even related to OpenMP.

On top of that, in order to have an efficient mapping between host and target entries (global declarations/target regions) in the entries table (this is discussed in the document, in section 5.1, where __tgt_offload_entry is introduced), the compiler would have to emit the corresponding entries on the host and target sides in the same order. This is useful for devices whose toolchain maintains the order of the symbols, given that the order of the entries in the host and target tables will then be the same after linking, so knowing an index is enough to do the mapping. In order for that to happen, the target frontend would have to know that order, which would also be hard to extract if no information is communicated from the host.
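To make the index-based mapping concrete, here is a minimal sketch in C; the entry layout below is only an illustration of the concept (the actual __tgt_offload_entry definition is the one in the design document, not this):

struct __tgt_offload_entry {
  void *addr; /* address of the global or of the outlined target region */
  char *name; /* unique name, identical on the host and target sides */
  long size;  /* size of the global, 0 for target regions */
};

/* If host and target toolchains emit their entries in the same order, the
   i-th host entry corresponds to the i-th target entry, so an index lookup
   is enough to map a host address to its target counterpart. */
void *host_to_target(struct __tgt_offload_entry *host_entries,
                     struct __tgt_offload_entry *target_entries,
                     int num_entries, void *host_addr) {
  for (int i = 0; i < num_entries; ++i)
    if (host_entries[i].addr == host_addr)
      return target_entries[i].addr;
  return 0;
}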

So, the information that needs to be propagated to make what I described above possible is basically i) the declaration mangled names and ii) the order they were emitted in. This information could be communicated in the form of metadata that is emitted by the host frontend when the module is released and loaded by the target frontend when CGOpenMPRuntime is created. This information, however, has to be encoded in slightly different ways for different kinds of declarations. Let me explain this with an example:

//######################################
#define N 64 // array bound used below; matches the mangled name _ZN7MyClass3addERA64_i

#pragma omp declare target
struct MyClass{
  ...

  MyClass &add(MyClass &op){...}
  
  MyClass &add(int (&op)[N]){...}
  
  bool eq(MyClass &op){...}

  MyClass() {...}
  
  ~MyClass() {...}
};

MyClass C;
MyClass D;
#pragma omp end declare target 	

void foo(){
  int AA[N];
  MyClass H, T;
  MyClass HC;
  
  ...

  #pragma omp target
  {
    MyClass TC;
    T.add(AA).add(HC);
  }
  
  if (H.eq(T)) {...}
  
  #pragma omp target
  {
    T.add(AA);
  } 
}
//######################################

I was planning the metadata for this example to look more or less like this:

; Named metadata that encloses all the offloading information
!openmp.offloading.info = !{!1, !2, !3, !4, !5, !6, !7, !8, !9, !10}

; Global variables that require a map between host and target:
; Entry 0 -> ID for this type of metadata (0)
; Entry 1 -> Mangled name of the variable
; Entry 2 -> Order it was emitted
!1 = !{i32 0, !"C", i32 0}
!2 = !{i32 0, !"D", i32 2}

; Functions with target regions
; Entry 0 -> ID for this type of metadata (1)
; Entry 1 -> Mangled name of the function that was emitted for the host and encloses target regions
; Entries 2-n -> Order in which the target regions in the function (in the same sequence the statements appear) are emitted
!3 = !{i32 1, !"_Z3foov", i32 4, i32 5}

; Global initializers
; Entry 0 -> ID for this type of metadata (2)
; Entries 1-n -> Order in which the initializers are emitted, in descending order of priority (we will require a target region per set of initializers with the same priority)
!4 = !{i32 2, i32 6}

; Global Dtors
; Entry 0 -> ID for this type of metadata (3)
; Entry 1 -> Mangled name of the variable to be destructed 
; Entry 2 -> Order in which the destructor was emitted (we will have a target region per variable being destroyed - this can probably be optimized)
!5 = !{i32 3, !"C", i32 1}
!6 = !{i32 3, !"D", i32 3}

; Other functions that should be emitted for the target but do not need to be mapped to the host
; Entry 0 -> ID for this type of metadata (4)
; Entry 1 -> Mangled name of the function that has to be emitted.
!7 = !{i32 4, !"_ZN7MyClass3addERA64_i"}
!8 = !{i32 4, !"_ZN7MyClass3addERS_"}
!9 = !{i32 4, !"_ZN7MyClassC2Ev"}
!10 = !{i32 4, !"_ZN7MyClassD2Ev"}

I realize this is the kind of information I should propose as a patch to the codegen part of offloading, but I think it makes sense to discuss it now as the driver has to enable it.

I also foresee the communication between target and host to be useful for other cases, like the propagation of alias information from host to target. I don't, however, have a proposal for that at this moment.

Hope I haven’t been either too brief or too exhaustive! Let me know if I can clarify anything else for you.

Thanks!
Samuel

Are there any other comments or questions about this patch?

Many thanks!
Samuel

I've just noticed Chad is owning the compiler driver, so I believe he should also be added to the list of reviewers of this patch.

Thanks!
Samuel

echristo edited edge metadata.Jun 15 2015, 1:04 PM

Quite a big patch, I'd definitely like to take a look at this as well. It relates to how some of the cuda work is progressing too.

Thanks!

-eric

Thanks eric,

Please let me know any comments you may have.

I agree the patch is quite big... I had a hard time trying to find a better partition that would make sense - this requires small but related changes in several places - and that would map to something meaningful in terms of the regression tests. If you see a good way to partition the patch, let me know and I will gladly do it.

Thanks again!
Samuel

tra added a subscriber: tra.Aug 19 2015, 5:06 PM

I think this has to be updated for the current trunk...

sfantao updated this revision to Diff 36263.EditedOct 1 2015, 11:22 AM

This diff refactors the original patch and is rebased on top of the latest offloading changes introduced for CUDA.

Here I don't touch the CUDA support. I tried, however, to make the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view, OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead, it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.

Chris, Art, I understand you have worked out the latest CUDA changes so any feedback from you is greatly appreciated!

Here are few more details about this diff:

a) Add tool to bundle and unbundle corresponding host and device files into a single one.

One of the goals of OpenMP offloading is to enable users to offload with little effort, by annotating the code with a few pragmas. I'd also like to save users the trouble of changing their existing applications' build systems. So having the compiler always return a single file, instead of one for the host and one for each target, is desirable even if the user is doing separate compilation.

This diff includes a tool named clang-offload-bundler (happy to change the name or even include it in the driver if someone thinks that is the best direction to go) that is used on all input files that are not source files to unbundle them, and on top-level jobs that are not linking jobs to bundle the results obtained for the host and each target.

The format of the bundled files is currently very simple: text formats are concatenated, with comments that contain a magic string and the target-identifying triple in between, and binary formats have a header that contains the triple and the offset and size of the code for the host and each target.
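As a purely illustrative sketch of the text case (the marker string below is a placeholder; the actual magic string and comment syntax are whatever the bundler tool defines):

; <BUNDLE-MARKER> x86_64-unknown-linux-gnu    ; host part, identified by its triple
; ... host IR or assembly ...
; <BUNDLE-MARKER> nvptx64-nvidia-cuda         ; device part for one of the -omptargets triples
; ... device IR or assembly ...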

This tool still has to be improved in the future to deal with archive files so that each individual file in the archive is properly dealt with. We see that archives are very commonly used in current applications to combine separate compilation results, so I'm convinced users would enjoy this feature.

b) The building of the driver actions is unchanged.

I don't create device-specific actions. Instead, only bundling/unbundling actions are inserted as the first or last action if the file type requires that.

c) Add offloading kind to ToolChain

Offloading does not require a new toolchain to be created. Existing toolchains are used, and the offloading kind drives specific behavior in each toolchain so that valid device code is generated.

This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.

d) Use Job results cache to easily use host results in device actions and vice-versa.

An array of the results for each job is kept so that the device job can use the result previously generated for the host and use it as input, or vice versa.

In OpenMP, the device declarations have to be communicated from the host frontend to the device frontend, so this is used to conveniently pass that information. Unlike CUDA, OpenMP doesn't already have outlined functions with "device" attributes that the frontend can rely on to decide what should be emitted.

The result cache can also be updated to keep the information required for the CUDA implementation to decide how host/device binaries are combined (injection is the term used in the code). I don't have a concrete proposal for that, however, given that it is not clear to me what the plans are for CUDA to support separate compilation; I understand that the CUDA binary is inserted directly into the host IR (Art, can you shed some light on this?).

e) Use a compiler-generated linker script to do the device/host code combining and correctly support separate compilation.

Currently the OpenMP support in the toolchains is only implemented for Generic GCC targets, and a linker script is used to embed the resulting device images into ELF sections of the host binary. Also, the linker script defines the symbols that are used by the generated code, so that the addresses of the images can be easily retrieved.
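For illustration, the generated script could look roughly like the sketch below. This is only a rough sketch of the technique, not the exact script the driver emits; the section and symbol names are taken from the link errors quoted later in this thread:

/* Rough sketch only: place the host entries table and each device image in
   their own ELF sections and define begin/end symbols around them so the
   runtime can locate them. */
SECTIONS
{
  .omp_offloading.entries :
  {
    .omp_offloading.entries_begin = .;
    *(.omp_offloading.entries)
    .omp_offloading.entries_end = .;
  }
  .omp_offloading.nvptx64-nvidia-cuda :
  {
    .omp_offloading.img_start.nvptx64-nvidia-cuda = .;
    /* bytes of the device image produced by the target toolchain */
    .omp_offloading.img_end.nvptx64-nvidia-cuda = .;
  }
}
INSERT BEFORE .data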

f) Minor refactoring of the existing code to enable reusing.

I've outlined some of the existing code into static functions so that it can be reused by the new offloading-related hooks.

Any comments/remarks are very welcome!

Thanks!
Samuel

Currently trying to test, but

  1. Offloading to the same target isn't supported (x86_64-unknown-linux-gnu as host and device) - this was working with clang-omp

The produced IR isn't showing any calls to the target library, and at link time it complains:

undefined reference to `.omp_offloading.img_start.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.img_end.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'

(btw: clang-offload-bundler saves the IR file to $TMP with -S -emit-llvm, this seems to be a bug - I had to use --save-temps)

  2. I can't seem to figure out the target triple for NVIDIA GPUs. It should be nvptx[64]-nvidia-cuda, which gives me
include/llvm/Option/Option.h:101: const llvm::opt::Option llvm::opt::Option::getAlias() const: Assertion `Info && "Must have a valid info!"' failed.

In clang-omp it was nvptxsm_35-nvidia-cuda but this is now invalid...

tra added a comment.Oct 7 2015, 4:50 PM

This diff refactors the original patch and is rebased on top of the latest offloading changes introduced for CUDA.

Here I don't touch the CUDA support. I tried, however, to make the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view, OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead, it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.

What do you mean by "does not refer to a given toolchain"? Do you have the toolchain patch available?

Creating a separate toolchain for CUDA was a crutch that was available to craft an appropriate cc1 command line for device-side compilation using an existing toolchain. It works, but it's a rather rigid arrangement. Creating an NVPTX toolchain which can be parameterized to produce CUDA or OpenMP would be an improvement.

Ideally toolchain tweaking should probably be done outside of the toolchain itself so that it can be used with any combination of {CUDA or OpenMP target tweaks}x{toolchains capable of generating target code}.

b ) The building of the driver actions is unchanged.

I don't create device specific actions. Instead only the bundling/unbundling are inserted as first or last action if the file type requires that.

Could you elaborate on that? The way I read it, the driver sees a linear chain of compilation steps plus bundling/unbundling at the beginning/end, and each action would result in multiple compiler invocations, presumably one per target.

If that's the case, then it may present a bit of a challenge in case one part of compilation depends on results of another. That's the case for CUDA where results of device-side compilation must be present for host-side compilation so we can generate additional code to initialize it at runtime.

c) Add offloading kind to ToolChain

Offloading does not require a new toolchain to be created. Existent toolchains are used and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.

This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.

Sounds good. I'd be happy to make the necessary changes so that CUDA support uses it.

d) Use Job results cache to easily use host results in device actions and vice-versa.

An array of the results for each job is kept so that the device job can use the result previously generated for the host and used it as input or vice-versa.

Nice. That's something that will be handy for CUDA and may help to avoid passing bits of info about other jobs explicitly throughout the driver.

The result cache can also be updated to keep the required information for the CUDA implementation to decide host/device binaries combining (injection is the term used in the code). I don't have a concrete proposal for that however, given that is not clear to me what are the plans for CUDA to support separate compilation, I understand that the CUDA binary is inserted directly in host IR (Art, can you shed some light on this?).

Currently CUDA depends on libcudart which assumes that GPU code and its initialization is done the way nvcc does it. Currently we do include PTX assembly (as in readable text) generated on device side into host-side IR *and* generate some host data structures and init code to register GPU binaries with libcudart. I haven't figured out a way to compile host/device sides of CUDA without a host-side compilation depending on device results.

Long-term we're considering implementing CUDA runtime support based on plain driver interface which would give us more control over where we keep GPU code and how we initialize it. Then we could simplify things and, for example, incorporate GPU code via linker script. Alas for the time being we're stuck with libcudart and sequential device and host compilation phases.

As for separate compilation -- compilation part is doable. It's using the results of such compilation that becomes tricky. CUDA's triple-bracket kernel launch syntax depends on libcudart and will not work, because we would not generate init code. You can still launch kernels manually using raw driver API, but it's quite a bit more convoluted.

--Artem

include/clang/Driver/Driver.h
208

re -> are

210

"If offloading is not supported" perhaps?

lib/Driver/Driver.cpp
2090

"has to be"

sfantao updated this revision to Diff 36816.Oct 7 2015, 5:40 PM

Make the offloading ELF sections consistent with what is in http://reviews.llvm.org/D12614.

Fix bug in the AtTopLevel flag, so that the bundling job is always considered a top-level job.

Fix several typos.

Art, Jonas,

Thanks for the comments!

Currently trying to test, but

  1. Offloading to the same target isn't supported (x86_64-unknown-linux-gnu as host and device) - this was working with clang-omp

The produced IR isn't showing any calls to the target library and on linkage it complains:

undefined reference to `.omp_offloading.img_start.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.img_end.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'

I assume you were trying this using the diff in http://reviews.llvm.org/D12614. There was an inconsistency in the names of the ELF sections and symbols defined by the linker script in these two patches. This is now fixed.

Note that if you are using the libomptarget library from clang-omp, you need to replace .openmptgt_host_entries with .omp_offloading.entries in the code. I changed the names so that all of them are consistent with what is already in place for other OpenMP directives.

I also changed the file generation so that different files are used even if the target and host have the same triple.

Please, let me know if it still does not work for you.

(btw: clang-offload-bundler saves the IR file to $TMP with -S -emit-llvm, this seems to be a bug - I had to use --save-temps)

Yes, the bundling job was not being marked as top level. It is now fixed!

  2. I can't seem to figure out the target triple for NVIDIA GPUs. It should be nvptx[64]-nvidia-cuda, which gives me
include/llvm/Option/Option.h:101: const llvm::opt::Option llvm::opt::Option::getAlias() const: Assertion `Info && "Must have a valid info!"' failed.

In clang-omp it was nvptxsm_35-nvidia-cuda but this is now invalid...

I didn't implement the triples logic for the nvptx targets yet. I'll port that from clang-omp once we have the basic functionality working upstream.

I'll address Art's comments in a separate message.

Thanks again,
Samuel

include/clang/Driver/Driver.h
208

Fixed!

210

Fixed!

lib/Driver/Driver.cpp
2090

Fixed!

In D9888#262325, @tra wrote:

This diff refactors the original patch and is rebased on top of the latest offloading changes introduced for CUDA.

Here I don't touch the CUDA support. I tried, however, to make the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view, OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead, it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.

What do you mean by "does not refer to a given toolchain"? Do you have the toolchain patch available?

I mean not having to create a toolchain for a specific offloading model. OpenMP offloading is meant for any target, and possibly many different targets simultaneously, so having a toolchain for each combination would be overwhelming.

I don't have a patch for the toolchain out for review yet. I'm planning to port what we have in clang-omp for the NVPTX toolchain once I have the host functionality in place. In there (https://github.com/clang-omp/clang_trunk/tree/master/lib/Driver) the driver is implemented in a different way; I think the version I'm proposing here is much cleaner. However, the ToolChains shouldn't be that different. All the tweaking is moved to the Tool itself, and I imagine I can drive that using the ToolChain offloading kind I'm proposing here. In https://github.com/clang-omp/clang_trunk/blob/master/lib/Driver/Tools.cpp I basically pick some arguments to forward to the tool and do some tricks to include libdevice in the compilation when required. Do you think something like that could also work for CUDA?

Creating a separate toolchain for CUDA was a crutch that was available to craft appropriate cc1 command line for device-side compilation using existing toolchain. It works, but it's rather rigid arrangement. Creating a NVPTX toolchain which can be parameterized to produce CUDA or OpenMP would be an improvement.

Ideally toolchain tweaking should probably be done outside of the toolchain itself so that it can be used with any combination of {CUDA or OpenMP target tweaks}x{toolchains capable of generating target code}.

I agree. I decided to move all the offloading tweaking to the tools, given that that is what the clang tool already does: it customizes the arguments based on the ToolChain that is passed to it.

b ) The building of the driver actions is unchanged.

I don't create device specific actions. Instead only the bundling/unbundling are inserted as first or last action if the file type requires that.

Could you elaborate on that? The way I read it, the driver sees linear chain of compilation steps plus bundling/unbundling at the beginning/end and that each action would result in multiple compiler invocations, presumably per target.

If that's the case, then it may present a bit of a challenge in case one part of compilation depends on results of another. That's the case for CUDA where results of device-side compilation must be present for host-side compilation so we can generate additional code to initialize it at runtime.

That's right. I try to tackle the challenge of passing host/device results to device/host jobs by using a cache of results, as I described in d). The goal here is to add the flexibility required to accommodate different offloading models. In OpenMP we use host compile results in device compile jobs, and device link results in host link jobs, whereas in CUDA the assemble result is used in the compile job. I believe that cache can include whatever information is required to suit all needs.

c) Add offloading kind to ToolChain

Offloading does not require a new toolchain to be created. Existent toolchains are used and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.

This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.

Sounds good. I'd be happy to make the necessary changes so that CUDA support uses it.

Great! Thanks.

d) Use Job results cache to easily use host results in device actions and vice-versa.

An array of the results for each job is kept so that the device job can use the result previously generated for the host and used it as input or vice-versa.

Nice. That's something that will be handy for CUDA and may help to avoid passing bits of info about other jobs explicitly throughout the driver.

The result cache can also be updated to keep the required information for the CUDA implementation to decide host/device binaries combining (injection is the term used in the code). I don't have a concrete proposal for that however, given that is not clear to me what are the plans for CUDA to support separate compilation, I understand that the CUDA binary is inserted directly in host IR (Art, can you shed some light on this?).

Currently CUDA depends on libcudart which assumes that GPU code and its initialization is done the way nvcc does it. Currently we do include PTX assembly (as in readable text) generated on device side into host-side IR *and* generate some host data structures and init code to register GPU binaries with libcudart. I haven't figured out a way to compile host/device sides of CUDA without a host-side compilation depending on device results.

Long-term we're considering implementing CUDA runtime support based on plain driver interface which would give us more control over where we keep GPU code and how we initialize it. Then we could simplify things and, for example, incorporate GPU code via linker script. Alas for the time being we're stuck with libcudart and sequential device and host compilation phases.

As for separate compilation -- compilation part is doable. It's using the results of such compilation that becomes tricky. CUDA's triple-bracket kernel launch syntax depends on libcudart and will not work, because we would not generate init code. You can still launch kernels manually using raw driver API, but it's quite a bit more convoluted.

Ok, I see. I am not aware of what exactly libcudart does, but I can elaborate on what the OpenMP offloading implementation we have in place does:

We have a descriptor that is registered with the runtime library (we generate a function for that which is called before any global initializers are executed). This descriptor has (among other things) fields that are initialized with the symbols defined by the linker script (so that the runtime library can immediately get the CUDA module) and also the names of the kernels (in OpenMP we don't have user-defined names for these kernels, so we generate some mangling to make sure they are unique). When launching a kernel, the runtime gets a pointer from which it can easily retrieve the name, and the CUDA driver API is used to get the CUDA function to be launched. We have been successfully generating a CUDA module that works well with separate compilation using ptxas and nvlink.
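A rough sketch of that registration flow is below; all type and function names are placeholders for illustration (the actual descriptor layout and runtime entry point are defined by the runtime library proposal, not by this patch):

/* Placeholder names only; see the runtime library proposal for the real API. */
struct __tgt_offload_entry; /* the entry type sketched earlier in this thread */

struct __tgt_device_image {
  void *ImageStart; /* initialized from the linker-script-defined symbols */
  void *ImageEnd;
  struct __tgt_offload_entry *EntriesBegin; /* device-side entries table */
  struct __tgt_offload_entry *EntriesEnd;
};

struct __tgt_bin_desc {
  int NumDeviceImages;
  struct __tgt_device_image *DeviceImages;
  struct __tgt_offload_entry *HostEntriesBegin;
  struct __tgt_offload_entry *HostEntriesEnd;
};

void __tgt_register_lib(struct __tgt_bin_desc *Desc); /* runtime entry point */

static struct __tgt_bin_desc Descriptor; /* fields filled in by the compiler */

/* The compiler emits a function along these lines and arranges for it to run
   before any global initializers are executed. */
__attribute__((constructor)) static void omp_offloading_register_lib(void) {
  __tgt_register_lib(&Descriptor);
}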

Part of my work is also to port the runtime library in clang-omp to the LLVM OpenMP project. I see CUDA as a simplified version of what OpenMP does, given that the user controls the data mappings explicitly, so I am sure we can find some synergies in the runtime library too, and you may be able to use something that we already have in there.

Thanks!
Samuel

[...]

I assume you were trying this using the diff in http://reviews.llvm.org/D12614. There was an inconsistency in the names of the ELF sections and symbols defined by the linker script in these two patches. This is now fixed.

Note that if you are using the libomptarget library from clang-omp, you need to replace in the code .openmptgt_host_entries by .omp_offloading.entries. I changed the names so that all of them are consistent with what is already in place for other OpenMP directives.

I also changed the files generation so that different files are used even if target and host have the same triple.

Please, let me know if it still does not work for you.

Thanks for your help, a small test program now seems to work!

[...]

I didn't implement the triples logic for the nvptx targets yet. I'll port that from clang-omp once we have the basic functionality working upstream.

Ok, I'll wait then. Thanks for your work and for finally upstreaming this!
Jonas

Are there any more comments/suggestions about this patch?

Thanks!
Samuel

sfantao updated this revision to Diff 37903.Oct 20 2015, 11:59 AM

Move clang-offload-bundler to a separate review: http://reviews.llvm.org/D13909.

This patch depends on http://reviews.llvm.org/D13909.

sfantao updated this revision to Diff 39594.Nov 6 2015, 2:35 PM

Rebase.

This comment was removed by sfantao.

Will this receive a final review and get merged at some point?

@rsmith could you possibly take a look at this one? It has been around for roughly 8 months now and hasn't received much feedback.

jprice added a subscriber: jprice.Mar 9 2016, 9:03 AM
rsmith added a reviewer: tra.Mar 18 2016, 9:45 AM
rsmith edited edge metadata.Mar 18 2016, 10:00 AM

@echristo, you asked for time to review this; if you still want to, please can you do so?
@tra, it looks like you're happy with this design (and with moving the CUDA offloading support in this direction), please let us know if not!

include/clang/Driver/Options.td
1579–1580

This is an unfortunate flag name; -oblah already means something. Is this name chosen for compatibility with some other system, or could we change it to, say, -fopenmp-targets=?

lib/Driver/Tools.cpp
286

s -> is

Hi Richard,

Thanks for your review. I partitioned some of the stuff I am proposing here into smaller patches:

http://reviews.llvm.org/D18170
http://reviews.llvm.org/D18171
http://reviews.llvm.org/D18172

These patches already try to incorporate the feedback I got in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html related to the generation of actions.

Thanks again,
Samuel

include/clang/Driver/Options.td
1579–1580

You are right, we are now using -fomptargets in codegen exactly because of that. I can change it to -fopenmp-targets=; we don't have any compatibility issues at this point.

lib/Driver/Tools.cpp
286

I'll fix it.

mkuron added a subscriber: mkuron.Mar 19 2016, 1:14 AM

The three smaller patches into which you divided this one appear to be missing some things. For example, AddOpenMPLinkerScript in lib/Driver/Tools.cpp from this patch appears to still be necessary to get the desired functionality, but it is not present in any of the three.

Hi Michael,

The three smaller patches into which you divided this one appear to be missing some things. For example, AddOpenMPLinkerScript in lib/Driver/Tools.cpp from this patch appears to still be necessary to get the desired functionality, but it is not present in any of the three.

Those three patches do not add any OpenMP-specific code yet, so they do not cover the whole implementation I have here. I am doing things in a slightly different way in the new patches given the feedback I got on the mailing list, and I am waiting for review to see if the approach I have there is acceptable. If so, I'll continue with the OpenMP-related patches afterwards.

Thanks,
Samuel

First I'd like to note that the code quality here is really high; most of my comments are about higher-level design decisions for the driver and the implementation here rather than the code itself.

One meta comment: offload appears to be something that could be used for CUDA and OpenMP (and OpenACC etc) as a term. I think we should either merge these concepts or pick a different name :)

Thanks for all of your work and patience here! The rest of the comments are inline.

-eric

include/clang/Driver/Driver.h
192–195

Example?

198–199

Any reason?

408–417

This function is starting to get a little silly. Perhaps we should look into refactoring such that this doesn't need to be "the one function that rules them all". Perhaps a different ownership model for the things that are arguments here?

lib/Driver/Compilation.cpp
65–66

Hmm?

lib/Driver/Driver.cpp
216–217

This can probably be done separately? Can you split this out and make it generally useful?

2001–2017

Might be time to make some specialized versions of this function. This may take it from "ridiculously confusing" to "code no one should ever look at" :)

lib/Driver/Tools.cpp
5845

Should we get the offload bundler in first so that the interface is there and testable? (Honest question, no particular opinion here). Though the command lines there will affect how this code is written.

test/OpenMP/target_driver.c
42–48

Do we really think the phases should be a DAG check?

55

How do you pass options to individual omptargets? e.g. -mvsx or -mavx2?

sfantao marked 8 inline comments as done.Apr 6 2016, 6:53 PM

Hi Eric,

Thanks for the review!

As you are probably aware, I started partitioning this patch following your initial concern about its size and the feedback I got from http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html. I am keeping this patch open as it shows the big picture of what I am trying to accomplish, so if you prefer to add other higher-level suggestions here, that's perfectly fine. Let me know if there is a better way to link patches.

So, I am incorporating your suggestions here into the partitioned patches as specified in the inline comments. The partitioned patches are http://reviews.llvm.org/D18170, http://reviews.llvm.org/D18171 and http://reviews.llvm.org/D18172.

One meta comment: offload appears to be something that could be used for CUDA and OpenMP (and OpenACC etc) as a term. I think we should either merge these concepts or pick a different name :)

Yes, I agree. I am now using "offloading". I only refer to the programming model name if the code relates to something specific to that programming model.

Thanks again,
Samuel

include/clang/Driver/Driver.h
192–195

I got rid of this extra toolchain cache and I am organizing it in a multimap by offload kind as Art suggested in http://reviews.llvm.org/D18170. That avoids the multiple containers for the offloading toolchains (this one and the ordered one).

198–199

Currently in OpenMP, any directive that relates to offloading supports a device() clause that basically specifies which device to use for that region or data transfer. E.g.

void foo() {
 ...
}

void bar(int i) {
  #pragma omp target device(i)
   foo();
}

... here foo is going to be executed on device i. The problem is that the device is an integer - it does not tell which device type it is - so it is up to the implementation to decide how i is interpreted. So, if we have a system with two GPUs and two DSP devices, we may bind 0-1 to the GPUs and 2-3 to the DSPs.

My goal with preserving the order of the toolchains was to allow codegen to leverage that information and make a better decision on how to bind devices to integers. Maybe, if the user requests the GPU toolchain first, he is interested in prioritizing its use, so the first IDs would map to GPUs. To make a long story short, this is only about preserving information so that codegen can use it.

In any case, this is going to change in the future as the OpenMP language committee is working on having a device identifier to use instead of an integer. So, if you prefer to remove the "ordered" from the name, I am not opposed to that.

408–417

This has changed a little in recent CUDA work; in the version http://reviews.llvm.org/D18171 is based on, the result is returned instead of being passed by reference, and we have a string/action-result map. I'll eventually have to add the offloading kind to that string, but in the partitioned patches I didn't touch that yet.

Do you suggest having that cache owned by the driver instead of passing it along?

lib/Driver/Compilation.cpp
65–66

This relates to some extent to your other question: how do we pass device-specific options?

So, right now we are relying on the host options to derive device-specific options. This hook was meant to tune the host options so that things that do not make sense on the device are filtered out. Also, the resulting device image is usually a shared library so that it can be easily loaded, so this hook is also used to specify the options that result in a shared library, even if the host options don't ask for one.

Can you think of a better way to abstract this?

lib/Driver/Driver.cpp
216–217

Given the feedback I got in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, I ended up moving most of the functionality that I had in job creation to the creation of actions. Having an action graph that shows the offloading specifics was a desired feature. As a result, what gets more complex is the dump of the actions.

In http://reviews.llvm.org/D18171 I have an example on how that dump looks like. That patch also proposes a unified offloading action that should be reused by the different offloading programming models. Does this address your concern?

2001–2017

I agree. This function is really messy... :S

In http://reviews.llvm.org/D18171 I am proposing collapseOffloadingAction, which drives the collapsing of offload actions and abstracts some of the complexity in selectToolForJob. Do you think that goes in the right direction, or do you think I should do something else?

lib/Driver/Tools.cpp
5845

Yes, sure, I proposed an implementation of the bundler, using a generic format in http://reviews.llvm.org/D13909. Let me know any comments you have about that specific component.

I still need to add testing specific to http://reviews.llvm.org/D13909, which I haven't done yet because I didn't know where it is supposed to live - maybe in the Driver? Do you have an opinion about that?

Also, in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, the general opinion was that the bundler should use the host object format to bundle whenever possible. So I also have to add a default behavior for the binary bundler when the input is an object file. For the other input types, I don't think there were any strong opinions. Do you happen to have one?

In any case, I was planning to add the object-file-specific bundling in a separate patch, which seems to me a natural way to partition the bundler functionality. Does that sound like a good plan?

test/OpenMP/target_driver.c
42–48

Using a DAG seemed to me a robust way to test that. I'd have to double-check, but several map containers are used for the inputs and actions, so the order may depend on the implementation of the container. I was just trying to test in a safe way.

Do you prefer to change this to the exact sequence I am getting?

55

Well, currently I don't. In http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html I was proposing something to tackle that, but the opinion was that it was somewhat secondary and the driver design should be settled first.

What I was proposing was some sort of group option associated with the device triple. The idea was to avoid a proliferation of device-specific options and reuse what we already have, just organized in groups so that they could be forwarded to the right toolchain. The goal was to make things like this possible:

clang -mcpu=pwr8 -target-offload=nvptx64-nvidia-cuda -fopenmp -mcpu=sm_35 -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu=sm_32 a.c

... where -mcpu is used to specify the CPU/GPU for the different toolchains and programming models. This would also be useful for specifying include and library paths that only make sense for the device.

Do you have any opinion about that?

I think these changes have been contributed to trunk in multiple commits, so this can be closed?

sfantao abandoned this revision.Oct 28 2016, 3:53 AM
sfantao marked 8 inline comments as done.

Hi Jonas,

I think these changes have been contributed to trunk in multiple commits so this can be closed?

You're right, this can be closed now.

Thanks!
Samuel