Page MenuHomePhabricator

[OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.
ClosedPublic

Authored by grokos on Nov 2 2015, 12:06 PM.

Details

Summary

This patch is a partition of the original patch posted in http://reviews.llvm.org/D14031.

This patch implements the device runtime library whose interface is used in the code generation for OpenMP offloading devices. Currently there is a single device RTL written in CUDA meant to CUDA enabled GPUs. The interface is a variation of the kmpc interface that includes some extra calls to do thread and storage management that only make sense for a GPU target.

Depends on http://reviews.llvm.org/D14031.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
grokos updated this revision to Diff 126487.Dec 11 2017, 5:35 PM
grokos marked 12 inline comments as done.
grokos added inline comments.
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
116–122 ↗(On Diff #126036)

I left it there. The logic now is as follows:

  1. If the user has specified a compiler via -DLIBOMPTARGET_NVPTX_CUDA_COMPILER then we'll use that.
  2. If not, then we check whether CMAKE_C_COMPILER is clang; if yes, then we'll use that.
  3. Otherwise, we try to search for a clang compiler. If we don't find any, then we don't build bclib.
124–139 ↗(On Diff #126036)
  1. Changed the linker-setting logic to the same as the compiler-setting logic above.
  2. Clang can be used to link but with different command line options. For consistency I am sticking to llvm-link. My rationale is that if we are building libomptarget with clang then llvm-link must also be around, so if CMAKE_C_COMPILER_ID equals "Clang", then use llvm-link.
libomptarget/deviceRTLs/nvptx/src/libcall.cu
401–402 ↗(On Diff #126036)

It explains how compare-and-swap works.

guansong added inline comments.Dec 11 2017, 9:33 PM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
20 ↗(On Diff #126487)

if you move the include "state-queue.h" line in this cu and omp_data.cu into omptarget-nvptx.h, in my local build, right after conter_group.h

you will be able to common the global data tables in one location, in omptarget-nvptx.h, instead of having two copies.

grokos updated this revision to Diff 126555.Dec 12 2017, 8:17 AM
grokos marked an inline comment as done.
grokos added inline comments.
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
20 ↗(On Diff #126487)

Right, I moved it to omptarget-nvptx.h in the new diff. Thanks!

gtbercea accepted this revision.Dec 20 2017, 2:01 AM
Hahnfeld requested changes to this revision.Dec 20 2017, 2:13 AM

Good to go @Hahnfeld ?

No, I didn't have time to look at the latest changes due to travel. I've already added some commets to the build system that I still think is not optimal yet.

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
116–122 ↗(On Diff #126036)

I still don't like searching a compiler in the PATH that is different from CMAKE_C_COMPILER. That is not what the user requested and we should report that. Please just error out.

124–139 ↗(On Diff #126036)

I disagree: CMAKE_C_COMPILER doesn't need to be in PATH so we might not be able to find llvm-link or end up with a different installation. And using clang for linking would be consistent with the usual behavior (build system invokes compiler with object files and the compiler driver invokes the linker). Conclusion: It is better and possible to use clang for linking and we should do it.

This revision now requires changes to proceed.Dec 20 2017, 2:13 AM

Thanks @Hahnfeld just making sure this goes ahead! :)

grokos added inline comments.Dec 20 2017, 11:58 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
124–139 ↗(On Diff #126036)

After looking into this, from what I could find clang cannot link bitcode files into a single bitcode file - this can only be done with llvm-link. What clang can do is link multiple .bc files into a target-specific object file (which is not what we want here) and for this it needs libLTO and the LLVM gold plugin (which are not necessarily installed on every system and asking the user to provide paths makes things more complex). If anyone knows of a simple way to make clang link multiple .bc files into a single bitcode library, please let me know. Otherwise, I'm afraid we'll have to stick to llvm-link.

hfinkel added inline comments.Dec 20 2017, 12:50 PM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
124–139 ↗(On Diff #126036)

I don't believe that Clang itself has any configuration in which it will do what llvm-link does. I'd be in favor of such a mode, and maybe this is a good motivating use case, but others might disagree (i.e., if you're working with bitcode, then we might assume that you have the LLVM utilities around).

We should probably search for llvm-link first in the directory containing clang (as it, indeed, might not be in the user's path). We would need to document, moreover, that LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER need to be set to llvm-link from the right installation if there's not one in Clang's bin directory and the one in the path is not right.

Hahnfeld added inline comments.Dec 20 2017, 12:59 PM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
124–139 ↗(On Diff #126036)

@grokos Looks like you are right. I agree with @hfinkel, I would have also proposed searching in the directory of the clang binary. I think we might even omit searching in PATH and just defer to the user so we can avoid any trouble with incompatible tools. Thoughts?

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.
  2. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.
libomptarget/deviceRTLs/nvptx/src/counter_groupi.h
14 ↗(On Diff #126555)

Does this file need to include cuda_runtime.h?

libomptarget/deviceRTLs/nvptx/src/interface.h
360–361 ↗(On Diff #126555)

Will the host runtime use the same entry point? Because it's not yet present in libomp

364–368 ↗(On Diff #126555)

I think this went away in previous update of this patch.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
74 ↗(On Diff #121993)

I don't see this implemented?

libomptarget/deviceRTLs/nvptx/src/option.h
20–43 ↗(On Diff #126555)

Only MAX_THREADS_PER_TEAM and WARPSIZE are used, the rest can go away.

69–76 ↗(On Diff #126555)

Not needed, remove.

libomptarget/deviceRTLs/nvptx/src/parallel.cu
458 ↗(On Diff #126555)

Typo: Do nothing. Start the new sentence with a capital letter and and with a full stop.

libomptarget/deviceRTLs/nvptx/src/reduction.cu
138–155 ↗(On Diff #126555)

This isn't used anymore, please remove.

172–174 ↗(On Diff #126555)
  1. This has a lot of commented code.
  2. Please don't duplicate the function header, this will avoid mismatches in the future.
313–315 ↗(On Diff #126555)

I think this header shouldn't be duplicated either.

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

  1. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

Please see my discussion with Samuel: The code means to do a modulo and the bit operation is an optimization that the compiler can do.

  1. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

You are right, DS_Max_Warp_Number means something else and only accidentally has the same value.

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

Please see my discussion with Samuel: The code means to do a modulo and the bit operation is an optimization that the compiler can do.

Good point. Will this optimization be applied by the compiler regardless of kernel size, placement in the code etc?

  1. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

You are right, DS_Max_Warp_Number means something else and only accidentally has the same value.

guansong added inline comments.Jan 12 2018, 9:37 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
158 ↗(On Diff #126555)

For cuda bc files, a CUDA install will have bc files for different arches, such as

/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_35.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_30.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_50.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_20.10.bc

Should we consider to build different bc files for the end user?

gtbercea added inline comments.Jan 12 2018, 9:49 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
158 ↗(On Diff #126555)

What do you mean by that?

Does this patch do what you mean: https://reviews.llvm.org/D41724 ?

tra added a subscriber: tra.Jan 12 2018, 9:52 AM
tra added inline comments.
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
158 ↗(On Diff #126555)

That's no longer true for CUDA-9 -- it has a single bitcode file for all architectures.

Is OMP runtime for GPU going to provide anything that a) needs to have the same API for all GPUs and, b) has to be heavily GPU-specific under the hood? If not, then one common library would probably suffice.

Oh, and you need to add documentation about new CMake flags to the README.rst. You can probably get most of that from an old revision of D40920...

grokos updated this revision to Diff 131110.Jan 23 2018, 11:27 AM
grokos marked 10 inline comments as done.

Made corrections according to the feedback I got. Regarding the general comments:

  1. I found only one more instance of threadID & (WARPSIZE-1) and replaced it with threadID % WARPSIZE. There are a few more bitwise operations in the code of the form threadID & ~(WARPSIZE-1) (note the bitwise NOT) which do something else - they round threadID down to the nearest multiple of WARPSIZE. Doing this with arithmetic operators would require something like threadID - threadID%WARPSIZE which I doubt the compiler could optimize (not to mention that the bitwise version is actually more easily recognizable as a round-to-a-multiple-of-2^x operation than the arithmetic version).
  2. I replaced all references to the built-in warpSize with the macro WARPSIZE. warpSize would have the advantage that it makes the code forward-compatible (if the warp size ever changes, the change will be reflected on this variable); however, it is not a compile-time constant and that would prevent the compiler from making certain optimizations (especially since the warp size is used in modulo operations for which we have assumed that the compiler can transform the modulo operation into bitwise operations). DS_Max_Worker_Warp_Size was also replaced with WARPSIZE.
libomptarget/deviceRTLs/nvptx/src/interface.h
360–361 ↗(On Diff #126555)

Yes, I will add the same function to the host runtime. It's not there yet, but it's on my list.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
74 ↗(On Diff #121993)

I removed DS_Max_Worker_Warp_Size_Bit_Mask completely as it is not needed anymore (we now use x % WARPSIZE).

libomptarget/deviceRTLs/nvptx/src/option.h
20–43 ↗(On Diff #126555)

THREAD_ABSOLUTE_LIMIT is also used (in the definition of MAX_THREADS_PER_TEAM).

Regarding the bitcode compiler/linker, we agreed on skipping looking for a suitable compiler in PATH, so for consistency I also skip looking in PATH for a linker. CMake only tries to locate llvm-link in the same directory as the clang binary (if clang is the CMAKE_C_COMPILER) or uses whatever the user has specified. Following @hfinkel 's suggestions, I updated the documentation to describe this behavior.

I think this finally looks reasonably good to land: I'm going to accept this once the last few minor comments are addressed. So if anyone has concerns, please raise them now!

@grokos Can you please upload a new version with changes for my inline comments? This will make it easier to spot errors in the last changes. Afterwards, I think the code should be formatted with clang-format. Unfortunately, it doesn't pick up .cu files by default but I think we might be able to tweak this with a configuration file?

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
96 ↗(On Diff #131110)

clang in the PATH isn't true anymore.

108–109 ↗(On Diff #131110)

We cannot satisfy the user's request, so should probably be libomptarget_error_say?

121–122 ↗(On Diff #131110)

libomptarget_error_say as well

libomptarget/deviceRTLs/nvptx/src/interface.h
360–361 ↗(On Diff #126555)

Then you need to agree on the interface with Intel, just saying...

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
45–50 ↗(On Diff #131110)

Not used?

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
72 ↗(On Diff #131110)

I think all other usages of >> DS_Max_Worker_Warp_Size_Bits can be replaced with % WARPSIZE or am I missing something in here?

libomptarget/deviceRTLs/nvptx/src/option.h
20–43 ↗(On Diff #126555)

Yeah, but in the same file and only once. So you can just do #define MAX_THREADS_PER_TEAM 1024 ;-)

libomptarget/deviceRTLs/nvptx/src/reduction.cu
238 ↗(On Diff #131110)

I think this endif needs to go before the last closing } to also finish the function in the if case.

386–388 ↗(On Diff #131110)

Here too, I think endif needs to go before the return ThreadId == 0.

grokos marked 9 inline comments as done.

I created a child revision with the final changes.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
45–50 ↗(On Diff #131110)

Probably a leftover, removed.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
72 ↗(On Diff #131110)

You mean div WARPSIZE, not mod... I assume you prefer the division over shifting... I'll change those instances in the new diff. WARPSIZE is a compile-time constant and a power of 2, so the compiler should be able optimize the division away.

libomptarget/deviceRTLs/nvptx/src/reduction.cu
238 ↗(On Diff #131110)

Good catch, I forgot to move the #endif as well....

I created a child revision with the final changes.

Nah, please upload it here to keep it in one place. I just meant to not mix up the last fixes and the formatting. So:

  1. Upload a new (full) diff with the last changes.
  2. Upload (yet another) diff which addresses the formatting.
grokos updated this revision to Diff 131617.Jan 26 2018, 10:24 AM
grokos marked 3 inline comments as done.
grokos updated this revision to Diff 131619.Jan 26 2018, 10:39 AM

Fixed formatting with clang-format.

Hahnfeld added inline comments.Jan 26 2018, 10:39 AM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
72 ↗(On Diff #131110)

Yeah, sure! To be precise, I want to avoid any mismatch between WARPSIZE and 2 ** DS_Max_Worker_Warp_Size_Bits. You should now be able to delete DS_Max_Worker_Warp_Size_Bits, its last usages are gone ;-)

grokos updated this revision to Diff 131621.Jan 26 2018, 10:43 AM
grokos marked an inline comment as done.

Removed obsolete enum constant.

Hahnfeld accepted this revision.Jan 28 2018, 2:17 AM

LGTM

FYI: I just committed rC323615 so that git-clang-format will pick up changes to CUDA files in the future. Until then we need to do this manually with clang-format...

This revision is now accepted and ready to land.Jan 28 2018, 2:17 AM
omalyshe added inline comments.
README.rst
289 ↗(On Diff #131110)

Should it be "the NVPTX device RTL" like in the previous item?

grokos marked an inline comment as done.Jan 29 2018, 5:58 AM
grokos added inline comments.
README.rst
289 ↗(On Diff #131110)

Thanks for pointing it out, I've changed it!

This revision was automatically updated to reflect the committed changes.
grokos marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: jfb. · View Herald Transcript