This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.
ClosedPublic

Authored by grokos on Nov 2 2015, 12:06 PM.

Details

Summary

This patch is a partition of the original patch posted in http://reviews.llvm.org/D14031.

This patch implements the device runtime library whose interface is used in the code generation for OpenMP offloading devices. Currently there is a single device RTL, written in CUDA, meant for CUDA-enabled GPUs. The interface is a variation of the kmpc interface that includes some extra calls for thread and storage management which only make sense for a GPU target.

Depends on http://reviews.llvm.org/D14031.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
Hahnfeld added inline comments.Nov 8 2017, 9:18 AM
libomptarget/deviceRTLs/nvptx/src/counter_group.h
15–16

Can we also have this prefixed with _OMPTARGET_NVPTX like the other header files?

18–21

Needed?

This file should probably include option.h which defines Counter

libomptarget/deviceRTLs/nvptx/src/critical.cu
15–16

needed?

libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
28–31 ↗(On Diff #121993)

Remove if not needed

34–35 ↗(On Diff #121993)

Suppose we have 1024 threads, why is 1023 & (~31) = 992 the master thread?

40 ↗(On Diff #121993)

DS_Max_Worker_Warp_Size instead of its hard-coded value? Or is 32 the bit width of an integer? (Another reason to avoid magic numbers!)

48–51 ↗(On Diff #121993)

Remove if not needed

132 ↗(On Diff #121993)

Can we have this in a function with a corresponding name?

315–318 ↗(On Diff #121993)

Delete

libomptarget/deviceRTLs/nvptx/src/libcall.cu
18

Yes, it doesn't sound like a good idea to have that hard-coded...

libomptarget/deviceRTLs/nvptx/src/omp_data.cu
1 ↗(On Diff #121993)

omp_data.cu

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
71

1 << DS_Max_Worker_Warp_Size_Log2?

74

Bits instead of Log2? I think that's far more common...

75

First, this value needs a comment explaining what it is used for, and maybe we can drop the Log2 if it's a mask related to the value, not the logarithm.

From what I understand, this value has the 5 LSBs set to 1, which should be equivalent to DS_Max_Worker_Warp_Size - 1? (This also avoids relying on 32-bit integers.)

I talked to @ABataev and Clang never generates calls to __kmpc_atomic_ functions. I've never heard about __array_atomic but I guess they aren't used either as array reductions are performed element-wise.

libomptarget/deviceRTLs/nvptx/src/interface.h
463

You can remove the next ~820 lines.

libomptarget/deviceRTLs/nvptx/src/reduction.cu
21–33

Remove

97–116

It's still there...

111

I think you can remove the next ~730 lines until __kmpc_shuffle_int32

663–666

binop is a template argument so the compiler will do that for you. Anyway, I think these functions can be removed altogether now!

952–994

These macros are never used.

libomptarget/deviceRTLs/nvptx/src/stdio.cu
19

I think this function isn't used anymore?

Hahnfeld added inline comments.Nov 9 2017, 11:56 AM
libomptarget/deviceRTLs/nvptx/src/debug.cu
22–23

This is used only once in libcall.cu. Please inline and remove this file entirely

34–66

Never generated by current compilers, remove

libomptarget/deviceRTLs/nvptx/src/libcall.cu
394–395

I think the comment is right here: This looks like it is setting the lock if it is unset!

402

Can you check that these are still used? Especially the debug functions?

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
101

This won't compile with current Clang 5.0.0:

error: __shared__ variable 'DataSharingState' cannot be 'extern'
grokos marked 26 inline comments as done.Nov 21 2017, 9:46 AM

I've responded to the majority of comments. We are now waiting for some other people to reply to questions related to code they wrote. I'll update the diff once all questions have been answered.

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
64

OK, I changed the default to sm_30 and added a comment that this is what clang uses by default as well.

However, the default may change to sm_35 due to the implementation of __kmpc_reduce_conditional_lastprivate which involves a call to atomicMax on 64-bit integers, which in turn requires a minimum of sm_35. Let's keep this issue open until we decide how to proceed.
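
As a side note, a minimal sketch of the constraint (the helper name is illustrative, not the patch's code): the 64-bit atomicMax overload only exists on compute capability 3.5 and higher, which is what pushes the minimum above sm_30.

__device__ void updateLastIterationSketch(unsigned long long *LastIter,
                                          unsigned long long Iter) {
  // the unsigned long long overload of atomicMax requires sm_35 or newer
  atomicMax(LastIter, Iter);
}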

74–77

I think we should keep this one. I added the LIBOMPTARGET_NVPTX_DEBUG flag to the list of NVPTX CMake options in Build_With_Cmake.txt.

121

I'm not sure I fully understand the point here. Can you elaborate?

180

I don't know why this directory was included, but it's not needed - I've removed it.

libomptarget/deviceRTLs/nvptx/src/counter_group.h
18–21

Right, cuda.h is not needed. I've included option.h

libomptarget/deviceRTLs/nvptx/src/critical.cu
15–16

Probably a leftover, removed.

libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
34–35 ↗(On Diff #121993)

The master warp is the last one and the master thread is the first thread of that warp.
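
As a quick illustration of the 1023 & (~31) = 992 arithmetic above (the helper name is only for this sketch, not the actual code):

// With 1024 threads and warp size 32: (1024 - 1) & ~31 == 992, i.e. lane 0 of
// the last warp, which is the master thread.
__device__ static unsigned getMasterThreadIdSketch(unsigned NumThreads) {
  return (NumThreads - 1) & ~(DS_Max_Worker_Warp_Size - 1);
}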

132 ↗(On Diff #121993)

It's a single line which calls a built-in. Too simple for a dedicated function. Anyway, what do you propose?

libomptarget/deviceRTLs/nvptx/src/debug.cu
22–23

I removed the entire debug.cu, it was obsolete.

libomptarget/deviceRTLs/nvptx/src/libcall.cu
18

Getting the clock frequency in device code cannot be done. We can only query it on the host.

I tried having a device global variable TICK and setting it via a call to cuModuleGetGlobal(..., "TICK") from the CUDA plugin (the plugin can query the frequency via cudaGetDeviceProperties). This solution did not work because libomptarget-nvptx.a is a static library, so the clock frequency would have to be set at compile time. We cannot use dynamic libraries (yet?) because the CUDA toolchain does not support dynamic linking.

Eventually I implemented omp_get_wtime() using the %globaltimer register. That's the only viable option. If the register gets removed in the future, there's nothing we can do.
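
For reference, a minimal sketch of that approach, assuming only the %globaltimer PTX special register (which counts nanoseconds); the actual code in the patch may differ:

__device__ static double getWallTimeSketch() {
  unsigned long long NanoSecs;
  asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(NanoSecs)); // read nanosecond counter
  return (double)NanoSecs * 1e-9;                              // convert to seconds
}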

omp_get_wtick() is probably not needed. No one will ever query the time between clock ticks from within device code... I left the function there so that the linker can find it but it prints a message that this functionality is not implemented.

I leave this issue open for further discussion anyway.

394–395

This is what omp_test_lock should do - I've removed the comment.

402

I checked with the Fortran people and those functions are still used, so we'll keep them.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

I've changed its definition to the proposed DS_Max_Worker_Warp_Size - 1 which is much clearer. This mask is used to get (threadID mod Warp_Size). I've added a comment.
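
Illustrative sketch of the equivalence (names are only for this example): both forms compute a thread's lane ID within its warp because DS_Max_Worker_Warp_Size (32) is a power of two.

__device__ static unsigned getLaneIdViaMask(unsigned ThreadId) {
  return ThreadId & (DS_Max_Worker_Warp_Size - 1); // e.g. 37 & 31 == 5
}
__device__ static unsigned getLaneIdViaModulo(unsigned ThreadId) {
  return ThreadId % DS_Max_Worker_Warp_Size;       // e.g. 37 % 32 == 5
}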

101

This looks like a clang problem. Shared variables can be extern. This code can be compiled with clang 4.0, maybe we should submit a bug report for clang 5.0?

libomptarget/deviceRTLs/nvptx/src/reduction.cu
952–994

Removed.

libomptarget/deviceRTLs/nvptx/src/stdio.cu
19

Correct, I removed the entire file.

arpith-jacob added inline comments.Nov 21 2017, 9:57 AM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
101

Clang doesn't support this attribute and it requires additional investigation. For now, I would disable building the bclib version of libomptarget-nvptx so that the runtime library is built using nvcc.

Hahnfeld added inline comments.Nov 21 2017, 10:33 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
64

As I've said offline, we then have to change the default now. I suppose it won't compile otherwise?

121

Currently, this will try to use the just-built compiler when building in-tree with LLVM and Clang. It does so by adding a dependence on the clang target, which means that the files are recompiled whenever the compiler changes, i.e. the clang target has a newer modification timestamp.

IMO we shouldn't do this but always build with the CMAKE_C_COMPILER that all the other code uses. If that's a (recent) Clang, we can build the bclib. If not and it's, for example, GCC, we should just deactivate the bclib. Note that no other (runtime) library (neither libomp nor the host libomptarget nor other libraries such as libc++ or compiler-rt) uses the just-built compiler for building.

libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
132 ↗(On Diff #121993)

There are other single-line functions, and it is not obvious to me how __BALLOT_SYNC(0xFFFFFFFF, true) returns CurActiveThreads - btw, is this a number or a mask?

libomptarget/deviceRTLs/nvptx/src/libcall.cu
18

That's not a justification, I really doubt that anyone will call omp_get_wtime() either.

Why not just return 1 nanosecond (the resolution of %globaltimer) for omp_get_wtick as I proposed?

402

I'm especially concerned about the debug functions as we removed most of them

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
101

This was explicitly introduced in https://reviews.llvm.org/D25125. The documentation suggests that the error is correct:

__device__, __shared__, and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.

(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#qualifiers)

arpith-jacob added inline comments.Nov 21 2017, 10:55 AM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
101

That restriction only applies in whole program compilation mode, not separate compilation mode. We need to teach Clang/llvm to support separate compilation mode.

When compiling in the whole program compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.

When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable).
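
For illustration, a hedged sketch of what separate compilation mode permits (the type name below is assumed; the real declarations live in the runtime sources). Building this way needs relocatable device code, e.g. nvcc -dc or --relocatable-device-code=true.

// shared header: declaration visible to every translation unit
extern __shared__ DataSharingStateTy DataSharingState;

// exactly one .cu file provides the definition that nvlink resolves against
__shared__ DataSharingStateTy DataSharingState;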
gtbercea added inline comments.Nov 27 2017, 5:42 PM
libomptarget/deviceRTLs/nvptx/src/omp_data.cu
44 ↗(On Diff #121993)

Is this Data Sharing support needed in this patch?

sfantao added inline comments.Nov 29 2017, 3:23 AM
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

Here 32 is both the GPU warp size and the width of the integer __popc() takes; I agree we should get an enumerator to indicate that. What is tricky here is that the function assumes the warp size matches the width of a 32-bit integer. This assumption is made in CUDA, so I think it is fine we also do it here.

Actually, here I think the name of the function getWarpMasterActiveThreadId is misleading. This function returns the number of active threads in the current warp whose ID in the warp is lower than the current one. If there are no active threads with a lower ID, then the current thread is the warp master. In other words, this function returns zero if the current thread is the warp master.

This function seems to only be used in IsWarpMasterActiveThread. So I suggest removing the current implementation of IsWarpMasterActiveThread, renaming this function to IsWarpMasterActiveThread, and having it return:

return __popc(Sh) == 0u;
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

Using DS_Max_Worker_Warp_Size - 1 is fine here. This is mostly used to calculate the ID of a thread in its warp, i.e. threadID() % warpSize, which is equivalent to threadID() & DS_Max_Worker_Warp_Size. The compiler should be smart enough to convert one to the other if warpSize is a constant.

I'd rather prefer to make little optimisations explicit in the code, but if one thinks this extra macro is too much, one could revisit all uses and see if warpSize is constant and use threadID() % warpSize throughout the code. I don't see any reason for making warpSize non-constant anyway.

Hahnfeld added inline comments.Nov 29 2017, 6:52 AM
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

I'm not sure I get your arguments: Let's start with the function __popc(), which returns the number of bits set to 1 in the argument. This just happens to be a 32-bit integer, which has nothing to do with the warp size as far as I can see.

If this function really does what you described, it deserves documentation, or the code needs to be rewritten so that others can understand it. This might be related to what __BALLOT_SYNC(0xFFFFFFFF, true) does, which I've already criticized below.

In the end, I don't understand what your proposal is: the function name you mention is still the same, and the return statement basically inverts the value returned by __popc().

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

Do you mean threadID() % warpSize is equivalent to threadID() & (DS_Max_Worker_Warp_Size - 1)? Because otherwise I disagree.

I agree that the compiler will convert the usage. IMO threadID() % warpSize is a lot more obvious than the bit operations the compiler might use.

sfantao added inline comments.Nov 29 2017, 8:00 AM
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

I'm not sure I get your arguments: Let's start with the function __popc(), which returns the number of bits set to 1 in the argument. This just happens to be a 32-bit integer, which has nothing to do with the warp size as far as I can see.

If this function really does what you described, it deserves documentation, or the code needs to be rewritten so that others can understand it. This might be related to what __BALLOT_SYNC(0xFFFFFFFF, true) does, which I've already criticized below.

The function is documented in http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT_1g43c9c7d2b9ebf202ff1ef5769989be46. Using 32-bit integers to do mask operations has everything to do with the warp size, in the sense that if we had larger warp sizes we would have to change the literals (including the ones used with __ballot) and use different versions or a combination of multiple __popc() calls.

In the end, I don't understand what your proposal is: the function name you mention is still the same, and the return statement basically inverts the value returned by __popc().

All I am suggesting is to remove the current implementation of IsWarpMasterActiveThread and rename getWarpMasterActiveThreadId to IsWarpMasterActiveThread. If __popc returns zero, the current thread is the master thread, so comparing it with zero will return the correct result. This is what IsWarpMasterActiveThread does right now.

I agree we can improve the comments to document all this.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

Do you mean threadID() % warpSize is equivalent to threadID() & (DS_Max_Worker_Warp_Size - 1) ?

Yes, that is what I mean. Of course, this assumes warp size is a power of 2.

I agree that the compiler will convert the usage. IMO threadID() % warpSize is a lot more obvious than the bit operations the compiler might use.

I don't have a strong opinion here. I agree that threadID() % warpSize is easier to read and I think it's fine we go with that. It's just that if we intend the implementation to be threadID() & (DS_Max_Worker_Warp_Size - 1), I don't see why we shouldn't write that in the code.

Hahnfeld added inline comments.Nov 29 2017, 8:19 AM
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

Btw: If we just compare __popc(x) == 0 then we know x == 0 because x has zero bits set to 1. So no point in doing __popc() at all?

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75
  1. IMO source code is meant to be understood by humans who are normally much more familiar with mathematical operations (modulo) than bit operations - even programmers.
  2. Because it assumes a power of 2. If the compiler can prove that - great, do the optimization.

(If we happen to find performance implications, we can still change it and refactor it to a function and add the documentation why this is necessary, what it assumes and how it works)

sfantao added inline comments.Nov 29 2017, 12:10 PM
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

Btw: If we just compare __popc(x) == 0 then we know x == 0 because x has zero bits set to 1. So no point in doing __popc() at all?

True, given that we are using that only to compare against zero, evaluating x == 0 is sufficient.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

Fine by me.

grokos updated this revision to Diff 126036.Dec 7 2017, 1:50 PM
grokos marked 26 inline comments as done.
grokos added a project: Restricted Project.

Responded to comments and modified the code accordingly.

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
64

Correct, sm_35 is the minimum required. I've set the default to 35 and prepared a clang patch to do the same in the compiler.

152–159

Arpith claims that the nvptx runtime uses intrinsics that are only available with certain PTX versions (e.g. Volta SM, CUDA 9, PTX 6.0). Also, the logic here has been changed to match clang's choice of PTX version.

185–187

Indeed, multiple GPU architectures do not work with the current compiler. I've removed that support.

libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt
2 ↗(On Diff #121993)

Let's keep it here since the design described is the specialization for the nvptx device. The main algorithms are indeed implemented in the runtime.

libomptarget/deviceRTLs/nvptx/src/data_sharing.cu
40 ↗(On Diff #121993)

OK, so I've replaced the magic number 32 with the enumerator DS_Max_Worker_Warp_Size (which is equal to 32, the warp size), removed IsWarpMasterActiveThread, and renamed getWarpMasterActiveThreadId to IsWarpMasterActiveThread. Also, I'm just returning Sh == 0, no need for __popc() at all.
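
A sketch of the resulting check, assuming Sh is the set of active lanes with a lower lane ID than the current thread (the exact definition of Sh is not shown in this thread, so the names below are illustrative):

__device__ static bool isWarpMasterActiveThreadSketch() {
  unsigned Active = __BALLOT_SYNC(0xFFFFFFFF, true);       // active lanes in this warp
  unsigned LaneId = threadIdx.x % DS_Max_Worker_Warp_Size; // this thread's lane (1D block assumed)
  unsigned Sh = Active & ((1u << LaneId) - 1);             // active lanes below us
  return Sh == 0;                                          // none => we are the warp master
}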

132 ↗(On Diff #121993)

I added a short function with a description of what it does:

// Find the active threads in the warp - return a mask whose n-th bit is set if
// the n-th thread in the warp is active.
__device__ static unsigned getActiveThreadsMask() {
  return __BALLOT_SYNC(0xFFFFFFFF, true);
}

What __ballot_sync does can be found here. For ALL threads in the warp (mask = 0xFFFFFFFF) we evaluate the predicate (true). The result is a mask whose n-th bit is set if the n-th thread of the warp is active.
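
Two possible uses of such a mask, purely for illustration (not from the patch):

__device__ static void inspectActiveMaskSketch() {
  unsigned Mask = getActiveThreadsMask();
  int NumActive = __popc(Mask);          // how many threads in the warp are active
  int FirstLane = __ffs((int)Mask) - 1;  // lane ID of the lowest active thread
  (void)NumActive;
  (void)FirstLane;
}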

libomptarget/deviceRTLs/nvptx/src/libcall.cu
18

OK, I got your point. It's better than having nothing. I've changed the code to return 1ns.
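
A sketch of the agreed behaviour (whether the literal or an existing constant such as TIMER_PRECISION is reused is a detail of the actual patch): the resolution of %globaltimer is one nanosecond, so the tick length is 1e-9 seconds.

__device__ double omp_get_wtick(void) {
  return 1e-9; // seconds per tick of %globaltimer
}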

402

The Fortran people insist that all _xlf_* functions below are still needed. If they ever become obsolete we can then remove them.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

So we'll go with the modulo solution.

101

OK, for the time being the default is set to disabled so as not to create any problems while building the runtime.

grokos updated this revision to Diff 126042.Dec 7 2017, 2:04 PM

Typos....

Some comments, still mostly about the build system for the bclib

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
64–68
  1. No need for different cases if we don't support building multiple architectures for now.
  2. LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY needs to be defined as CACHE variable, with a default of 35 in this case.
74–77

Needs also to be defined as CACHE variable.

105–106

Remove, as discussed we should not depend on in-tree components being built.

110–112

This does not depend on a Clang checkout anymore, remove

117–123

(Probably this case will go then too...)

125–140
  1. Do not depend on in-tree components.
  2. I think you should be able to do the "link" step with Clang as well, can you please test this?
172–176

No need for different cases if we don't support multiple architectures.

201

Should not depend on LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER_FROM_TREE as discussed

libomptarget/deviceRTLs/nvptx/src/counter_group.h
18–21

stdlib.h and stdio.h are still there...

libomptarget/deviceRTLs/nvptx/src/libcall.cu
18

TICK can go away then (hopefully)

30

Please reuse TIMER_PRECISION

402–403

What's this comment about?

Here is a general comment: we should follow Alex's suggestion to make a generic GPU device and extend that to NVPTX and AMDGCN. My current idea is:

  1. treat everything currently spelled nvptx as generic_gpu
  2. add templates to some functions and instantiate them with the types nvptx and amdgcn require (a rough sketch follows after this comment)
  3. for example, the warp size in nvptx is 32 while in amdgcn it is 64

Thoughts?
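
A rough sketch of the idea, explicitly not part of this patch (all names are illustrative): make the warp size a template parameter so NVPTX (32) and AMDGCN (64) can instantiate the same helpers.

template <unsigned WarpSize>
__device__ static unsigned getLaneIdGeneric(unsigned ThreadId) {
  return ThreadId % WarpSize; // lane ID within the warp
}
// An NVPTX build would instantiate getLaneIdGeneric<32>, an AMDGCN build getLaneIdGeneric<64>.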

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
101

I know this issue. We do have a patch on clang 6.0 to allow this. But I am not confident enough to submit it I guess. Who should we talk to?

Here is a general comment: we should follow Alex's suggestion to make a generic GPU device and extend that to NVPTX and AMDGCN. My current idea is:

  1. treat everything currently spelled nvptx as generic_gpu
  2. add templates to some functions and instantiate them with the types nvptx and amdgcn require
  3. for example, the warp size in nvptx is 32 while in amdgcn it is 64

Thoughts?

No, not in this patch which already is very large:

  1. The size makes it hard to review and update. This slows down the entire process of getting this into trunk.
  2. This patch is already circumventing the "normal" LLVM process of well-tested patches that introduce one functionality after another. (That said, I still think we don't have an alternative to the current approach: The whole interface is set and already partially implemented in the compiler, so we need to do this in one pass because splitting makes no sense.)

I'll happily review patches that introduce support for AMDGCN after a first version of the code is committed but I refuse to let this in with this revision.

grokos updated this revision to Diff 126487.Dec 11 2017, 5:35 PM
grokos marked 12 inline comments as done.
grokos added inline comments.Dec 11 2017, 5:36 PM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
117–123

I left it there. The logic now is as follows:

  1. If the user has specified a compiler via -DLIBOMPTARGET_NVPTX_CUDA_COMPILER then we'll use that.
  2. If not, then we check whether CMAKE_C_COMPILER is clang; if yes, then we'll use that.
  3. Otherwise, we try to search for a clang compiler. If we don't find any, then we don't build bclib.
125–140
  1. Changed the linker-setting logic to the same as the compiler-setting logic above.
  2. Clang can be used to link but with different command line options. For consistency I am sticking to llvm-link. My rationale is that if we are building libomptarget with clang then llvm-link must also be around, so if CMAKE_C_COMPILER_ID equals "Clang", then use llvm-link.
libomptarget/deviceRTLs/nvptx/src/libcall.cu
402–403

It explains how compare-and-swap works.

guansong added inline comments.Dec 11 2017, 9:33 PM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
21

If you move the #include "state-queue.h" line in this .cu file and in omp_data.cu into omptarget-nvptx.h (in my local build, right after counter_group.h),

you will be able to keep the global data tables in one location, omptarget-nvptx.h, instead of having two copies.

grokos updated this revision to Diff 126555.Dec 12 2017, 8:17 AM
grokos marked an inline comment as done.
grokos added inline comments.
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
21

Right, I moved it to omptarget-nvptx.h in the new diff. Thanks!

gtbercea accepted this revision.Dec 20 2017, 2:01 AM
Hahnfeld requested changes to this revision.Dec 20 2017, 2:13 AM

Good to go @Hahnfeld ?

No, I didn't have time to look at the latest changes due to travel. I've already added some comments on the build system, which I still think is not optimal yet.

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
117–123

I still don't like searching for a compiler in PATH that is different from CMAKE_C_COMPILER. That is not what the user requested, and we should report that. Please just error out.

125–140

I disagree: CMAKE_C_COMPILER doesn't need to be in PATH so we might not be able to find llvm-link or end up with a different installation. And using clang for linking would be consistent with the usual behavior (build system invokes compiler with object files and the compiler driver invokes the linker). Conclusion: It is better and possible to use clang for linking and we should do it.

This revision now requires changes to proceed.Dec 20 2017, 2:13 AM

Thanks @Hahnfeld just making sure this goes ahead! :)

grokos added inline comments.Dec 20 2017, 11:58 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
125–140

After looking into this, from what I could find clang cannot link bitcode files into a single bitcode file - this can only be done with llvm-link. What clang can do is link multiple .bc files into a target-specific object file (which is not what we want here) and for this it needs libLTO and the LLVM gold plugin (which are not necessarily installed on every system and asking the user to provide paths makes things more complex). If anyone knows of a simple way to make clang link multiple .bc files into a single bitcode library, please let me know. Otherwise, I'm afraid we'll have to stick to llvm-link.

hfinkel added inline comments.Dec 20 2017, 12:50 PM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
125–140

I don't believe that Clang itself has any configuration in which it will do what llvm-link does. I'd be in favor of such a mode, and maybe this is a good motivating use case, but others might disagree (i.e., if you're working with bitcode, then we might assume that you have the LLVM utilities around).

We should probably search for llvm-link first in the directory containing clang (as it, indeed, might not be in the user's path). We would need to document, moreover, that LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER needs to be set to llvm-link from the right installation if there's not one in Clang's bin directory and the one in the path is not right.

Hahnfeld added inline comments.Dec 20 2017, 12:59 PM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
125–140

@grokos Looks like you are right. I agree with @hfinkel, I would have also proposed searching in the directory of the clang binary. I think we might even omit searching in PATH and just defer to the user so we can avoid any trouble with incompatible tools. Thoughts?

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.
  2. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.
libomptarget/deviceRTLs/nvptx/src/counter_groupi.h
15

Does this file need to include cuda_runtime.h?

libomptarget/deviceRTLs/nvptx/src/interface.h
361–362

Will the host runtime use the same entry point? Because it's not yet present in libomp

365–369

I think this went away in previous update of this patch.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

I don't see this implemented?

libomptarget/deviceRTLs/nvptx/src/option.h
21–44

Only MAX_THREADS_PER_TEAM and WARPSIZE are used, the rest can go away.

70–77

Not needed, remove.

libomptarget/deviceRTLs/nvptx/src/parallel.cu
459

Typo: Do nothing. Start the new sentence with a capital letter and end with a full stop.

libomptarget/deviceRTLs/nvptx/src/reduction.cu
139–156

This isn't used anymore, please remove.

173–175
  1. This has a lot of commented code.
  2. Please don't duplicate the function header, this will avoid mismatches in the future.
314–316

I think this header shouldn't be duplicated either.

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

  2. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

Please see my discussion with Samuel: The code means to do a modulo and the bit operation is an optimization that the compiler can do.

  2. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

You are right, DS_Max_Warp_Number means something else and only accidentally has the same value.

Two global remarks:

  1. I think we agreed on having <thread id> % <warp size> instead of bit operations.

What's wrong with bit-wise operations as long as they are documented? I think we should keep them and then comment what it is that they do.

Please see my discussion with Samuel: The code means to do a modulo and the bit operation is an optimization that the compiler can do.

Good point. Will this optimization be applied by the compiler regardless of kernel size, placement in the code etc?

  2. Somewhat related, there are currently at least four different "symbols" for the warp size: warpSize, WARPSIZE, DS_Max_Worker_Warp_Size, and DS_Max_Warp_Number. This needs to be exactly one that is used throughout the runtime.

I agree about the collapsing of the first two, maybe the 3rd. The 4th is the maximum number of warps per team AFAICT not the warp size (sure, it happens to be 32).

You are right, DS_Max_Warp_Number means something else and only accidentally has the same value.

guansong added inline comments.Jan 12 2018, 9:37 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
159

For CUDA bc files, a CUDA install will have bc files for different architectures, such as:

/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_35.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_30.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_50.10.bc
/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_20.10.bc

Should we consider building different bc files for the end user?

gtbercea added inline comments.Jan 12 2018, 9:49 AM
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
159

What do you mean by that?

Does this patch do what you mean: https://reviews.llvm.org/D41724 ?

tra added a subscriber: tra.Jan 12 2018, 9:52 AM
tra added inline comments.
libomptarget/deviceRTLs/nvptx/CMakeLists.txt
159

That's no longer true for CUDA-9 -- it has a single bitcode file for all architectures.

Is OMP runtime for GPU going to provide anything that a) needs to have the same API for all GPUs and, b) has to be heavily GPU-specific under the hood? If not, then one common library would probably suffice.

Oh, and you need to add documentation about new CMake flags to the README.rst. You can probably get most of that from an old revision of D40920...

grokos updated this revision to Diff 131110.Jan 23 2018, 11:27 AM
grokos marked 10 inline comments as done.

Made corrections according to the feedback I got. Regarding the general comments:

  1. I found only one more instance of threadID & (WARPSIZE-1) and replaced it with threadID % WARPSIZE. There are a few more bitwise operations in the code of the form threadID & ~(WARPSIZE-1) (note the bitwise NOT) which do something else - they round threadID down to the nearest multiple of WARPSIZE. Doing this with arithmetic operators would require something like threadID - threadID%WARPSIZE, which I doubt the compiler could optimize (not to mention that the bitwise version is actually more easily recognizable as a round-to-a-multiple-of-2^x operation than the arithmetic version). A short sketch of the two forms follows after this list.
  2. I replaced all references to the built-in warpSize with the macro WARPSIZE. warpSize would have the advantage that it makes the code forward-compatible (if the warp size ever changes, the change will be reflected in this variable); however, it is not a compile-time constant, and that would prevent the compiler from making certain optimizations (especially since the warp size is used in modulo operations, for which we have assumed that the compiler can transform the modulo operation into bitwise operations). DS_Max_Worker_Warp_Size was also replaced with WARPSIZE.
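
The sketch referred to in point 1 (names are only for this example): both forms round ThreadId down to the start of its warp, assuming WARPSIZE is a power of two.

__device__ static unsigned warpStartBitwise(unsigned ThreadId) {
  return ThreadId & ~(WARPSIZE - 1);     // form kept in the code
}
__device__ static unsigned warpStartArithmetic(unsigned ThreadId) {
  return ThreadId - ThreadId % WARPSIZE; // equivalent, but less recognizable as a rounding operation
}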
libomptarget/deviceRTLs/nvptx/src/interface.h
361–362

Yes, I will add the same function to the host runtime. It's not there yet, but it's on my list.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
75

I removed DS_Max_Worker_Warp_Size_Bit_Mask completely as it is not needed anymore (we now use x % WARPSIZE).

libomptarget/deviceRTLs/nvptx/src/option.h
21–44

THREAD_ABSOLUTE_LIMIT is also used (in the definition of MAX_THREADS_PER_TEAM).

Regarding the bitcode compiler/linker, we agreed on skipping looking for a suitable compiler in PATH, so for consistency I also skip looking in PATH for a linker. CMake only tries to locate llvm-link in the same directory as the clang binary (if clang is the CMAKE_C_COMPILER) or uses whatever the user has specified. Following @hfinkel 's suggestions, I updated the documentation to describe this behavior.

I think this finally looks reasonably good to land: I'm going to accept this once the last few minor comments are addressed. So if anyone has concerns, please raise them now!

@grokos Can you please upload a new version with changes for my inline comments? This will make it easier to spot errors in the last changes. Afterwards, I think the code should be formatted with clang-format. Unfortunately, it doesn't pick up .cu files by default but I think we might be able to tweak this with a configuration file?

libomptarget/deviceRTLs/nvptx/CMakeLists.txt
97

clang in the PATH isn't true anymore.

109–110

We cannot satisfy the user's request, so this should probably be libomptarget_error_say?

122–123

libomptarget_error_say as well

libomptarget/deviceRTLs/nvptx/src/interface.h
361–362

Then you need to agree on the interface with Intel, just saying...

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
46–51

Not used?

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

I think all other usages of >> DS_Max_Worker_Warp_Size_Bits can be replaced with % WARPSIZE or am I missing something in here?

libomptarget/deviceRTLs/nvptx/src/option.h
21–44

Yeah, but in the same file and only once. So you can just do #define MAX_THREADS_PER_TEAM 1024 ;-)

libomptarget/deviceRTLs/nvptx/src/reduction.cu
239

I think this endif needs to go before the last closing } to also finish the function in the if case.

387–389

Here too, I think endif needs to go before the return ThreadId == 0.

grokos marked 9 inline comments as done.

I created a child revision with the final changes.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
46–51

Probably a leftover, removed.

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

You mean div WARPSIZE, not mod... I assume you prefer the division over shifting... I'll change those instances in the new diff. WARPSIZE is a compile-time constant and a power of 2, so the compiler should be able to optimize the division away.
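
Illustrative sketch (names are only for this example): WARPSIZE is a power-of-two compile-time constant, so the division compiles down to the shift it replaces.

__device__ static unsigned getWarpIdSketch(unsigned ThreadId) {
  return ThreadId / WARPSIZE; // same result as ThreadId >> DS_Max_Worker_Warp_Size_Bits
}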

libomptarget/deviceRTLs/nvptx/src/reduction.cu
239

Good catch, I forgot to move the #endif as well....

I created a child revision with the final changes.

Nah, please upload it here to keep it in one place. I just meant to not mix up the last fixes and the formatting. So:

  1. Upload a new (full) diff with the last changes.
  2. Upload (yet another) diff which addresses the formatting.
grokos updated this revision to Diff 131617.Jan 26 2018, 10:24 AM
grokos marked 3 inline comments as done.
grokos updated this revision to Diff 131619.Jan 26 2018, 10:39 AM

Fixed formatting with clang-format.

Hahnfeld added inline comments.Jan 26 2018, 10:39 AM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Yeah, sure! To be precise, I want to avoid any mismatch between WARPSIZE and 2 ** DS_Max_Worker_Warp_Size_Bits. You should now be able to delete DS_Max_Worker_Warp_Size_Bits, its last usages are gone ;-)

grokos updated this revision to Diff 131621.Jan 26 2018, 10:43 AM
grokos marked an inline comment as done.

Removed obsolete enum constant.

Hahnfeld accepted this revision.Jan 28 2018, 2:17 AM

LGTM

FYI: I just committed rC323615 so that git-clang-format will pick up changes to CUDA files in the future. Until then we need to do this manually with clang-format...

This revision is now accepted and ready to land.Jan 28 2018, 2:17 AM
omalyshe added inline comments.
README.rst
289 ↗(On Diff #131110)

Should it be "the NVPTX device RTL" like in the previous item?

grokos marked an inline comment as done.Jan 29 2018, 5:58 AM
grokos added inline comments.
README.rst
289 ↗(On Diff #131110)

Thanks for pointing it out, I've changed it!

This revision was automatically updated to reflect the committed changes.
grokos marked an inline comment as done.
Herald added a project: Restricted Project.
Herald added a subscriber: jfb.