- User Since
- Feb 16 2016, 1:22 PM (114 w, 18 h)
Mon, Apr 16
Fri, Apr 13
Thu, Apr 12
I think this is now good to go.
Tue, Apr 10
Thu, Apr 5
One caveat regarding Alexey's proposal: According to the CUDA programming guide, malloc on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.
Mon, Apr 2
OK, so we can proceed with this solution and if we observe any performance problem in the future then I will push a different fix (instead of declaring __shared__ variables as extern, we can have getter functions in the .cu file in which the variable is defined which other .cu files can call to get a pointer/reference to the variable).
Thu, Mar 29
Right, I remember this issue. It is caused by the fact that clang does not support __shared__ variables to be extern. This restriction was explicitly introduced here: https://reviews.llvm.org/D25125.
Wed, Mar 28
What is the problem with building the bc library with the latest clang? I cannot reproduce the issue.
Mar 22 2018
Mar 21 2018
Mar 20 2018
Mar 16 2018
Rebase after bug fix.
Updated the patch to query the target offload kind from libomp instead of parsing OMP_TARGET_OFFLOAD.
Question: If the ICV is set to DEFAULT, then libomptarget must change it to either MANDATORY (if there are available devices in the system) or DISABLED (if there are no devices). Should libomptarget notify libomp about this change?
Waiting for the libomp-side patch for OMP_TARGET_OFFLOAD, I removed parsing this env var from within libomptarget and now this revision only implements the internal logic to handle whatever libomptarget will get by querying libomp. (there is already some code checking for OMP_TARGET_OFFLOAD==DISABLED, I'm leaving it there in order not to break systems which make use of this option right now)
Mar 15 2018
Right, I just read TR6 and everything is defined there. I'll revise the patch and upload a new diff.
This is obviously good to go!
Mar 14 2018
Mar 12 2018
Mar 9 2018
Mar 8 2018
I think the assignment makes more sense.
Mar 7 2018
I think the patch looks good. Let's remove the obsolete code in preparation for the new data-sharing scheme.
OK, I left the change to the internal device ID representation out of this patch.
Mar 6 2018
Mar 1 2018
As far as I am concerned, I agree that the existing data sharing scheme should be removed in favour of a simpler future patch which will introduce the more generic data sharing scheme.
Feb 12 2018
I don't have any other remarks, looks good.
Feb 8 2018
Feb 7 2018
Feb 6 2018
Looks good. Thanks for submitting! I'll take care of duplicate extern declarations in another patch.
Jan 30 2018
Jan 29 2018
Jan 26 2018
Removed obsolete enum constant.
Fixed formatting with clang-format.
I created a child revision with the final changes.
Jan 23 2018
Regarding the bitcode compiler/linker, we agreed on skipping looking for a suitable compiler in PATH, so for consistency I also skip looking in PATH for a linker. CMake only tries to locate llvm-link in the same directory as the clang binary (if clang is the CMAKE_C_COMPILER) or uses whatever the user has specified. Following @hfinkel 's suggestions, I updated the documentation to describe this behavior.
Made corrections according to the feedback I got. Regarding the general comments:
- I found only one more instance of threadID & (WARPSIZE-1) and replaced it with threadID % WARPSIZE. There are a few more bitwise operations in the code of the form threadID & ~(WARPSIZE-1) (note the bitwise NOT) which do something else - they round threadID down to the nearest multiple of WARPSIZE. Doing this with arithmetic operators would require something like threadID - threadID%WARPSIZE which I doubt the compiler could optimize (not to mention that the bitwise version is actually more easily recognizable as a round-to-a-multiple-of-2^x operation than the arithmetic version).
- I replaced all references to the built-in warpSize with the macro WARPSIZE. warpSize would have the advantage that it makes the code forward-compatible (if the warp size ever changes, the change will be reflected on this variable); however, it is not a compile-time constant and that would prevent the compiler from making certain optimizations (especially since the warp size is used in modulo operations for which we have assumed that the compiler can transform the modulo operation into bitwise operations). DS_Max_Worker_Warp_Size was also replaced with WARPSIZE.
Dec 20 2017
Dec 12 2017
Dec 11 2017
Dec 7 2017
Responded to comments and modified the code accordingly.