- User Since: Feb 16 2016, 1:22 PM
Jan 4 2019
Nice approach to optimizing the degree of collisions of atomic adds. LGTM.
Looks good. Thanks!
LGTM. Thanks for all this work!
Jan 3 2019
I'll accept the patch for the sake of consistency and correctness of execution. Just one question:
which cost is too high
So should we expect a performance penalty until function copy is fixed in LLVM and we can revert back to __syncthreads()?
Dec 31 2018
I'm not sure what we are trying to do here. OMPTARGET_DEBUG is not an environment variable, it's a preprocessor definition. It is used by libomptarget/include/omptarget.h to include all necessary header files and define the DEBUGP macro appropriately, which in turn is used by libomptarget/src/private.h to define the DP macro used throughout the library.
Dec 27 2018
Dec 18 2018
Since a similar solution was adopted for clang, I think we should let this one land. After all, it's a matter of consistency between the two projects.
Nov 5 2018
Oct 22 2018
Are we still interested in this patch or should we abandon it?
Oct 11 2018
Oct 1 2018
Looks good. Thanks for taking care of this issue!
Looks good. I have no idea why this option was left there...
Sep 29 2018
Sep 28 2018
Interesting. Are there other cases where alignment is needed?
Then this revision is good to go!
Sep 21 2018
@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?
Sep 12 2018
I completely missed this one. The proposed change makes sense.
Sep 11 2018
Sep 10 2018
Great job, adding testing infrastructure has been on our list for a long time!
Sep 6 2018
Sep 4 2018
Good catch, thanks for the fix!
Sep 3 2018
@AlexEichenberger: Have you confirmed whether flag 0x400 is free to be used for LAMBDA_REF or has been reserved for something else in XL?
Aug 23 2018
Aug 22 2018
Is this patch about lambdas inside target regions or about lambdas containing a target region? From the description I assume it's the former. In that case, what happens if there is more than one lambda in the target region?
Aug 16 2018
Aug 2 2018
Jul 20 2018
Jul 19 2018
@ABataev: Can you put a link to the clang-side patch in the description so that we link the two patches together? Also, please let me know when you commit the clang patch so that I commit this one as well.
Added example code to demonstrate the need for padding in partially mapped structs.
Jul 18 2018
Jul 17 2018
Jul 13 2018
I don't have any further comments.
Jul 12 2018
Jul 9 2018
Looks good. Do you think we should do the same for the generic-elf-64 plugin?
Jun 29 2018
The addition looks good. I've seen this piece of code in quite a few places, maybe we need to refactor it into a mini-function. But let's revisit that in the future.
Jun 25 2018
Looks good to me, thanks for submitting those fixes!
Jun 18 2018
May 24 2018
I think this looks good. Definitely a correct implementation compared to what we have now!
May 22 2018
Looks good. I don't see any reason why performance would be affected and it is a necessary change for the sake of correctness.
May 21 2018
May 16 2018
Looks good! Thanks for taking care of it!
May 15 2018
May 14 2018
I agree with Jonas: the initial version of the nvptx RTL that we tried to upstream was somewhat like what you are trying to do here, and the consensus was that we do not want to have these dependencies.
May 4 2018
May 2 2018
May I suggest that we use something like LIBOMPTARGET_DEVICE_RTL_DEBUG instead of LIBOMPDEVICE_DEBUG? I think this name is more descriptive of what the env var does.
Apr 27 2018
I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.
Apr 25 2018
Apr 16 2018
Apr 13 2018
Apr 12 2018
I think this is now good to go.
Apr 10 2018
Apr 5 2018
One caveat regarding Alexey's proposal: According to the CUDA programming guide, malloc on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.
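For reference, enlarging the device malloc heap from the host is a one-call configuration step. This is a hedged sketch (the 64 MB figure is an arbitrary example, not a recommendation), and it only works if the host already knows the required size, which is exactly the circularity noted above:

```cuda
#include <cuda_runtime.h>

int main() {
  // Must be set before the first kernel that uses device-side malloc is
  // launched; the heap size cannot be changed afterwards.
  size_t heapBytes = 64 * 1024 * 1024; // example value only
  cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
  return err == cudaSuccess ? 0 : 1;
}
```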
Apr 2 2018
OK, so we can proceed with this solution. If we observe any performance problem in the future, I will push a different fix: instead of declaring __shared__ variables as extern, we can add getter functions to the .cu file in which each variable is defined, which other .cu files can call to get a pointer/reference to the variable.
Mar 29 2018
Right, I remember this issue. It is caused by the fact that clang does not support __shared__ variables to be extern. This restriction was explicitly introduced here: https://reviews.llvm.org/D25125.
Mar 28 2018
What is the problem with building the bc library with the latest clang? I cannot reproduce the issue.
Mar 22 2018
Mar 21 2018
Mar 20 2018
Mar 16 2018
Rebase after bug fix.