[libomptarget][nvptx] Reduce calls to cuda header
Remove use of clock_t in favour of a builtin. Drop a preprocessor branch.
Differential D94731
Authored by JonChesterfield on Jan 14 2021, 4:31 PM.
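A minimal sketch of the clock_t substitution the summary describes, assuming the code previously relied on the toolkit's clock_t/clock(); the before/after here is an illustration, not the diff contents:

  #include <stdint.h>

  // Before: clock_t and clock() are declared by the cuda headers.
  // clock_t Start = clock();

  // After: read the clock special register through a clang builtin,
  // with no cuda.h dependency.
  uint32_t Start = __nvvm_read_ptx_sreg_clock();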
Event Timeline

Comment Actions
I don't really like that we have to implement the cuda functions we already implement in places like clang/lib/Headers/__clang_cuda_device_functions.h.

Comment Actions
Note that this is incremental (on the basis that it's already hard enough to review). cuda.h is still used for CUDA_VERSION here, but also for the atomic functions and a few libc prototypes. I have the library compiling without cuda.h locally.

The complicated pieces that I'd prefer not to reimplement here are the shuffles and the CUDA_VERSION condition. The shuffles are in __clang_cuda_intrinsics.h, which includes crt/sm_70_rt.hpp from cuda-dev. Some derived macros that we could use instead of CUDA_VERSION are in __clang_cuda_runtime_wrapper.h, which includes lots of pieces of cuda-dev.

__clang_cuda_device_functions.h looks standalone. It provides one-line definitions like __DEVICE__ void __threadfence(void) { __nvvm_membar_gl(); }. We could use that, though we don't gain much, and we would break if it changed to depend on a cuda header.

I don't have a solution to the unknown CUDA_VERSION yet. I'd like to derive the branch from the architecture we're compiling for - I think all that matters here is whether the target arch has lockstep execution, which is easier to determine than the cuda library version on another machine.

Comment Actions
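A sketch of deriving that branch from the target architecture instead of CUDA_VERSION, assuming the property that matters is lockstep execution; the choice of __kmpc_impl_syncwarp as the example and the INLINE definition are illustrative, not from the patch:

  #include <stdint.h>
  #define INLINE inline __attribute__((always_inline))

  // Hypothetical: key the branch off __CUDA_ARCH__, which the compiler
  // always knows, instead of CUDA_VERSION, which describes the toolkit
  // installed on the build machine.
  #if __CUDA_ARCH__ >= 700
  // Volta and newer have independent thread scheduling, so the warp
  // must be re-converged with an explicit mask.
  INLINE void __kmpc_impl_syncwarp(uint32_t Mask) {
    __nvvm_bar_warp_sync(Mask);
  }
  #else
  // Older architectures execute the warp in lockstep; nothing to do.
  INLINE void __kmpc_impl_syncwarp(uint32_t) {}
  #endif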
We have forward declarations and we provide the definitions in the openmp_wrapper headers. No more including cuda.h here.

Comment Actions
That's interesting. Move some of the current target_impl.h into clang-shipped headers:

  target_impl.h:
    DEVICE lanemask_t __kmpc_impl_activemask();
  nvptx:
    Some header, #include <cuda.h>
    INLINE lanemask_t __kmpc_impl_activemask() {...}
  amdgpu:
    target_impl.cpp implementation

Comment Actions
Remaining calls could be replaced with builtins by duplicating parts of __clang_cuda_device_functions.h. That would yield a deviceRTL that is independent of cuda. However, it ...
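A concrete version of that layering, with stand-ins for the deviceRTL's DEVICE/INLINE macros; the activemask asm body and the 32-bit lanemask typedef are assumptions for illustration:

  #include <stdint.h>

  // Stand-ins for the deviceRTL's attribute macros.
  #define DEVICE
  #define INLINE inline __attribute__((always_inline))

  // target_impl.h: shared declaration, visible to all targets, no cuda.h.
  typedef uint32_t lanemask_t; // nvptx: one bit per lane of a 32-wide warp
  DEVICE lanemask_t __kmpc_impl_activemask();

  // nvptx, in a clang-shipped header: definition via inline asm,
  // so no #include <cuda.h> is needed.
  INLINE lanemask_t __kmpc_impl_activemask() {
    lanemask_t Mask;
    asm volatile("activemask.b32 %0;" : "=r"(Mask));
    return Mask;
  }

  // amdgpu: the definition lives in target_impl.cpp instead.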