This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][nvptx] Reduce calls to cuda header
Closed · Public

Authored by JonChesterfield on Jan 14 2021, 4:31 PM.

Details

Summary

[libomptarget][nvptx] Reduce calls to cuda header

Remove use of clock_t in favour of a builtin. Drop a preprocessor branch.
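As an illustration of the clock_t change (a sketch, not the exact diff: the builtins are clang's NVPTX builtins, and the before/after lines are reconstructed):

    // Before: clock_t and clock() come from the cuda headers.
    // clock_t Start = clock();

    // After: clang builtins read the PTX clock registers directly,
    // no cuda header required.
    unsigned Start32 = __nvvm_read_ptx_sreg_clock();
    unsigned long long Start64 = __nvvm_read_ptx_sreg_clock64();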

Diff Detail

Event Timeline

JonChesterfield requested review of this revision. Jan 14 2021, 4:31 PM
Herald added a project: Restricted Project. Jan 14 2021, 4:31 PM

I don't really like that we have to implement the cuda functions we already implement in places like clang/lib/Headers/__clang_cuda_device_functions.h.
Another problem is that we still depend on CUDA_VERSION, and thereby on cuda.h. We should forward declare what we need here and put the declarations in the unconditionally included openmp_wrapper headers.
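A sketch of what that suggestion could look like (the prototypes below are illustrative assumptions, not the actual openmp_wrapper contents):

    // Hypothetical forward declarations standing in for #include <cuda.h>,
    // living in an unconditionally included openmp_wrapper header.
    extern "C" {
    __device__ int printf(const char *, ...);
    __device__ void *malloc(unsigned long);
    __device__ void free(void *);
    }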

JonChesterfield added a comment. (Edited) Jan 14 2021, 5:08 PM

Note that this is incremental (on the basis that it's already hard enough to review). cuda.h is still used here for CUDA_VERSION, but also for the atomic functions and a few libc prototypes. I have the library compiling without cuda.h locally.

The complicated pieces that I'd prefer not to reimplement here are the shuffles and the CUDA_VERSION condition. The shuffles are in __clang_cuda_intrinsics.h, which includes crt/sm_70_rt.hpp from cuda-dev. Some derived macros that we could use instead of CUDA_VERSION are in __clang_cuda_runtime_wrapper.h, which includes lots of pieces of cuda-dev.
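For context, the shuffle wrappers in question look roughly like this (paraphrased from the nvptx target_impl of the time; treat the exact shape as an assumption):

    INLINE int32_t __kmpc_impl_shfl_down_sync(uint32_t Mask, int32_t Var,
                                              uint32_t Delta, int32_t Width) {
    #if CUDA_VERSION >= 9000
      return __shfl_down_sync(Mask, Var, Delta, Width); // sync variant, cuda 9+
    #else
      return __shfl_down(Var, Delta, Width);            // pre-9.0 interface
    #endif
    }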

__clang_cuda_device_functions.h looks standalone. It provides one-line definitions like __DEVICE__ void __threadfence(void) { __nvvm_membar_gl(); }. We could use that, though we wouldn't gain much, and we would break if it ever changed to depend on a cuda header.
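That pattern, slightly expanded (the second wrapper is from memory and should be treated as illustrative):

    // One-line builtin wrappers of the kind __clang_cuda_device_functions.h
    // provides; __DEVICE__ is that header's function attribute macro.
    __DEVICE__ void __threadfence(void) { __nvvm_membar_gl(); }
    __DEVICE__ void __threadfence_block(void) { __nvvm_membar_cta(); }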

I don't have a solution to unknown CUDA_VERSION yet. I'd like to derive the branch from the architecture we're compiling for: I think all that matters here is whether the target arch has lockstep execution, which is easier to determine than the version of the cuda library installed on another machine.
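Concretely, the arch-derived branch could look something like this (a sketch, assuming sm_70's independent thread scheduling is the only thing the CUDA_VERSION check really guards; the inline asm mirrors how __activemask is defined in __clang_cuda_intrinsics.h):

    DEVICE lanemask_t __kmpc_impl_activemask() {
    #if __CUDA_ARCH__ >= 700
      // Post-lockstep architectures: ask the hardware which lanes are active.
      unsigned Mask;
      asm volatile("activemask.b32 %0;" : "=r"(Mask));
      return Mask;
    #else
      // Lockstep execution: every active lane votes, yielding the same mask.
      return __nvvm_vote_ballot(1);
    #endif
    }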

> I don't have a solution to unknown CUDA_VERSION yet.

We have forward declarations, and we provide the definitions in the openmp_wrapper headers. No more cuda includes here.

JonChesterfield added a comment. (Edited) Jan 14 2021, 5:28 PM

That's interesting: move some of the current target_impl.h into clang-shipped headers. As a formatted sketch, see the block after this outline.

target_impl.h: DEVICE lanemask_t __kmpc_impl_activemask();

nvptx: some clang-shipped header,

#include <cuda.h>
INLINE lanemask_t __kmpc_impl_activemask() {...}

amdgpu: target_impl.cpp implementation
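Written out, that split might look like the following (assuming lanemask_t is sized to the target's wavefront; __activemask from cuda.h and the amdgcn exec builtin are stand-ins for whatever the real definitions would be):

    // target_impl.h: declaration shared by every target
    DEVICE lanemask_t __kmpc_impl_activemask();

    // nvptx: a clang-shipped header is free to lean on cuda.h
    #include <cuda.h>
    INLINE lanemask_t __kmpc_impl_activemask() { return __activemask(); }

    // amdgpu: no cuda dependency, so the definition can stay out of line
    // in target_impl.cpp; exec is the 64-bit active-lane mask.
    DEVICE lanemask_t __kmpc_impl_activemask() {
      return __builtin_amdgcn_read_exec();
    }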

  • reduce scope
This revision is now accepted and ready to land. Jan 14 2021, 6:10 PM
JonChesterfield retitled this revision from [libomptarget][nvptx] Call builtins instead of cuda to [libomptarget][nvptx] Reduce calls to cuda header. Jan 14 2021, 6:11 PM
JonChesterfield edited the summary of this revision.
JonChesterfield edited the summary of this revision. Jan 14 2021, 6:14 PM

Remaining calls could be replaced with builtins by duplicating parts of
the cuda wrapper infrastructure. This would mean choosing which intrinsic
to call based on architecture number (>= sm_70) instead of CUDA_VERSION.

That would yield a deviceRTL that is independent of cuda. However, it
would also mean that mixed cuda + openmp code would use CUDA_VERSION
to choose intrinsics in some places and architecture number in others,
which seems likely to cause problems.

This revision was landed with ongoing or failed builds. Jan 14 2021, 6:16 PM
This revision was automatically updated to reflect the committed changes.