This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][nvptx] Reduce calls to cuda header
Closed · Public

Authored by JonChesterfield on Jan 14 2021, 4:31 PM.

Details

Summary

[libomptarget][nvptx] Reduce calls to cuda header

Remove use of clock_t in favour of a builtin. Drop a preprocessor branch.
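As an illustration of the clock_t change (a sketch, not the exact diff: the builtins are clang's NVPTX builtins, and the before/after lines are reconstructed):

    // Before: clock_t and clock() come from the cuda headers.
    // clock_t Start = clock();

    // After: clang builtins read the PTX clock registers directly,
    // no cuda header required.
    unsigned Start32 = __nvvm_read_ptx_sreg_clock();
    unsigned long long Start64 = __nvvm_read_ptx_sreg_clock64();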

Diff Detail

Event Timeline

JonChesterfield requested review of this revision. Jan 14 2021, 4:31 PM
Herald added a project: Restricted Project. Jan 14 2021, 4:31 PM

I don't really like that we have to implement the cuda functions we already implement in places like clang/lib/Headers/__clang_cuda_device_functions.h.
Another problem is that we still depend on CUDA_VERSION, and thereby on cuda.h. We should forward declare what we need here and put the declarations in the unconditionally included openmp_wrapper headers.
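A sketch of what that suggestion could look like (the prototypes below are illustrative assumptions, not the actual openmp_wrapper contents):

    // Hypothetical forward declarations standing in for #include <cuda.h>,
    // living in an unconditionally included openmp_wrapper header.
    extern "C" {
    __device__ int printf(const char *, ...);
    __device__ void *malloc(unsigned long);
    __device__ void free(void *);
    }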

JonChesterfield added a comment. (Edited) Jan 14 2021, 5:08 PM

Note that this is incremental (on the basis that it's already hard enough to review). cuda.h is still used here for CUDA_VERSION, but also for the atomic functions and a few libc prototypes. I have the library compiling without cuda.h locally.

The complicated pieces that I'd prefer not to reimplement here are the shuffles and the CUDA_VERSION condition. The shuffles are in __clang_cuda_intrinsics.h, which includes crt/sm_70_rt.hpp from cuda-dev. Some derived macros that we could use instead of CUDA_VERSION are in __clang_cuda_runtime_wrapper.h, which includes lots of pieces of cuda-dev.
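For context, the shuffle wrappers in question look roughly like this (paraphrased from the nvptx target_impl of the time; treat the exact shape as an assumption):

    INLINE int32_t __kmpc_impl_shfl_down_sync(uint32_t Mask, int32_t Var,
                                              uint32_t Delta, int32_t Width) {
    #if CUDA_VERSION >= 9000
      return __shfl_down_sync(Mask, Var, Delta, Width); // sync variant, cuda 9+
    #else
      return __shfl_down(Var, Delta, Width);            // pre-9.0 interface
    #endif
    }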

__clang_cuda_device_functions.h looks standalone. It provides one-line definitions like __DEVICE__ void __threadfence(void) { __nvvm_membar_gl(); }. We could use that, though we wouldn't gain much, and we would break if it ever changed to depend on a cuda header.
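That pattern, slightly expanded (the second wrapper is from memory and should be treated as illustrative):

    // One-line builtin wrappers of the kind __clang_cuda_device_functions.h
    // provides; __DEVICE__ is that header's function attribute macro.
    __DEVICE__ void __threadfence(void) { __nvvm_membar_gl(); }
    __DEVICE__ void __threadfence_block(void) { __nvvm_membar_cta(); }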

I don't have a solution to unknown CUDA_VERSION yet. I'd like to derive the branch from the architecture we're compiling for: I think all that matters here is whether the target arch has lockstep execution, which is easier to determine than the version of the cuda library installed on another machine.
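Concretely, the arch-derived branch could look something like this (a sketch, assuming sm_70's independent thread scheduling is the only thing the CUDA_VERSION check really guards; the inline asm mirrors how __activemask is defined in __clang_cuda_intrinsics.h):

    DEVICE lanemask_t __kmpc_impl_activemask() {
    #if __CUDA_ARCH__ >= 700
      // Post-lockstep architectures: ask the hardware which lanes are active.
      unsigned Mask;
      asm volatile("activemask.b32 %0;" : "=r"(Mask));
      return Mask;
    #else
      // Lockstep execution: every active lane votes, yielding the same mask.
      return __nvvm_vote_ballot(1);
    #endif
    }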

> I don't have a solution to unknown CUDA_VERSION yet.

We have forward declarations, and we provide the definitions in the openmp_wrapper headers. No more cuda includes here.

JonChesterfield added a comment. (Edited) Jan 14 2021, 5:28 PM

That's interesting: move some of the current target_impl.h into clang-shipped headers. As a formatted sketch, see the block after this outline.

target_impl.h: DEVICE lanemask_t __kmpc_impl_activemask();

nvptx: some clang-shipped header,

#include <cuda.h>
INLINE lanemask_t __kmpc_impl_activemask() {...}

amdgpu: target_impl.cpp implementation
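Written out, that split might look like the following (assuming lanemask_t is sized to the target's wavefront; __activemask from cuda.h and the amdgcn exec builtin are stand-ins for whatever the real definitions would be):

    // target_impl.h: declaration shared by every target
    DEVICE lanemask_t __kmpc_impl_activemask();

    // nvptx: a clang-shipped header is free to lean on cuda.h
    #include <cuda.h>
    INLINE lanemask_t __kmpc_impl_activemask() { return __activemask(); }

    // amdgpu: no cuda dependency, so the definition can stay out of line
    // in target_impl.cpp; exec is the 64-bit active-lane mask.
    DEVICE lanemask_t __kmpc_impl_activemask() {
      return __builtin_amdgcn_read_exec();
    }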

  • reduce scope
This revision is now accepted and ready to land. Jan 14 2021, 6:10 PM
JonChesterfield retitled this revision from [libomptarget][nvptx] Call builtins instead of cuda to [libomptarget][nvptx] Reduce calls to cuda header. Jan 14 2021, 6:11 PM
JonChesterfield edited the summary of this revision.
JonChesterfield edited the summary of this revision. Jan 14 2021, 6:14 PM

Remaining calls could be replaced with builtins by duplicating parts of
the cuda wrapper infrastructure. This would mean choosing which intrinsic
to call based on architecture number (>= sm_70) instead of CUDA_VERSION.

That would yield a deviceRTL that is independent of cuda. However, it
would also mean that mixed cuda + openmp code would use CUDA_VERSION
to choose intrinsics in some places and architecture number in others,
which seems likely to cause problems.

This revision was landed with ongoing or failed builds. Jan 14 2021, 6:16 PM
This revision was automatically updated to reflect the committed changes.