This is an archive of the discontinued LLVM Phabricator instance.

Differential D95313

[WIP] Move part of nvptx devicertl under clang
Abandoned · Public

Authored by JonChesterfield on Jan 24 2021, 10:36 AM.


Details

Reviewers

jdoerfert
tianshilei1992

Summary

[WIP] Move part of nvptx devicertl under clang

Example of moving the devicertl functions that depend on cuda
version under clang, so they can be injected at application
build time.

The original idea was to use the intrinsic definitions from
__clang_cuda_intrinsics, but that header needs a lot of cuda
specific setup to compile and includes part of the cuda sdk.
It's therefore difficult to compile as openmp.

This implements the code in headers and will work for c++ with
openmp, but not necessarily for C as the inline functions may not
be instantiated. It will also be a problem for fortran openmp.

I'm inclined to do something broadly equivalent to this, but in
the library. It means clang would need to link against devicertl.bc
and against a small cuda version specific devicertl_tbd.bc.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	510 ms	x64 debian > Clang.Driver::openmp-offload-gpu.c
	850 ms	x64 debian > Clang.Driver::openmp-offload.c
	230 ms	x64 windows > Clang.Driver::openmp-offload-gpu.c
	2,820 ms	x64 windows > Clang.Driver::openmp-offload.c

Event Timeline

JonChesterfield created this revision. · Jan 24 2021, 10:36 AM

Herald added subscribers: mgorny, jvesely. · Jan 24 2021, 10:36 AM

JonChesterfield requested review of this revision. · Jan 24 2021, 10:36 AM

Herald added projects: Restricted Project, Restricted Project. · Jan 24 2021, 10:36 AM

Herald added subscribers: openmp-commits, cfe-commits, sstefan1.

JonChesterfield added inline comments. · Jan 24 2021, 10:41 AM

clang/lib/Driver/ToolChains/Clang.cpp
1204	Logic very like this could pick out a second, small devicertl bitcode library
clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_ge90.h
18	linkonce and linkonce_odr can both be discarded, but these symbols need to survive until devicertl is linked
39	Calling into intrinsics would remove this messy expression, if we can work out how to reliably compile parts of the cuda sdk as openmp. These functions probably can't be instantiated on the host, so when changing the devicertl build over to openmp we will also need to guard these with variant / macro.
openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip
49	Changing from c++ to c name mangling seems fine here. We don't consistently use one or the other in the devicertl. Using c mangling is simpler if the implementation is in a header.
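
The header-related comments above (keeping the symbols alive until devicertl is linked, and using C name mangling) can be illustrated with a small host-side sketch. The function name `sketch_activemask` and its return value are placeholders invented for this example and are not part of the devicertl; `__attribute__((used))` is the GCC/Clang attribute that the patch applies through its DEVICE macro.

```cpp
#include <cassert>

// Hypothetical stand-in for a devicertl entry point defined in a header.
// extern "C" gives the symbol an unmangled name, so code compiled
// separately can refer to it by that plain name.
#ifdef __cplusplus
extern "C" {
#endif

// 'inline' alone lets the compiler drop the definition when nothing in this
// translation unit calls it; 'used' asks GCC/Clang to emit it regardless.
inline __attribute__((used)) unsigned sketch_activemask(void) {
  return 0xffffffffu; // placeholder value: pretend all 32 lanes are active
}

#ifdef __cplusplus
} // extern "C"
#endif
```

This only shows the shape of the declaration. Whether the emitted symbol survives later optimisation and linking stages in an offloading build is exactly the concern raised in the comment on line 18.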

Harbormaster completed remote builds in B86465: Diff 318850. · Jan 24 2021, 11:30 AM

In general we're moving in the direction where target-specific implementations are compiled along with user code, which is fantastic. That way, we only need to provide one bitcode library per target. The change in the FE is somewhat inefficient, though: if the user code has multiple files, the target-specific header is included multiple times and therefore compiled multiple times. A more efficient approach is to change the driver workflow, probably along these lines:

1. Compile target implementation t.bc
2. Link t.bc and libomptarget-[arch].bc into libomptarget.bc
3. Compile user code, which is also multiple steps. libomptarget.bc is fed into the FE in this step.
4. Remaining steps...
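
The ordering proposed above can be sketched as a small job planner. This is only an illustration of the argument that the target implementation is compiled once rather than once per user file. The function name `sketchPlanJobs` is hypothetical, and the strings are human-readable labels, not real driver command lines.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the ordered job descriptions for the proposed workflow.
std::vector<std::string>
sketchPlanJobs(const std::vector<std::string> &UserFiles) {
  std::vector<std::string> Jobs;
  // The first two steps happen once, independent of the number of user files.
  Jobs.push_back("compile target implementation -> t.bc");
  Jobs.push_back("link t.bc + libomptarget-[arch].bc -> libomptarget.bc");
  // The third step runs per user file and consumes the pre-linked bitcode.
  for (const std::string &File : UserFiles)
    Jobs.push_back("compile " + File + " with libomptarget.bc");
  return Jobs;
}
```

With two user files the plan has four jobs, of which only one compiles the target implementation. With the header approach in this diff, that code would be compiled once per user file.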

clang/lib/Driver/ToolChains/Clang.cpp
1204	can we just use one header with different macros, like what we're using now?
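
A single header that switches on a version macro, as suggested in the comment above, might look roughly like the following. This is a host-side sketch under assumptions: `SKETCH_CUDA_VERSION` is a hypothetical stand-in for `CUDA_VERSION` (which encodes CUDA 9.0 as 9000), and the returned strings merely name the code path that the real header would implement.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro standing in for CUDA_VERSION so the sketch builds on a
// host compiler without the CUDA SDK.
#ifndef SKETCH_CUDA_VERSION
#define SKETCH_CUDA_VERSION 9000
#endif

// One header, two bodies: the preprocessor keeps exactly one of them.
inline const char *sketchActivemaskImpl() {
#if SKETCH_CUDA_VERSION >= 9000
  return "activemask"; // CUDA >= 9.0 path
#else
  return "ballot"; // CUDA < 9.0 path
#endif
}
```

Compared with the two-header approach in this diff, the driver would define a macro instead of choosing which file to pass to -include.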

Abandoned in favour of multiple instantiations of the devicertl, which works across all languages and doesn't require hacks to clang

Revision Contents

Path                                                                    Size
clang/lib/Driver/ToolChains/Clang.cpp                                   12 lines
clang/lib/Headers/CMakeLists.txt                                        2 lines
clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_ge90.h  53 lines
clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_lt90.h  52 lines
openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h                 12 lines
openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip               10 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h                  9 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu                 48 lines

Diff 318850

clang/lib/Driver/ToolChains/Clang.cpp

[... 1,193 lines not shown; enclosing context: if (!Args.hasArg(options::OPT_nobuiltininc)) { ...]
       llvm::sys::path::append(P, "include");
       llvm::sys::path::append(P, "openmp_wrappers");
       CmdArgs.push_back("-internal-isystem");
       CmdArgs.push_back(Args.MakeArgString(P));
     }

     CmdArgs.push_back("-include");
     CmdArgs.push_back("__clang_openmp_device_functions.h");
+
+    {
+      auto CTC = static_cast<const toolchains::CudaToolChain *>(
+          C.getSingleOffloadToolChain<Action::OFK_Cuda>());
+      assert(CTC && "Expected valid CUDA Toolchain.");
+      CudaVersion Ver = CTC->CudaInstallation.version();
+      CmdArgs.push_back("-include");
+      const char *Header = (Ver >= CudaVersion::CUDA_90)
+                               ? "__clang_openmp_devicertl_cuda_ge90.h"
+                               : "__clang_openmp_devicertl_cuda_lt90.h";
+      CmdArgs.push_back(Header);
+    }
   }

   // Add -i* options, and automatically translate to
   // -include-pch/-include-pth for transparent PCH support. It's
   // wonky, but we include looking for .gch so we can support seamless
   // replacement into a build system already set up to be generating
   // .gch files.

[... 6,254 lines not shown ...]

Lint (pre-merge checks, clang-tidy):
warning: 'auto CTC' can be declared as 'const auto *CTC' [llvm-qualified-auto] (at the declaration of CTC)
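
The header selection added to the driver in this hunk can be exercised in isolation with a stand-in enum. `SketchCudaVersion` and `selectDeviceRtlHeader` are hypothetical names for this sketch; clang's real `CudaVersion` enum has more enumerators, and only the ordering relative to CUDA 9.0 matters here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for clang's CudaVersion enum.
enum class SketchCudaVersion { CUDA_80, CUDA_90, CUDA_100 };

// Mirrors the ternary in the patch: pick the wrapper header by CUDA version.
std::string selectDeviceRtlHeader(SketchCudaVersion Ver) {
  return Ver >= SketchCudaVersion::CUDA_90
             ? "__clang_openmp_devicertl_cuda_ge90.h"
             : "__clang_openmp_devicertl_cuda_lt90.h";
}
```

The same comparison could select a second, small bitcode library instead of a header, which is the alternative raised in the inline comment on line 1204.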

clang/lib/Headers/CMakeLists.txt

[... 154 lines not shown ...]
 )

 set(openmp_wrapper_files
   openmp_wrappers/math.h
   openmp_wrappers/cmath
   openmp_wrappers/complex.h
   openmp_wrappers/complex
   openmp_wrappers/__clang_openmp_device_functions.h
+  openmp_wrappers/__clang_openmp_devicertl_cuda_lt90.h
+  openmp_wrappers/__clang_openmp_devicertl_cuda_ge90.h
   openmp_wrappers/complex_cmath.h
   openmp_wrappers/new
 )

 set(output_dir ${LLVM_LIBRARY_OUTPUT_INTDIR}/clang/${CLANG_VERSION}/include)
 set(out_files)
 set(generated_files)

[... 74 lines not shown ...]

clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_ge90.h

This file was added.

//===--- __clang_openmp_devicertl_cuda_ge90.h -----------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#ifndef __CLANG_OPENMP_DEVICERTL_CUDA_GE90_H__
#define __CLANG_OPENMP_DEVICERTL_CUDA_GE90_H__

#ifdef __cplusplus
extern "C" {
#endif

#pragma push_macro("DEVICE")

#ifdef _OPENMP
#define DEVICE __attribute__((used))
#else
#define DEVICE __attribute__((used)) __attribute__((device))
#endif

// In Cuda 9.0, __ballot(1) from Cuda 8.0 is replaced with __activemask().
inline DEVICE unsigned __kmpc_impl_activemask() {
  unsigned mask;
  asm volatile("activemask.b32 %0;" : "=r"(mask));
  return mask;
}

// In Cuda 9.0, the *_sync() version takes an extra argument 'mask'.
inline DEVICE int __kmpc_impl_shfl_sync(unsigned Mask, int Var, int SrcLane) {
  int WARPSIZE = 32;
  return __nvvm_shfl_sync_idx_i32(Mask, Var, SrcLane, WARPSIZE - 1);
}

inline DEVICE int __kmpc_impl_shfl_down_sync(unsigned Mask, int Var,
                                             unsigned Delta, int Width) {
  int WARPSIZE = 32;
  int tmp = ((WARPSIZE - Width) << 8) | 0x1f;
  return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, tmp);
}

inline DEVICE void __kmpc_impl_syncwarp(unsigned Mask) {
  __nvvm_bar_warp_sync(Mask);
}

#pragma pop_macro("DEVICE")

#ifdef __cplusplus
} // extern "C"
#endif

#endif

Lint (pre-merge checks, clang-tidy):
warning: header guard does not follow preferred style [llvm-header-guard]
warning: 'device' attribute ignored [clang-diagnostic-ignored-attributes] (on each of the four DEVICE functions)
error: use of undeclared identifier '__nvvm_shfl_sync_idx_i32' [clang-diagnostic-error]
error: use of undeclared identifier '__nvvm_shfl_sync_down_i32' [clang-diagnostic-error]
error: use of undeclared identifier '__nvvm_bar_warp_sync' [clang-diagnostic-error]
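
The "messy expression" in `__kmpc_impl_shfl_down_sync` packs the segment width into the last operand of the shuffle intrinsic: a value derived from `Width` goes into bits 8 and up, and the low five bits are set to 0x1f. The host-side sketch below evaluates the same expression for a few widths. `packShflDownOperand` is a hypothetical helper name, and the range check is an assumption added for the example.

```cpp
#include <cassert>

// Evaluates the operand expression used in the header above.
inline int packShflDownOperand(int Width) {
  const int WARPSIZE = 32;
  // Assumption for this sketch: the width is between 1 and the warp size.
  assert(Width > 0 && Width <= WARPSIZE && "width must fit in a warp");
  return ((WARPSIZE - Width) << 8) | 0x1f;
}
```

For a full-warp shuffle (`Width == 32`) the upper field is zero and the operand is just 0x1f. For `Width == 16` it is 0x101f.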

clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_lt90.h

This file was added.

//===--- __clang_openmp_devicertl_cuda_lt90.h -----------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#ifndef __CLANG_OPENMP_DEVICERTL_CUDA_LT90_H__
#define __CLANG_OPENMP_DEVICERTL_CUDA_LT90_H__

#ifdef __cplusplus
extern "C" {
#endif

#pragma push_macro("DEVICE")

#ifdef _OPENMP
#define DEVICE __attribute__((used))
#else
#define DEVICE __attribute__((used)) __attribute__((device))
#endif

// In Cuda 9.0, __ballot(1) from Cuda 8.0 is replaced with __activemask().
inline DEVICE unsigned __kmpc_impl_activemask() {
  return __nvvm_vote_ballot(1);
}

// In Cuda 9.0, the *_sync() version takes an extra argument 'mask'.
inline DEVICE int __kmpc_impl_shfl_sync(unsigned Mask, int Var, int SrcLane) {
  int WARPSIZE = 32;
  return __nvvm_shfl_idx_i32(Var, SrcLane, WARPSIZE - 1);
}

inline DEVICE int __kmpc_impl_shfl_down_sync(unsigned Mask, int Var,
                                             unsigned Delta, int Width) {
  int WARPSIZE = 32;
  int tmp = ((WARPSIZE - Width) << 8) | 0x1f;
  return __nvvm_shfl_down_i32(Var, Delta, tmp);
}

inline DEVICE void __kmpc_impl_syncwarp(unsigned Mask) {
  (void)Mask;
  // In Cuda < 9.0 no need to sync threads in warps.
}

#pragma pop_macro("DEVICE")

#ifdef __cplusplus
} // extern "C"
#endif

#endif

Lint (pre-merge checks, clang-tidy):
warning: header guard does not follow preferred style [llvm-header-guard]
warning: 'device' attribute ignored [clang-diagnostic-ignored-attributes] (on each of the four DEVICE functions)
error: use of undeclared identifier '__nvvm_vote_ballot' [clang-diagnostic-error]
error: use of undeclared identifier '__nvvm_shfl_idx_i32' [clang-diagnostic-error]
error: use of undeclared identifier '__nvvm_shfl_down_i32' [clang-diagnostic-error]

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h

 //===------- target_impl.h - AMDGCN OpenMP GPU implementation ----- HIP -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // Declarations and definitions of target specific functions and constants
 //
 //===----------------------------------------------------------------------===//
 #ifndef OMPTARGET_AMDGCN_TARGET_IMPL_H
 #define OMPTARGET_AMDGCN_TARGET_IMPL_H

 #ifndef __AMDGCN__
 #error "amdgcn target_impl.h expects to be compiled under __AMDGCN__"
 #endif

 #include "amdgcn_interface.h"

 #include <assert.h>
 #include <inttypes.h>
 #include <stddef.h>
 #include <stdint.h>
[... 55 lines not shown ...]
 DEVICE __kmpc_impl_lanemask_t __kmpc_impl_lanemask_gt();
 DEVICE uint32_t __kmpc_impl_smid();
 DEVICE double __kmpc_impl_get_wtick();
 DEVICE double __kmpc_impl_get_wtime();

 INLINE uint64_t __kmpc_impl_ffs(uint64_t x) { return __builtin_ffsl(x); }
 INLINE uint64_t __kmpc_impl_popc(uint64_t x) { return __builtin_popcountl(x); }

-DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask();
+EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask();

-DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t Var,
+EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t Var,
                                      int32_t SrcLane);

-DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t Var,
+EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t Var,
                                           uint32_t Delta, int32_t Width);

+EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t);
+
 INLINE void __kmpc_impl_syncthreads() { __builtin_amdgcn_s_barrier(); }
-
-INLINE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t) {
-  // AMDGCN doesn't need to sync threads in a warp
-}

 // AMDGCN specific kernel initialization
 DEVICE void __kmpc_impl_target_init();

 // Equivalent to ptx bar.sync 1. Barrier until num_threads arrive.
 DEVICE void __kmpc_impl_named_sync(uint32_t num_threads);

 INLINE void __kmpc_impl_threadfence() {
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "agent");
 }

 INLINE void __kmpc_impl_threadfence_block() {
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup");
 }

 INLINE void __kmpc_impl_threadfence_system() {
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "");
 }

 // Calls to the AMDGCN layer (assuming 1D layout)
 INLINE int GetThreadIdInBlock() { return __builtin_amdgcn_workitem_id_x(); }
 INLINE int GetBlockIdInKernel() { return __builtin_amdgcn_workgroup_id_x(); }
 DEVICE int GetNumberOfBlocksInKernel();
 DEVICE int GetNumberOfThreadsInBlock();
 DEVICE unsigned GetWarpId();
 DEVICE unsigned GetLaneId();

 // Atomics
 DEVICE uint32_t __kmpc_atomic_add(uint32_t *, uint32_t);
 DEVICE uint32_t __kmpc_atomic_inc(uint32_t *, uint32_t);
[... 29 lines not shown ...]

Lint (pre-merge checks, clang-tidy):
error: "amdgcn target_impl.h expects to be compiled under __AMDGCN__" [clang-diagnostic-error]
warning: 'device' attribute ignored [clang-diagnostic-ignored-attributes] (on the __kmpc_impl_activemask, __kmpc_impl_shfl_sync, __kmpc_impl_shfl_down_sync, __kmpc_impl_syncwarp and __kmpc_impl_syncthreads lines)
warning: invalid case style for function '__kmpc_impl_activemask', '__kmpc_impl_shfl_sync', '__kmpc_impl_shfl_down_sync', '__kmpc_impl_syncwarp', '__kmpc_impl_syncthreads' [readability-identifier-naming]
error: use of undeclared identifier '__builtin_amdgcn_s_barrier' [clang-diagnostic-error]
error: use of undeclared identifier '__builtin_amdgcn_fence' [clang-diagnostic-error] (three uses)
error: use of undeclared identifier '__builtin_amdgcn_workitem_id_x' [clang-diagnostic-error]
error: use of undeclared identifier '__builtin_amdgcn_workgroup_id_x' [clang-diagnostic-error]

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip

[... 40 lines not shown; enclosing context: DEVICE double __kmpc_impl_get_wtime() { ...]
   // The intrinsics for measuring time have undocumented frequency
   // This will probably need to be found by measurement on a number of
   // architectures. Until then, return 0, which is very inaccurate as a
   // timer but resolves the undefined symbol at link time.
   return 0;
 }

 // Warp vote function
-DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
+EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
   return __builtin_amdgcn_read_exec();
 }

-DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t var,
+EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t var,
                                      int32_t srcLane) {
   int width = WARPSIZE;
   int self = GetLaneId();
   int index = srcLane + (self & ~(width - 1));
   return __builtin_amdgcn_ds_bpermute(index << 2, var);
 }

-DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t var,
+EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t var,
                                           uint32_t laneDelta, int32_t width) {
   int self = GetLaneId();
   int index = self + laneDelta;
   index = (int)(laneDelta + (self & (width - 1))) >= width ? self : index;
   return __builtin_amdgcn_ds_bpermute(index << 2, var);
 }

+EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t) {
+  // AMDGCN doesn't need to sync threads in a warp
+}
+
 static DEVICE SHARED uint32_t L1_Barrier;

 DEVICE void __kmpc_impl_target_init() {
   // Don't have global ctors, and shared memory is not zero init
   __atomic_store_n(&L1_Barrier, 0u, __ATOMIC_RELEASE);
 }

 DEVICE void __kmpc_impl_named_sync(uint32_t num_threads) {
[... 116 lines not shown ...]
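
The lane arithmetic in `__kmpc_impl_shfl_down_sync` above can be checked on the host without the AMDGCN builtins. The sketch below reproduces only the index selection; `sketchShflDownSourceLane` is a hypothetical name, and the power-of-two check on the width is an assumption added for this example rather than something the devicertl enforces.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of the lane selection in __kmpc_impl_shfl_down_sync.
// Returns the lane whose value lane `Self` would read.
inline int sketchShflDownSourceLane(int Self, std::uint32_t LaneDelta,
                                    std::int32_t Width) {
  // Assumption: Width is a power of two, so Width - 1 works as a mask.
  assert(Width > 0 && (Width & (Width - 1)) == 0 &&
         "width must be a power of two");
  int Index = Self + static_cast<int>(LaneDelta);
  // If the offset within the Width-sized segment would run past the end of
  // the segment, the lane reads its own value instead.
  return static_cast<int>(LaneDelta + (Self & (Width - 1))) >= Width ? Self
                                                                     : Index;
}
```

With `Width == 8` and `LaneDelta == 1`, lane 0 reads from lane 1, while lane 7 (the last lane of its segment) reads its own value.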

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

	//===------------ target_impl.h - NVPTX OpenMP GPU options ------- CUDA -*-===//			//===------------ target_impl.h - NVPTX OpenMP GPU options ------- CUDA -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Definitions of target specific functions			// Definitions of target specific functions
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	#ifndef _TARGET_IMPL_H_			#ifndef _TARGET_IMPL_H_
	#define _TARGET_IMPL_H_			#define _TARGET_IMPL_H_

	#include <assert.h>			#include <assert.h>
	#include <cuda.h>			#include <cuda.h>
				Lint: Pre-merge checks Inline Actions clang-tidy: error: 'cuda.h' file not found [clang-diagnostic-error] not useful Lint: Pre-merge checks: clang-tidy: error: 'cuda.h' file not found [clang-diagnostic-error] [[https://github.
	#include <inttypes.h>			#include <inttypes.h>
	#include <stdio.h>			#include <stdio.h>
	#include <stdlib.h>			#include <stdlib.h>

	#include "nvptx_interface.h"			#include "nvptx_interface.h"

	#define DEVICE __device__			#define DEVICE __device__
	#define INLINE __forceinline__ DEVICE			#define INLINE __forceinline__ DEVICE
	▲ Show 20 Lines • Show All 70 Lines • ▼ Show 20 Lines

	INLINE uint32_t __kmpc_impl_ffs(uint32_t x) { return __builtin_ffs(x); }			INLINE uint32_t __kmpc_impl_ffs(uint32_t x) { return __builtin_ffs(x); }
	INLINE uint32_t __kmpc_impl_popc(uint32_t x) { return __builtin_popcount(x); }			INLINE uint32_t __kmpc_impl_popc(uint32_t x) { return __builtin_popcount(x); }

	#ifndef CUDA_VERSION			#ifndef CUDA_VERSION
	#error CUDA_VERSION macro is undefined, something wrong with cuda.			#error CUDA_VERSION macro is undefined, something wrong with cuda.
	#endif			#endif

	DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask();			EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask();
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable '__kmpc_impl_lanemask_t' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable '__kmpc_impl_lanemask_t' [readability…

	DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,			EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'int32_t' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'int32_t' [readability-identifier-naming]…
	int32_t SrcLane);			int32_t SrcLane);

	DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,			EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'int32_t' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'int32_t' [readability-identifier-naming]…
	int32_t Var, uint32_t Delta,			int32_t Var, uint32_t Delta,
	int32_t Width);			int32_t Width);

				EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for function '__kmpc_impl_syncwarp' [readability-identifier-naming]

	DEVICE void __kmpc_impl_syncthreads();			DEVICE void __kmpc_impl_syncthreads();
	DEVICE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask);

	// NVPTX specific kernel initialization			// NVPTX specific kernel initialization
	DEVICE void __kmpc_impl_target_init();			DEVICE void __kmpc_impl_target_init();

	// Barrier until num_threads arrive.			// Barrier until num_threads arrive.
	DEVICE void __kmpc_impl_named_sync(uint32_t num_threads);			DEVICE void __kmpc_impl_named_sync(uint32_t num_threads);

	DEVICE void __kmpc_impl_threadfence();			DEVICE void __kmpc_impl_threadfence();
	[... 36 lines hidden ...]

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu

	[... 12 lines hidden ...]

	#include "target_impl.h"			#include "target_impl.h"
	#include "common/debug.h"			#include "common/debug.h"

	#include <cuda.h>			#include <cuda.h>

	// Forward declaration of CUDA primitives which will be eventually transformed			// Forward declaration of CUDA primitives which will be eventually transformed
	// into LLVM intrinsics.			// into LLVM intrinsics.
	extern "C" {
	unsigned int __activemask();
	unsigned int __ballot(unsigned);
	// The default argument here is based on NVIDIA's website
	// https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
	int __shfl_sync(unsigned mask, int val, int src_line, int width = WARPSIZE);
	int __shfl(int val, int src_line, int width = WARPSIZE);
	int __shfl_down(int var, unsigned delta, int width);
	int __shfl_down_sync(unsigned mask, int var, unsigned delta, int width);
	void __syncwarp(int mask);
	}

	DEVICE void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi) {			DEVICE void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi) {
	asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(val));			asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(val));
	}			}

	DEVICE uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi) {			DEVICE uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi) {
	uint64_t val;			uint64_t val;
	asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));			asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));
	[... 24 lines hidden ...]
	}			}
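
The `mov.b64` inline assembly in `__kmpc_impl_unpack` and `__kmpc_impl_pack` above splits a 64-bit value into two 32-bit halves and joins them back. A minimal host-side sketch of the same bit layout, written with shifts and masks, might look like the following. The names `unpack_portable` and `pack_portable` are hypothetical and are not part of the patch; the sketch assumes the low half maps to the first register of the PTX vector operand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable equivalent of __kmpc_impl_unpack:
// lo receives bits 0..31, hi receives bits 32..63.
void unpack_portable(uint64_t val, uint32_t &lo, uint32_t &hi) {
  lo = static_cast<uint32_t>(val & 0xFFFFFFFFu);
  hi = static_cast<uint32_t>(val >> 32);
}

// Hypothetical portable equivalent of __kmpc_impl_pack:
// reassembles the 64-bit value from its two halves.
uint64_t pack_portable(uint32_t lo, uint32_t hi) {
  return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

Packing the two halves produced by the unpack step should give back the original value, which is a quick way to check that the two halves are in the right order.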

	DEVICE double __kmpc_impl_get_wtime() {			DEVICE double __kmpc_impl_get_wtime() {
	unsigned long long nsecs;			unsigned long long nsecs;
	asm("mov.u64 %0, %%globaltimer;" : "=l"(nsecs));			asm("mov.u64 %0, %%globaltimer;" : "=l"(nsecs));
	return (double)nsecs * __kmpc_impl_get_wtick();			return (double)nsecs * __kmpc_impl_get_wtick();
	}			}

	// In Cuda 9.0, __ballot(1) from Cuda 8.0 is replaced with __activemask().
	DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
	#if CUDA_VERSION >= 9000
	return __activemask();
	#else
	return __ballot(1);
	#endif
	}
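
The function above picks between two primitives at preprocessing time, which is why the choice has to be made wherever `CUDA_VERSION` is known. A rough host-side sketch of that pattern follows. The stub functions and the `SKETCH_CUDA_VERSION` macro are hypothetical stand-ins for the real CUDA primitives and the real `CUDA_VERSION` macro, chosen so the snippet compiles without the CUDA SDK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CUDA_VERSION; defaults to the 9.0 encoding.
#ifndef SKETCH_CUDA_VERSION
#define SKETCH_CUDA_VERSION 9000
#endif

// Hypothetical stubs that pretend all 32 lanes of a warp are active.
inline uint32_t stub_activemask() { return 0xFFFFFFFFu; }
inline uint32_t stub_ballot(int predicate) {
  return predicate ? 0xFFFFFFFFu : 0u;
}

// Same shape as __kmpc_impl_activemask: one wrapper, two possible bodies,
// selected by the version macro when the translation unit is compiled.
uint32_t sketch_activemask() {
#if SKETCH_CUDA_VERSION >= 9000
  return stub_activemask();
#else
  return stub_ballot(1);
#endif
}
```

Because the branch is resolved by the preprocessor, a prebuilt library carries only one of the two bodies, which is the motivation the summary gives for moving these functions into headers compiled at application build time.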

	// In Cuda 9.0, the *_sync() version takes an extra argument 'mask'.
	DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
	int32_t SrcLane) {
	#if CUDA_VERSION >= 9000
	return __shfl_sync(Mask, Var, SrcLane);
	#else
	return __shfl(Var, SrcLane);
	#endif // CUDA_VERSION
	}

	DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
	int32_t Var, uint32_t Delta,
	int32_t Width) {
	#if CUDA_VERSION >= 9000
	return __shfl_down_sync(Mask, Var, Delta, Width);
	#else
	return __shfl_down(Var, Delta, Width);
	#endif // CUDA_VERSION
	}

	DEVICE void __kmpc_impl_syncthreads() { __syncthreads(); }			DEVICE void __kmpc_impl_syncthreads() { __syncthreads(); }

	DEVICE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask) {
	#if CUDA_VERSION >= 9000
	__syncwarp(Mask);
	#else
	// In Cuda < 9.0 no need to sync threads in warps.
	#endif // CUDA_VERSION
	}

	// NVPTX specific kernel initialization			// NVPTX specific kernel initialization
	DEVICE void __kmpc_impl_target_init() { /* nvptx needs no extra setup */			DEVICE void __kmpc_impl_target_init() { /* nvptx needs no extra setup */
	}			}

	// Barrier until num_threads arrive.			// Barrier until num_threads arrive.
	DEVICE void __kmpc_impl_named_sync(uint32_t num_threads) {			DEVICE void __kmpc_impl_named_sync(uint32_t num_threads) {
	// The named barrier for active parallel threads of a team in an L1 parallel			// The named barrier for active parallel threads of a team in an L1 parallel
	// region to synchronize with each other.			// region to synchronize with each other.
	[... 97 lines hidden ...]