This is an archive of the discontinued LLVM Phabricator instance.

Differential D19990

[CUDA] Implement __ldg using intrinsics.
Closed, Public

Authored by jlebar on May 5 2016, 1:02 PM.

Details

Reviewers

tra
rnk
rsmith

Commits

rG2e4ecfdebe8f: [CUDA] Implement __ldg using intrinsics.
rC270150: [CUDA] Implement __ldg using intrinsics.
rL270150: [CUDA] Implement __ldg using intrinsics.

Summary

Previously it was implemented as inline asm in the CUDA headers.

This change allows us to use the [addr+imm] addressing mode when
executing ld.global.nc instructions. This translates into a 1.3x
speedup on some benchmarks that call this instruction from within an
unrolled loop.
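
To illustrate the access pattern the summary refers to, here is a minimal host-side C++ sketch, not the benchmark code. `ldg_standin` is a hypothetical placeholder for `__ldg` (a plain dereference here), and `sum_unrolled4` is likewise invented for illustration. In the unrolled body every load shares one base pointer and differs only by a constant offset, which is the shape that can map onto [addr+imm] operands once the load is visible to the compiler as an intrinsic rather than as opaque inline asm.

```cpp
#include <cstddef>

// Hypothetical host-side stand-in for __ldg. On a GPU this would be a
// read-only global load (ld.global.nc); here it is a plain dereference.
template <typename T>
inline T ldg_standin(const T *ptr) {
  return *ptr;
}

// Sums n floats, with the main loop unrolled by four.
inline float sum_unrolled4(const float *data, std::size_t n) {
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    const float *base = data + i;
    // Same base address, constant offsets 0..3.
    acc += ldg_standin(base + 0);
    acc += ldg_standin(base + 1);
    acc += ldg_standin(base + 2);
    acc += ldg_standin(base + 3);
  }
  // Handle any leftover elements one at a time.
  for (; i < n; ++i)
    acc += ldg_standin(data + i);
  return acc;
}
```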

Event Timeline

jlebar updated this revision to Diff 56331. May 5 2016, 1:02 PM

jlebar retitled this revision to "[CUDA] Implement __ldg using intrinsics."

jlebar updated this object.

jlebar added reviewers: tra, rsmith.

jlebar added subscribers: cfe-commits, jhen.

Herald added a subscriber: jholewinski. May 5 2016, 1:02 PM

majnemer added a subscriber: majnemer. May 5 2016, 1:30 PM

majnemer added inline comments.

include/clang/Basic/BuiltinsNVPTX.def
569–603	Would it be crazy to instead provide a generic builtin? Would cut down on the number of variants... `__builtin_add_overflow` is an example of such a builtin.

jlebar added inline comments. May 5 2016, 1:40 PM

include/clang/Basic/BuiltinsNVPTX.def
569–603	Art is going to send you flowers. :) He and I just had an argument about this. I think this isn't an unreasonable thing to want, but I think it's beneficial to be consistent with our existing API. So if we offer a generic thing for ldg, it would be nice to have one for atomics above, which are basically the same. So I told Art I'd prefer to add it to our list.

jlebar added inline comments. May 5 2016, 1:43 PM

include/clang/Basic/BuiltinsNVPTX.def
569–603	Oh, another thing is that, you really see the benefit of having a generic builtin when you start hitting the combinatorial explosion of all the different kinds of loads. Like, as-is it's not so bad, but if you want to support all forms of ld.global.nc, there are four different caching behaviors. Supporting all forms of ld is way worse. Which is to say, if we're going to do the generic thing, it seems like we benefit the most by making it generic on more than the types. But we're not ready to do that; I don't think most of these loads even exist in llvm atm. http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld

Art pointed me to the fact that CUDA 8 adds a bunch more load intrinsics, and I said ohmygosh maybe we *do* want to do the variadic intrinsic thing here.

But now looking at how __builtin_add_overflow is implemented, we'd need special sema checking to make it work. We would also need some sort of argument promotion logic to make the value and pointer into the same types. In both cases it seems like maybe it's better to leave this stuff to clang, rather than trying to write a buggy implementation ourselves?

Even with the many new load intrinsics, listing all the intrinsics is a relatively small part of the code required. The majority of the code necessary is in our CUDA header, but even with a variadic builtin, that would be hard to reduce without some serious template magic, and that would be doubly difficult to do without exposing crummy diagnostics to users.

What do you all think?

OK. Let's stick with __ldg for now.

Art pointed out that static_assert is c++11-only.

I'll just remove them and make a note to move them into the CUDA test-suite stuff Art is working on.

Remove static_asserts.

jlebar added a reviewer: rnk. May 12 2016, 2:29 PM

Friendly ping. This is a big help with some Tensorflow benchmarks.

rsmith added inline comments. May 17 2016, 12:16 PM

include/clang/Basic/BuiltinsNVPTX.def
569–603	It sounds like the combinatorial explosion will be unmanageable if we don't switch to using a generic builtin for the full suite of 'ld' operations, so it seems worthwhile to do that now. This would also be consistent with how we handle the somewhat-similar builtin `__builtin_nontemporal_load`.
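
As a rough illustration of the builtin named in this comment: `__builtin_nontemporal_load` is a Clang builtin that takes a pointer and returns the pointee, with one name covering several types. The sketch below assumes it may not be available on every compiler, so it falls back to a plain load; the macro `SKETCH_HAS_NONTEMPORAL_LOAD` and the function `load_streaming` are hypothetical names used only here.

```cpp
// Detect the Clang builtin; other compilers fall back to a plain load.
#if defined(__has_builtin)
#if __has_builtin(__builtin_nontemporal_load)
#define SKETCH_HAS_NONTEMPORAL_LOAD 1
#endif
#endif

// One builtin name serves both overloads; only the pointer type differs.
inline int load_streaming(int *p) {
#ifdef SKETCH_HAS_NONTEMPORAL_LOAD
  return __builtin_nontemporal_load(p);
#else
  return *p;
#endif
}

inline float load_streaming(float *p) {
#ifdef SKETCH_HAS_NONTEMPORAL_LOAD
  return __builtin_nontemporal_load(p);
#else
  return *p;
#endif
}
```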

After offline discussion: we don't know for sure whether we're going to hit the combinatorial explosion in future or not. Let's go ahead with this as-is for now, then, with the explicit acknowledgement that we reserve the right to replace these builtins with a single type-generic builtin in the future.

This revision is now accepted and ready to land. May 19 2016, 3:36 PM

Closed by commit rL270150: [CUDA] Implement __ldg using intrinsics. (authored by jlebar). May 19 2016, 3:55 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                          Size
include/clang/Basic/BuiltinsNVPTX.def         36 lines
lib/CodeGen/CGBuiltin.cpp                     45 lines
lib/Headers/CMakeLists.txt                    1 line
lib/Headers/__clang_cuda_intrinsics.h         256 lines
lib/Headers/__clang_cuda_runtime_wrapper.h    6 lines
test/CodeGen/builtins-nvptx.c                 106 lines

Diff 56603

include/clang/Basic/BuiltinsNVPTX.def

	Show First 20 Lines • Show All 560 Lines • ▼ Show 20 Lines
	BUILTIN(__nvvm_atom_cas_g_ll, "LLiLLiD*1LLiLLi", "n")			BUILTIN(__nvvm_atom_cas_g_ll, "LLiLLiD*1LLiLLi", "n")
	BUILTIN(__nvvm_atom_cas_s_ll, "LLiLLiD*3LLiLLi", "n")			BUILTIN(__nvvm_atom_cas_s_ll, "LLiLLiD*3LLiLLi", "n")
	BUILTIN(__nvvm_atom_cas_gen_ll, "LLiLLiD*LLiLLi", "n")			BUILTIN(__nvvm_atom_cas_gen_ll, "LLiLLiD*LLiLLi", "n")

	// Compiler Error Warn			// Compiler Error Warn
	BUILTIN(__nvvm_compiler_error, "vcC*4", "n")			BUILTIN(__nvvm_compiler_error, "vcC*4", "n")
	BUILTIN(__nvvm_compiler_warn, "vcC*4", "n")			BUILTIN(__nvvm_compiler_warn, "vcC*4", "n")

				// __ldg. This is not implemented as a builtin by nvcc.
				BUILTIN(__nvvm_ldg_c, "ccC*", "")
				BUILTIN(__nvvm_ldg_s, "ssC*", "")
				BUILTIN(__nvvm_ldg_i, "iiC*", "")
				BUILTIN(__nvvm_ldg_l, "LiLiC*", "")
				BUILTIN(__nvvm_ldg_ll, "LLiLLiC*", "")

				BUILTIN(__nvvm_ldg_uc, "UcUcC*", "")
				BUILTIN(__nvvm_ldg_us, "UsUsC*", "")
				BUILTIN(__nvvm_ldg_ui, "UiUiC*", "")
				BUILTIN(__nvvm_ldg_ul, "ULiULiC*", "")
				BUILTIN(__nvvm_ldg_ull, "ULLiULLiC*", "")

				BUILTIN(__nvvm_ldg_f, "ffC*", "")
				BUILTIN(__nvvm_ldg_d, "ddC*", "")

				BUILTIN(__nvvm_ldg_c2, "E2cE2cC*", "")
				BUILTIN(__nvvm_ldg_c4, "E4cE4cC*", "")
				BUILTIN(__nvvm_ldg_s2, "E2sE2sC*", "")
				BUILTIN(__nvvm_ldg_s4, "E4sE4sC*", "")
				BUILTIN(__nvvm_ldg_i2, "E2iE2iC*", "")
				BUILTIN(__nvvm_ldg_i4, "E4iE4iC*", "")
				BUILTIN(__nvvm_ldg_ll2, "E2LLiE2LLiC*", "")

				BUILTIN(__nvvm_ldg_uc2, "E2UcE2UcC*", "")
				BUILTIN(__nvvm_ldg_uc4, "E4UcE4UcC*", "")
				BUILTIN(__nvvm_ldg_us2, "E2UsE2UsC*", "")
				BUILTIN(__nvvm_ldg_us4, "E4UsE4UsC*", "")
				BUILTIN(__nvvm_ldg_ui2, "E2UiE2UiC*", "")
				BUILTIN(__nvvm_ldg_ui4, "E4UiE4UiC*", "")
				BUILTIN(__nvvm_ldg_ull2, "E2ULLiE2ULLiC*", "")

				BUILTIN(__nvvm_ldg_f2, "E2fE2fC*", "")
				BUILTIN(__nvvm_ldg_f4, "E4fE4fC*", "")
				BUILTIN(__nvvm_ldg_d2, "E2dE2dC*", "")

	#undef BUILTIN			#undef BUILTIN

lib/CodeGen/CGBuiltin.cpp

Show First 20 Lines • Show All 7,343 Lines • ▼ Show 20 Lines	#undef INTRINSIC_WITH_CC

default:		default:
return nullptr;		return nullptr;
}		}
}		}

Value *CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID,		Value *CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID,
const CallExpr *E) {		const CallExpr *E) {
		auto MakeLdg = [&](unsigned IntrinsicID) {
		Value *Ptr = EmitScalarExpr(E->getArg(0));
		AlignmentSource AlignSource;
		clang::CharUnits Align =
		getNaturalPointeeTypeAlignment(E->getArg(0)->getType(), &AlignSource);
		return Builder.CreateCall(
		CGM.getIntrinsic(IntrinsicID, {Ptr->getType()->getPointerElementType(),
		Ptr->getType()}),
		{Ptr, ConstantInt::get(Builder.getInt32Ty(), Align.getQuantity())});
		};

switch (BuiltinID) {		switch (BuiltinID) {
case NVPTX::BI__nvvm_atom_add_gen_i:		case NVPTX::BI__nvvm_atom_add_gen_i:
case NVPTX::BI__nvvm_atom_add_gen_l:		case NVPTX::BI__nvvm_atom_add_gen_l:
case NVPTX::BI__nvvm_atom_add_gen_ll:		case NVPTX::BI__nvvm_atom_add_gen_ll:
return MakeBinaryAtomicValue(*this, llvm::AtomicRMWInst::Add, E);		return MakeBinaryAtomicValue(*this, llvm::AtomicRMWInst::Add, E);

case NVPTX::BI__nvvm_atom_sub_gen_i:		case NVPTX::BI__nvvm_atom_sub_gen_i:
case NVPTX::BI__nvvm_atom_sub_gen_l:		case NVPTX::BI__nvvm_atom_sub_gen_l:
▲ Show 20 Lines • Show All 68 Lines • ▼ Show 20 Lines	Value *CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID,
case NVPTX::BI__nvvm_atom_dec_gen_ui: {		case NVPTX::BI__nvvm_atom_dec_gen_ui: {
Value *Ptr = EmitScalarExpr(E->getArg(0));		Value *Ptr = EmitScalarExpr(E->getArg(0));
Value *Val = EmitScalarExpr(E->getArg(1));		Value *Val = EmitScalarExpr(E->getArg(1));
Value *FnALD32 =		Value *FnALD32 =
CGM.getIntrinsic(Intrinsic::nvvm_atomic_load_dec_32, Ptr->getType());		CGM.getIntrinsic(Intrinsic::nvvm_atomic_load_dec_32, Ptr->getType());
return Builder.CreateCall(FnALD32, {Ptr, Val});		return Builder.CreateCall(FnALD32, {Ptr, Val});
}		}

		case NVPTX::BI__nvvm_ldg_c:
		case NVPTX::BI__nvvm_ldg_c2:
		case NVPTX::BI__nvvm_ldg_c4:
		case NVPTX::BI__nvvm_ldg_s:
		case NVPTX::BI__nvvm_ldg_s2:
		case NVPTX::BI__nvvm_ldg_s4:
		case NVPTX::BI__nvvm_ldg_i:
		case NVPTX::BI__nvvm_ldg_i2:
		case NVPTX::BI__nvvm_ldg_i4:
		case NVPTX::BI__nvvm_ldg_l:
		case NVPTX::BI__nvvm_ldg_ll:
		case NVPTX::BI__nvvm_ldg_ll2:
		case NVPTX::BI__nvvm_ldg_uc:
		case NVPTX::BI__nvvm_ldg_uc2:
		case NVPTX::BI__nvvm_ldg_uc4:
		case NVPTX::BI__nvvm_ldg_us:
		case NVPTX::BI__nvvm_ldg_us2:
		case NVPTX::BI__nvvm_ldg_us4:
		case NVPTX::BI__nvvm_ldg_ui:
		case NVPTX::BI__nvvm_ldg_ui2:
		case NVPTX::BI__nvvm_ldg_ui4:
		case NVPTX::BI__nvvm_ldg_ul:
		case NVPTX::BI__nvvm_ldg_ull:
		case NVPTX::BI__nvvm_ldg_ull2:
		// PTX Interoperability section 2.2: "For a vector with an even number of
		// elements, its alignment is set to number of elements times the alignment
		// of its member: n*alignof(t)."
		return MakeLdg(Intrinsic::nvvm_ldg_global_i);
		case NVPTX::BI__nvvm_ldg_f:
		case NVPTX::BI__nvvm_ldg_f2:
		case NVPTX::BI__nvvm_ldg_f4:
		case NVPTX::BI__nvvm_ldg_d:
		case NVPTX::BI__nvvm_ldg_d2:
		return MakeLdg(Intrinsic::nvvm_ldg_global_f);
default:		default:
return nullptr;		return nullptr;
}		}
}		}

Value *CodeGenFunction::EmitWebAssemblyBuiltinExpr(unsigned BuiltinID,		Value *CodeGenFunction::EmitWebAssemblyBuiltinExpr(unsigned BuiltinID,
const CallExpr *E) {		const CallExpr *E) {
switch (BuiltinID) {		switch (BuiltinID) {
Show All 15 Lines

lib/Headers/CMakeLists.txt

Show All 15 Lines	set(files
avx512vldqintrin.h		avx512vldqintrin.h
avx512vbmiintrin.h		avx512vbmiintrin.h
avx512vbmivlintrin.h		avx512vbmivlintrin.h
pkuintrin.h		pkuintrin.h
avxintrin.h		avxintrin.h
bmi2intrin.h		bmi2intrin.h
bmiintrin.h		bmiintrin.h
__clang_cuda_cmath.h		__clang_cuda_cmath.h
		__clang_cuda_intrinsics.h
__clang_cuda_math_forward_declares.h		__clang_cuda_math_forward_declares.h
__clang_cuda_runtime_wrapper.h		__clang_cuda_runtime_wrapper.h
cpuid.h		cpuid.h
cuda_builtin_vars.h		cuda_builtin_vars.h
emmintrin.h		emmintrin.h
f16cintrin.h		f16cintrin.h
float.h		float.h
fma4intrin.h		fma4intrin.h
▲ Show 20 Lines • Show All 93 Lines • Show Last 20 Lines

lib/Headers/__clang_cuda_intrinsics.h

This file was added.

				/*===--- __clang_cuda_intrinsics.h - Device-side CUDA intrinsic wrappers ---===
				*
				* Permission is hereby granted, free of charge, to any person obtaining a copy
				* of this software and associated documentation files (the "Software"), to deal
				* in the Software without restriction, including without limitation the rights
				* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
				* copies of the Software, and to permit persons to whom the Software is
				* furnished to do so, subject to the following conditions:
				*
				* The above copyright notice and this permission notice shall be included in
				* all copies or substantial portions of the Software.
				*
				* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
				* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
				* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
				* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
				* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
				* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
				* THE SOFTWARE.
				*
				*===-----------------------------------------------------------------------===
				*/
				#ifndef __CLANG_CUDA_INTRINSICS_H__
				#define __CLANG_CUDA_INTRINSICS_H__
				#ifndef __CUDA__
				#error "This file is for CUDA compilation only."
				#endif

				// sm_32 intrinsics: __ldg and __funnelshift_{l,lc,r,rc}.

				// Prevent the vanilla sm_32 intrinsics header from being included.
				#define __SM_32_INTRINSICS_H__
				#define __SM_32_INTRINSICS_HPP__

				#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 320

				inline __device__ char __ldg(const char *ptr) { return __nvvm_ldg_c(ptr); }
				inline __device__ short __ldg(const short *ptr) { return __nvvm_ldg_s(ptr); }
				inline __device__ int __ldg(const int *ptr) { return __nvvm_ldg_i(ptr); }
				inline __device__ long __ldg(const long *ptr) { return __nvvm_ldg_l(ptr); }
				inline __device__ long long __ldg(const long long *ptr) {
				return __nvvm_ldg_ll(ptr);
				}
				inline __device__ unsigned char __ldg(const unsigned char *ptr) {
				return __nvvm_ldg_uc(ptr);
				}
				inline __device__ unsigned short __ldg(const unsigned short *ptr) {
				return __nvvm_ldg_us(ptr);
				}
				inline __device__ unsigned int __ldg(const unsigned int *ptr) {
				return __nvvm_ldg_ui(ptr);
				}
				inline __device__ unsigned long __ldg(const unsigned long *ptr) {
				return __nvvm_ldg_ul(ptr);
				}
				inline __device__ unsigned long long __ldg(const unsigned long long *ptr) {
				return __nvvm_ldg_ull(ptr);
				}
				inline __device__ float __ldg(const float *ptr) { return __nvvm_ldg_f(ptr); }
				inline __device__ double __ldg(const double *ptr) { return __nvvm_ldg_d(ptr); }

				inline __device__ char2 __ldg(const char2 *ptr) {
				typedef char c2 __attribute__((ext_vector_type(2)));
				// We can assume that ptr is aligned at least to char2's alignment, but the
				// load will assume that ptr is aligned to c2's alignment. This is only
				// safe if alignof(c2) <= alignof(char2).
				c2 rv = __nvvm_ldg_c2(reinterpret_cast<const c2 *>(ptr));
				char2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ char4 __ldg(const char4 *ptr) {
				typedef char c4 __attribute__((ext_vector_type(4)));
				c4 rv = __nvvm_ldg_c4(reinterpret_cast<const c4 *>(ptr));
				char4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ short2 __ldg(const short2 *ptr) {
				typedef short s2 __attribute__((ext_vector_type(2)));
				s2 rv = __nvvm_ldg_s2(reinterpret_cast<const s2 *>(ptr));
				short2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ short4 __ldg(const short4 *ptr) {
				typedef short s4 __attribute__((ext_vector_type(4)));
				s4 rv = __nvvm_ldg_s4(reinterpret_cast<const s4 *>(ptr));
				short4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ int2 __ldg(const int2 *ptr) {
				typedef int i2 __attribute__((ext_vector_type(2)));
				i2 rv = __nvvm_ldg_i2(reinterpret_cast<const i2 *>(ptr));
				int2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ int4 __ldg(const int4 *ptr) {
				typedef int i4 __attribute__((ext_vector_type(4)));
				i4 rv = __nvvm_ldg_i4(reinterpret_cast<const i4 *>(ptr));
				int4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ longlong2 __ldg(const longlong2 *ptr) {
				typedef long long ll2 __attribute__((ext_vector_type(2)));
				ll2 rv = __nvvm_ldg_ll2(reinterpret_cast<const ll2 *>(ptr));
				longlong2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}

				inline __device__ uchar2 __ldg(const uchar2 *ptr) {
				typedef unsigned char uc2 __attribute__((ext_vector_type(2)));
				uc2 rv = __nvvm_ldg_uc2(reinterpret_cast<const uc2 *>(ptr));
				uchar2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ uchar4 __ldg(const uchar4 *ptr) {
				typedef unsigned char uc4 __attribute__((ext_vector_type(4)));
				uc4 rv = __nvvm_ldg_uc4(reinterpret_cast<const uc4 *>(ptr));
				uchar4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ ushort2 __ldg(const ushort2 *ptr) {
				typedef unsigned short us2 __attribute__((ext_vector_type(2)));
				us2 rv = __nvvm_ldg_us2(reinterpret_cast<const us2 *>(ptr));
				ushort2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ ushort4 __ldg(const ushort4 *ptr) {
				typedef unsigned short us4 __attribute__((ext_vector_type(4)));
				us4 rv = __nvvm_ldg_us4(reinterpret_cast<const us4 *>(ptr));
				ushort4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ uint2 __ldg(const uint2 *ptr) {
				typedef unsigned int ui2 __attribute__((ext_vector_type(2)));
				ui2 rv = __nvvm_ldg_ui2(reinterpret_cast<const ui2 *>(ptr));
				uint2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ uint4 __ldg(const uint4 *ptr) {
				typedef unsigned int ui4 __attribute__((ext_vector_type(4)));
				ui4 rv = __nvvm_ldg_ui4(reinterpret_cast<const ui4 *>(ptr));
				uint4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ ulonglong2 __ldg(const ulonglong2 *ptr) {
				typedef unsigned long long ull2 __attribute__((ext_vector_type(2)));
				ull2 rv = __nvvm_ldg_ull2(reinterpret_cast<const ull2 *>(ptr));
				ulonglong2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}

				inline __device__ float2 __ldg(const float2 *ptr) {
				typedef float f2 __attribute__((ext_vector_type(2)));
				f2 rv = __nvvm_ldg_f2(reinterpret_cast<const f2 *>(ptr));
				float2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}
				inline __device__ float4 __ldg(const float4 *ptr) {
				typedef float f4 __attribute__((ext_vector_type(4)));
				f4 rv = __nvvm_ldg_f4(reinterpret_cast<const f4 *>(ptr));
				float4 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				ret.z = rv[2];
				ret.w = rv[3];
				return ret;
				}
				inline __device__ double2 __ldg(const double2 *ptr) {
				typedef double d2 __attribute__((ext_vector_type(2)));
				d2 rv = __nvvm_ldg_d2(reinterpret_cast<const d2 *>(ptr));
				double2 ret;
				ret.x = rv[0];
				ret.y = rv[1];
				return ret;
				}

				// TODO: Implement these as intrinsics, so the backend can work its magic on
				// these. Alternatively, we could implement these as plain C and try to get
				// llvm to recognize the relevant patterns.
				inline __device__ unsigned __funnelshift_l(unsigned low32, unsigned high32,
				unsigned shiftWidth) {
				unsigned result;
				asm("shf.l.wrap.b32 %0, %1, %2, %3;"
				: "=r"(result)
				: "r"(low32), "r"(high32), "r"(shiftWidth));
				return result;
				}
				inline __device__ unsigned __funnelshift_lc(unsigned low32, unsigned high32,
				unsigned shiftWidth) {
				unsigned result;
				asm("shf.l.clamp.b32 %0, %1, %2, %3;"
				: "=r"(result)
				: "r"(low32), "r"(high32), "r"(shiftWidth));
				return result;
				}
				inline __device__ unsigned __funnelshift_r(unsigned low32, unsigned high32,
				unsigned shiftWidth) {
				unsigned result;
				asm("shf.r.wrap.b32 %0, %1, %2, %3;"
				: "=r"(result)
				: "r"(low32), "r"(high32), "r"(shiftWidth));
				return result;
				}
				inline __device__ unsigned __funnelshift_rc(unsigned low32, unsigned high32,
				unsigned shiftWidth) {
				unsigned ret;
				asm("shf.r.clamp.b32 %0, %1, %2, %3;"
				: "=r"(ret)
				: "r"(low32), "r"(high32), "r"(shiftWidth));
				return ret;
				}

				#endif // !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 320

				#endif // defined(__CLANG_CUDA_INTRINSICS_H__)
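
The TODO above the `__funnelshift_*` wrappers mentions a plain C alternative to the inline asm. A minimal sketch of what that could look like follows, assuming the documented CUDA semantics (the 64-bit value `high32:low32` is shifted; the wrap variants use `shiftWidth & 31`, the clamp variants use `min(shiftWidth, 32)`; left shifts return the upper 32 bits, right shifts the lower 32 bits). The `*_ref` names are hypothetical, and nothing here claims that LLVM would actually recognize these patterns and emit `shf`.

```cpp
#include <cstdint>

// Concatenate high32:low32 into one 64-bit value.
inline std::uint64_t funnel_concat(std::uint32_t lo, std::uint32_t hi) {
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Left shift, wrap mode: shift amount is taken modulo 32.
inline std::uint32_t funnelshift_l_ref(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
  return static_cast<std::uint32_t>((funnel_concat(lo, hi) << (shift & 31)) >> 32);
}

// Left shift, clamp mode: shift amount is limited to 32.
inline std::uint32_t funnelshift_lc_ref(std::uint32_t lo, std::uint32_t hi,
                                        std::uint32_t shift) {
  std::uint32_t s = shift < 32 ? shift : 32;
  return static_cast<std::uint32_t>((funnel_concat(lo, hi) << s) >> 32);
}

// Right shift, wrap mode.
inline std::uint32_t funnelshift_r_ref(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
  return static_cast<std::uint32_t>(funnel_concat(lo, hi) >> (shift & 31));
}

// Right shift, clamp mode.
inline std::uint32_t funnelshift_rc_ref(std::uint32_t lo, std::uint32_t hi,
                                        std::uint32_t shift) {
  std::uint32_t s = shift < 32 ? shift : 32;
  return static_cast<std::uint32_t>(funnel_concat(lo, hi) >> s);
}
```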

lib/Headers/__clang_cuda_runtime_wrapper.h

Show First 20 Lines • Show All 182 Lines • ▼ Show 20 Lines
#include "device_atomic_functions.hpp"		#include "device_atomic_functions.hpp"
#include "device_functions.hpp"		#include "device_functions.hpp"
#include "sm_20_atomic_functions.hpp"		#include "sm_20_atomic_functions.hpp"
#include "sm_20_intrinsics.hpp"		#include "sm_20_intrinsics.hpp"
#include "sm_32_atomic_functions.hpp"		#include "sm_32_atomic_functions.hpp"
// sm_30_intrinsics.h has declarations that use default argument, so		// sm_30_intrinsics.h has declarations that use default argument, so
// we have to include it and it will in turn include .hpp		// we have to include it and it will in turn include .hpp
#include "sm_30_intrinsics.h"		#include "sm_30_intrinsics.h"
#include "sm_32_intrinsics.hpp"
		// Don't include sm_32_intrinsics.h. That header defines __ldg using inline
		// asm, but we want to define it using builtins, because we can't use the
		// [addr+imm] addressing mode if we use the inline asm in the header.

#undef __MATH_FUNCTIONS_HPP__		#undef __MATH_FUNCTIONS_HPP__

// math_functions.hpp defines ::signbit as a __host__ __device__ function. This		// math_functions.hpp defines ::signbit as a __host__ __device__ function. This
// conflicts with libstdc++'s constexpr ::signbit, so we have to rename		// conflicts with libstdc++'s constexpr ::signbit, so we have to rename
// math_functions.hpp's ::signbit. It's guarded by #undef signbit, but that's		// math_functions.hpp's ::signbit. It's guarded by #undef signbit, but that's
// conditional on __GNUC__. :)		// conditional on __GNUC__. :)
#pragma push_macro("signbit")		#pragma push_macro("signbit")
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	__device__ inline __cuda_builtin_blockDim_t::operator dim3() const {
return dim3(x, y, z);		return dim3(x, y, z);
}		}

__device__ inline __cuda_builtin_gridDim_t::operator dim3() const {		__device__ inline __cuda_builtin_gridDim_t::operator dim3() const {
return dim3(x, y, z);		return dim3(x, y, z);
}		}

#include <__clang_cuda_cmath.h>		#include <__clang_cuda_cmath.h>
		#include <__clang_cuda_intrinsics.h>

// curand_mtgp32_kernel helpfully redeclares blockDim and threadIdx in host		// curand_mtgp32_kernel helpfully redeclares blockDim and threadIdx in host
// mode, giving them their "proper" types of dim3 and uint3. This is		// mode, giving them their "proper" types of dim3 and uint3. This is
// incompatible with the types we give in cuda_builtin_vars.h. As a hack,		// incompatible with the types we give in cuda_builtin_vars.h. As a hack,
// force-include the header (nvcc doesn't include it by default) but redefine		// force-include the header (nvcc doesn't include it by default) but redefine
// dim3 and uint3 to our builtin types. (Thankfully dim3 and uint3 are only		// dim3 and uint3 to our builtin types. (Thankfully dim3 and uint3 are only
// used here for the redeclarations of blockDim and threadIdx.)		// used here for the redeclarations of blockDim and threadIdx.)
#pragma push_macro("dim3")		#pragma push_macro("dim3")
Show All 9 Lines

test/CodeGen/builtins-nvptx.c

// REQUIRES: nvptx-registered-target		// REQUIRES: nvptx-registered-target
// RUN: %clang_cc1 -triple nvptx-unknown-unknown -fcuda-is-device -S -emit-llvm -o - -x cuda %s | FileCheck %s		// RUN: %clang_cc1 -triple nvptx-unknown-unknown -fcuda-is-device -S -emit-llvm -o - -x cuda %s | \
// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -fcuda-is-device -S -emit-llvm -o - -x cuda %s | FileCheck %s		// RUN: FileCheck -check-prefix=CHECK -check-prefix=LP32 %s
		// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -fcuda-is-device -S -emit-llvm -o - -x cuda %s | \
		// RUN: FileCheck -check-prefix=CHECK -check-prefix=LP64 %s

#define __device__ __attribute__((device))		#define __device__ __attribute__((device))
#define __global__ __attribute__((global))		#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))		#define __shared__ __attribute__((shared))
#define __constant__ __attribute__((constant))		#define __constant__ __attribute__((constant))

__device__ int read_tid() {		__device__ int read_tid() {

▲ Show 20 Lines • Show All 263 Lines • ▼ Show 20 Lines	__device__ void nvvm_atom(float fp, float f, int ip, int i, unsigned int uip, unsigned ui, long lp, long l,
// CHECK: call i32 @llvm.nvvm.atomic.load.inc.32.p0i32		// CHECK: call i32 @llvm.nvvm.atomic.load.inc.32.p0i32
__nvvm_atom_inc_gen_ui(uip, ui);		__nvvm_atom_inc_gen_ui(uip, ui);

// CHECK: call i32 @llvm.nvvm.atomic.load.dec.32.p0i32		// CHECK: call i32 @llvm.nvvm.atomic.load.dec.32.p0i32
__nvvm_atom_dec_gen_ui(uip, ui);		__nvvm_atom_dec_gen_ui(uip, ui);

// CHECK: ret		// CHECK: ret
}		}

		// CHECK-LABEL: nvvm_ldg
		__device__ void nvvm_ldg(const void *p) {
		// CHECK: call i8 @llvm.nvvm.ldg.global.i.i8.p0i8(i8* {{%[0-9]+}}, i32 1)
		// CHECK: call i8 @llvm.nvvm.ldg.global.i.i8.p0i8(i8* {{%[0-9]+}}, i32 1)
		__nvvm_ldg_c((const char *)p);
		__nvvm_ldg_uc((const unsigned char *)p);

		// CHECK: call i16 @llvm.nvvm.ldg.global.i.i16.p0i16(i16* {{%[0-9]+}}, i32 2)
		// CHECK: call i16 @llvm.nvvm.ldg.global.i.i16.p0i16(i16* {{%[0-9]+}}, i32 2)
		__nvvm_ldg_s((const short *)p);
		__nvvm_ldg_us((const unsigned short *)p);

		// CHECK: call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* {{%[0-9]+}}, i32 4)
		// CHECK: call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* {{%[0-9]+}}, i32 4)
		__nvvm_ldg_i((const int *)p);
		__nvvm_ldg_ui((const unsigned int *)p);

		// LP32: call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* {{%[0-9]+}}, i32 4)
		// LP32: call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* {{%[0-9]+}}, i32 4)
		// LP64: call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* {{%[0-9]+}}, i32 8)
		// LP64: call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* {{%[0-9]+}}, i32 8)
		__nvvm_ldg_l((const long *)p);
		__nvvm_ldg_ul((const unsigned long *)p);

		// CHECK: call float @llvm.nvvm.ldg.global.f.f32.p0f32(float* {{%[0-9]+}}, i32 4)
		__nvvm_ldg_f((const float *)p);
		// CHECK: call double @llvm.nvvm.ldg.global.f.f64.p0f64(double* {{%[0-9]+}}, i32 8)
		__nvvm_ldg_d((const double *)p);

		// In practice, the pointers we pass to __ldg will be aligned as appropriate
		// for the CUDA <type>N vector types (e.g. short4), which are not the same as
		// the LLVM vector types. However, each LLVM vector type has an alignment
		// less than or equal to its corresponding CUDA type, so we're OK.
		//
		// PTX Interoperability section 2.2: "For a vector with an even number of
		// elements, its alignment is set to number of elements times the alignment of
		// its member: n*alignof(t)."

		// CHECK: call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0v2i8(<2 x i8>* {{%[0-9]+}}, i32 2)
		// CHECK: call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0v2i8(<2 x i8>* {{%[0-9]+}}, i32 2)
		typedef char char2 __attribute__((ext_vector_type(2)));
		typedef unsigned char uchar2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_c2((const char2 *)p);
		__nvvm_ldg_uc2((const uchar2 *)p);

		// CHECK: call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0v4i8(<4 x i8>* {{%[0-9]+}}, i32 4)
		// CHECK: call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0v4i8(<4 x i8>* {{%[0-9]+}}, i32 4)
		typedef char char4 __attribute__((ext_vector_type(4)));
		typedef unsigned char uchar4 __attribute__((ext_vector_type(4)));
		__nvvm_ldg_c4((const char4 *)p);
		__nvvm_ldg_uc4((const uchar4 *)p);

		// CHECK: call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0v2i16(<2 x i16>* {{%[0-9]+}}, i32 4)
		// CHECK: call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0v2i16(<2 x i16>* {{%[0-9]+}}, i32 4)
		typedef short short2 __attribute__((ext_vector_type(2)));
		typedef unsigned short ushort2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_s2((const short2 *)p);
		__nvvm_ldg_us2((const ushort2 *)p);

		// CHECK: call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* {{%[0-9]+}}, i32 8)
		// CHECK: call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* {{%[0-9]+}}, i32 8)
		typedef short short4 __attribute__((ext_vector_type(4)));
		typedef unsigned short ushort4 __attribute__((ext_vector_type(4)));
		__nvvm_ldg_s4((const short4 *)p);
		__nvvm_ldg_us4((const ushort4 *)p);

		// CHECK: call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0v2i32(<2 x i32>* {{%[0-9]+}}, i32 8)
		// CHECK: call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0v2i32(<2 x i32>* {{%[0-9]+}}, i32 8)
		typedef int int2 __attribute__((ext_vector_type(2)));
		typedef unsigned int uint2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_i2((const int2 *)p);
		__nvvm_ldg_ui2((const uint2 *)p);

		// CHECK: call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0v4i32(<4 x i32>* {{%[0-9]+}}, i32 16)
		// CHECK: call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0v4i32(<4 x i32>* {{%[0-9]+}}, i32 16)
		typedef int int4 __attribute__((ext_vector_type(4)));
		typedef unsigned int uint4 __attribute__((ext_vector_type(4)));
		__nvvm_ldg_i4((const int4 *)p);
		__nvvm_ldg_ui4((const uint4 *)p);

		// CHECK: call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0v2i64(<2 x i64>* {{%[0-9]+}}, i32 16)
		// CHECK: call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0v2i64(<2 x i64>* {{%[0-9]+}}, i32 16)
		typedef long long longlong2 __attribute__((ext_vector_type(2)));
		typedef unsigned long long ulonglong2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_ll2((const longlong2 *)p);
		__nvvm_ldg_ull2((const ulonglong2 *)p);

		// CHECK: call <2 x float> @llvm.nvvm.ldg.global.f.v2f32.p0v2f32(<2 x float>* {{%[0-9]+}}, i32 8)
		typedef float float2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_f2((const float2 *)p);

		// CHECK: call <4 x float> @llvm.nvvm.ldg.global.f.v4f32.p0v4f32(<4 x float>* {{%[0-9]+}}, i32 16)
		typedef float float4 __attribute__((ext_vector_type(4)));
		__nvvm_ldg_f4((const float4 *)p);

		// CHECK: call <2 x double> @llvm.nvvm.ldg.global.f.v2f64.p0v2f64(<2 x double>* {{%[0-9]+}}, i32 16)
		typedef double double2 __attribute__((ext_vector_type(2)));
		__nvvm_ldg_d2((const double2 *)p);
		}
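
The comments in this test quote the PTX Interoperability rule that a vector with an even number of elements is aligned to n*alignof(t). The sketch below simply evaluates that rule on the host so the alignment arguments in the CHECK lines can be cross-checked. `ptx_vector_align` is a hypothetical helper, and it assumes the host's `alignof` for the scalar type matches the PTX target's, which is typical for `char`, `short`, `int`, and `float` but is not guaranteed in general.

```cpp
#include <cstddef>

// Alignment of an N-element vector of T under the quoted rule:
// n * alignof(t), stated for vectors with an even element count.
template <typename T, std::size_t N>
constexpr std::size_t ptx_vector_align() {
  static_assert(N % 2 == 0, "rule is quoted for even element counts only");
  return N * alignof(T);
}
```

Under those assumptions, `ptx_vector_align<short, 4>()` evaluates to 8 and `ptx_vector_align<int, 4>()` to 16, in line with the `i32 8` and `i32 16` alignment operands expected for the `<4 x i16>` and `<4 x i32>` loads in the test.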