This is an archive of the discontinued LLVM Phabricator instance.

With this patch, the following code can be compiled into the same PTX that NVCC generates. See https://godbolt.org/z/q6EYE1q1o for the difference between the NVCC output and the current Clang output.

Harbormaster completed remote builds in B129471: Diff 380580. Oct 18 2021, 9:48 PM

LGTM in general.

clang/include/clang/Basic/BuiltinsNVPTX.def
691–694

CUDA appears to be using `__nv_isGlobal_impl` for the address space (AS) predicates. Perhaps we want to add those, too, forwarding them to the `__nvvm_...` implementations above. I've already added a few other AS-related `__nv_*` builtins in `lib/Headers/__clang_cuda_intrinsics.h`.

This revision is now accepted and ready to land. Oct 19 2021, 11:27 AM

Closed by commit rG6fe902daf931: [cuda] Add address space predicate funuctions. (authored by hliao). Oct 19 2021, 1:20 PM

This revision was automatically updated to reflect the committed changes.

hliao added a commit: rG6fe902daf931: [cuda] Add address space predicate funuctions.

hliao added inline comments. Oct 19 2021, 1:22 PM

clang/include/clang/Basic/BuiltinsNVPTX.def

691–694

`__nv_isGlobal_impl` is exposed as an official interface. In fact, in CUDA SDK 10.0 or earlier, `__isGlobal` was implemented directly as inline asm. If possible, we should avoid defining unofficial or undocumented interfaces. `__nv_isGlobal_impl` was introduced in CUDA SDK 10.1, but there is no documentation on it.

// This function returns 1 if generic address "ptr" is in global memory space.
// It returns 0 if "ptr" is in shared, local or constant memory space.
__SM_20_INTRINSICS_DECL__ unsigned int __isGlobal(const void *ptr)
{
  unsigned int ret;
  asm volatile ("{ \n\t"
                "    .reg .pred p; \n\t"
                "    isspacep.global p, %1; \n\t"
                "    selp.u32 %0, 1, 0, p;  \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__) || defined(__CUDACC_RTC__)
                "} \n\t" : "=r"(ret) : "l"(ptr));
#else
                "} \n\t" : "=r"(ret) : "r"(ptr));
#endif

  return ret;
}

hliao added inline comments. Oct 19 2021, 1:23 PM

clang/include/clang/Basic/BuiltinsNVPTX.def

691–694

Typo: `__nv_isGlobal_impl` is *not* exposed as an official interface.

tra added inline comments. Oct 19 2021, 1:35 PM

clang/include/clang/Basic/BuiltinsNVPTX.def
691–694

> `__nv_isGlobal_impl` is not exposed as an official interface.
> If possible, we should avoid defining unofficial or undocumented interfaces. `__nv_isGlobal_impl` was introduced in CUDA SDK 10.1, but there is no documentation on it.

In general, I agree. I just wish NVIDIA would stop using undocumented APIs in the public headers they ship. By necessity, clang either has to rely on preprocessor hacks to edit out uncompilable code, or guess what the undocumented APIs do and implement them. In this case I do not think `__nv_is*_impl` are used anywhere other than in the functions you have renamed, so we're fine without them. They would be needed if the patch didn't have to rename `__isGlobal` and friends.

hliao added inline comments. Oct 19 2021, 3:30 PM

clang/include/clang/Basic/BuiltinsNVPTX.def
691–694

> They would be needed if the patch didn't have to rename `__isGlobal` and friends.

Another motivation for redefining them is that we need to add `const` (or `pure`) attributes to them.
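
To make the `const`/`pure` point concrete, here is a host-side C++ sketch, assuming a GCC- or Clang-compatible compiler. A `const` function's result depends only on its argument values, while a `pure` function may also read memory but has no side effects. Clang discards a `__builtin_assume` whose argument might have side effects (warning under `-Wassume`), and a call to a function carrying neither attribute counts as such; that matches the reason given in the patch's header comment for redefining `__isGlobal` and friends. The names `is_even_const`, `count_nonzero_pure`, `halve_even`, and the `ASSUME` macro are hypothetical and not part of the patch.

```cpp
#include <cstddef>

// `const`: the result depends only on the argument values; no memory is read.
__attribute__((const)) int is_even_const(int x) { return (x % 2) == 0; }

// `pure`: may read memory (here through `p`) but has no side effects.
__attribute__((pure)) std::size_t count_nonzero_pure(const int *p, std::size_t n) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < n; ++i) count += (p[i] != 0);
  return count;
}

// Hypothetical helper macro: forwards to Clang's __builtin_assume when
// available and expands to a no-op elsewhere. Clang discards the assumption
// if `cond` may have side effects, e.g. a call to a function that is neither
// `const` nor `pure`.
#if defined(__clang__)
#define ASSUME(cond) __builtin_assume(cond)
#else
#define ASSUME(cond) ((void)0)
#endif

// Precondition: x is even. Because is_even_const is `const`, Clang keeps the
// assumption and may use it when optimizing the division.
int halve_even(int x) {
  ASSUME(is_even_const(x));
  return x / 2;
}
```

Under GCC the `ASSUME` fallback expands to nothing, so the sketch still compiles but gives the optimizer no hint.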

Revision Contents

Path                                               Size
clang/include/clang/Basic/BuiltinsNVPTX.def        6 lines
clang/lib/Headers/__clang_cuda_runtime_wrapper.h   31 lines

Diff 380768

clang/include/clang/Basic/BuiltinsNVPTX.def

… (681 unchanged lines not shown)
 BUILTIN(__nvvm_ldg_ui2, "E2UiE2UiC*", "")
 BUILTIN(__nvvm_ldg_ui4, "E4UiE4UiC*", "")
 BUILTIN(__nvvm_ldg_ull2, "E2ULLiE2ULLiC*", "")
 
 BUILTIN(__nvvm_ldg_f2, "E2fE2fC*", "")
 BUILTIN(__nvvm_ldg_f4, "E4fE4fC*", "")
 BUILTIN(__nvvm_ldg_d2, "E2dE2dC*", "")
 
+// Address space predicates.
+BUILTIN(__nvvm_isspacep_const, "bvC*", "nc")
+BUILTIN(__nvvm_isspacep_global, "bvC*", "nc")
+BUILTIN(__nvvm_isspacep_local, "bvC*", "nc")
+BUILTIN(__nvvm_isspacep_shared, "bvC*", "nc")
+
 // Builtins to support WMMA instructions on sm_70
 TARGET_BUILTIN(__hmma_m16n16k16_ld_a, "viiCUiIi", "", AND(SM_70,PTX60))
 TARGET_BUILTIN(__hmma_m16n16k16_ld_b, "viiCUiIi", "", AND(SM_70,PTX60))
 TARGET_BUILTIN(__hmma_m16n16k16_ld_c_f16, "viiCUiIi", "", AND(SM_70,PTX60))
 TARGET_BUILTIN(__hmma_m16n16k16_ld_c_f32, "vffCUiIi", "", AND(SM_70,PTX60))
 TARGET_BUILTIN(__hmma_m16n16k16_st_c_f16, "viiUiIi", "", AND(SM_70,PTX60))
 TARGET_BUILTIN(__hmma_m16n16k16_st_c_f32, "vffUiIi", "", AND(SM_70,PTX60))
… (126 unchanged lines not shown)
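
For readers unfamiliar with the `.def` encoding: per the legend at the top of Clang's `Builtins.def`, the prototype string `"bvC*"` reads as a `bool` return ('b') followed by one argument, a pointer ('*') to `const` ('C') `void` ('v'), and the attribute string `"nc"` marks the builtin `nothrow` ('n') and `const` ('c'). The sketch below spells out the equivalent C++ declaration on the host, assuming a GCC- or Clang-compatible compiler. `hypothetical_is_aligned4` is an invented stand-in (it checks 4-byte alignment, not an address space), since the real `__nvvm_isspacep_*` builtins only exist when compiling for NVPTX.

```cpp
#include <cstdint>

// Same shape as BUILTIN(__nvvm_isspacep_global, "bvC*", "nc"):
//   "bvC*" -> bool f(const void *)
//   "nc"   -> nothrow + const
// Hypothetical stand-in: it inspects only the pointer *value*, never the
// pointee, which is what makes the `const` attribute legitimate here.
__attribute__((nothrow, const))
bool hypothetical_is_aligned4(const void *p) {
  return (reinterpret_cast<std::uintptr_t>(p) % 4) == 0;
}
```

The address space predicates have the same property: they classify a generic address without dereferencing it, so marking them `const` is sound.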

clang/lib/Headers/__clang_cuda_runtime_wrapper.h

… (265 unchanged lines not shown)
 #include "crt/device_double_functions.hpp"
 #else
 #include "device_functions.hpp"
 #define __CUDABE__
 #include "device_double_functions.h"
 #undef __CUDABE__
 #endif
 #include "sm_20_atomic_functions.hpp"
+// Predicate functions used in `__builtin_assume` need to have no side effect.
+// However, sm_20_intrinsics.hpp doesn't define them with neither pure nor
+// const attribute. Rename definitions from sm_20_intrinsics.hpp and re-define
+// them as pure ones.
+#pragma push_macro("__isGlobal")
+#pragma push_macro("__isShared")
+#pragma push_macro("__isConstant")
+#pragma push_macro("__isLocal")
+#define __isGlobal __ignored_cuda___isGlobal
+#define __isShared __ignored_cuda___isShared
+#define __isConstant __ignored_cuda___isConstant
+#define __isLocal __ignored_cuda___isLocal
 #include "sm_20_intrinsics.hpp"
+#pragma pop_macro("__isGlobal")
+#pragma pop_macro("__isShared")
+#pragma pop_macro("__isConstant")
+#pragma pop_macro("__isLocal")
+#pragma push_macro("__DEVICE__")
+#define __DEVICE__ static __device__ __forceinline__ __attribute__((const))
+__DEVICE__ unsigned int __isGlobal(const void *p) {
+  return __nvvm_isspacep_global(p);
+}
+__DEVICE__ unsigned int __isShared(const void *p) {
+  return __nvvm_isspacep_shared(p);
+}
+__DEVICE__ unsigned int __isConstant(const void *p) {
+  return __nvvm_isspacep_const(p);
+}
+__DEVICE__ unsigned int __isLocal(const void *p) {
+  return __nvvm_isspacep_local(p);
+}
+#pragma pop_macro("__DEVICE__")
 #include "sm_32_atomic_functions.hpp"
 
 // Don't include sm_30_intrinsics.h and sm_32_intrinsics.h. These define the
 // __shfl and __ldg intrinsics using inline (volatile) asm, but we want to
 // define them using builtins so that the optimizer can reason about and across
 // these instructions. In particular, using intrinsics for ldg gets us the
 // [addr+imm] addressing mode, which, although it doesn't actually exist in the
 // hardware, seems to generate faster machine code because ptxas can more easily
… (190 unchanged lines not shown)
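
The rename-and-redefine trick in the wrapper header relies only on ordinary macro substitution plus `#pragma push_macro`/`pop_macro`. The self-contained sketch below reproduces it on the host, assuming a GCC- or Clang-compatible compiler; the "vendor header" is simulated inline, and the names `vendor_is_even` and `ignored_vendor_is_even` are hypothetical. While the macro is active, the vendor definition is compiled under the throwaway name; after `pop_macro` restores the previous (undefined) macro state, the wrapper can define its own version under the original name with the attribute it needs.

```cpp
// Save the current macro state of the name we are about to hijack (here it
// is not a macro, so pop_macro will restore it to "undefined").
#pragma push_macro("vendor_is_even")
#define vendor_is_even ignored_vendor_is_even

// --- stand-in for the vendor header (the role sm_20_intrinsics.hpp plays in
// the real patch): this definition is compiled as ignored_vendor_is_even ---
inline unsigned int vendor_is_even(int x) { return (x % 2) == 0; }
// --- end of stand-in vendor header ---

#pragma pop_macro("vendor_is_even")

// The original name is free again; redefine it with the attribute the
// wrapper wants (here `const`), mirroring the patch's __DEVICE__ redefinition.
__attribute__((const)) inline unsigned int vendor_is_even(int x) {
  return (x & 1) == 0;
}
```

In the real header, the renamed `__ignored_cuda___is*` definitions are simply left unused, and the `const`-attributed `__isGlobal` and friends take over the public names.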

[cuda] Add address space predicate funuctions. (Closed, Public)
