This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Provide CUDA's vector types implemented using clang's vector extension.
Abandoned (Public)

Authored by tra on Mar 10 2016, 10:42 AM.

Details

Reviewers

jlebar
jingyue

Summary

This provides a substantial performance boost on some benchmarks
(~25% on SHOC's FFT) due to vectorized loads/stores.

Unfortunately, existing CUDA headers and user code occasionally
take pointers to vector fields, which clang does not allow, so
we can't use vector types by default.

While vectorized types help in some cases, they may lower
performance when the user reads or writes only part of a vector, because
Clang currently generates code that always loads/stores the complete vector.
They may also create data races if user code assumes that parts of the
same vector can be safely changed from different threads.

For now, control this feature via -DCUDA_VECTOR_TYPES and let the user
choose whether to use Clang's vectorized types or CUDA's
non-vectorized ones.
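
As a rough illustration of the -D switch described in the summary, the sketch below gates a type definition on the same macro. It is not the patch itself: demo_float4, make_demo_float4, sum4, and demo_usage are hypothetical names, and the __has_attribute fallback is an assumption added so the snippet also builds on compilers without Clang's ext_vector_type.

```cpp
#include <cassert>

// Fallback so the check below is valid on compilers lacking __has_attribute.
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

// Hypothetical stand-in for the header switch: with -DCUDA_VECTOR_TYPES on a
// compiler that supports ext_vector_type, use the vector extension; otherwise
// fall back to a plain struct, similar in spirit to CUDA's own definition.
#if defined(CUDA_VECTOR_TYPES) && __has_attribute(ext_vector_type)
typedef float demo_float4 __attribute__((ext_vector_type(4)));
#else
struct demo_float4 {
  float x, y, z, w;
};
#endif

// Both variants expose .x/.y/.z/.w, so code that touches fields by value
// compiles unchanged. Taking the address of a field (float *p = &v.x;) is the
// pattern the summary says clang rejects for vector types, so it is avoided.
inline demo_float4 make_demo_float4(float x, float y, float z, float w) {
  demo_float4 v;
  v.x = x;
  v.y = y;
  v.z = z;
  v.w = w;
  return v;
}

inline float sum4(demo_float4 v) { return v.x + v.y + v.z + v.w; }

// Small usage check; the inputs sum exactly in float arithmetic.
inline void demo_usage() {
  demo_float4 v = make_demo_float4(1.0f, 2.0f, 3.0f, 4.0f);
  assert(sum4(v) == 10.0f);
}
```

Building the same file with and without -DCUDA_VECTOR_TYPES selects one definition or the other, which is the user-facing choice the summary describes.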

Event Timeline

tra updated this revision to Diff 50301. Mar 10 2016, 10:42 AM

tra retitled this revision to "[CUDA] Provide CUDA's vector types implemented using clang's vector extension."

tra updated this object.

tra added reviewers: jlebar, jingyue.

tra added a subscriber: cfe-commits.

jlebar added inline comments. Mar 10 2016, 1:04 PM

lib/Headers/__clang_cuda_runtime_wrapper.h
72	Hm, this is a surprising (to me) way of controlling this feature. Can we use a -f flag instead? Even if all that -f flag does is define something (although in this case I'd suggest giving it a longer name so it's harder to collide with it). -fsomething would be more discoverable and canonical, I think, and would be easier to document.
lib/Headers/__clang_cuda_vector_types.h
76	I thought host/device attributes weren't needed on classes, only functions?
80	Nit: double underscore is a little weird here, and sort of needlessly competes with the language-reserved __ identifier namespace. Could we just use one underscore?
82	nvidia's version of this function is not explicit -- is this difference intentional?
84	This requires C++11 -- is that intentional?

Removed unneeded struct attributes.

tra added inline comments. Mar 10 2016, 1:30 PM

lib/Headers/__clang_cuda_runtime_wrapper.h
72	I want to tweak end-user's view of CUDA headers which has nothing to do with compiler, IMO.
lib/Headers/__clang_cuda_vector_types.h
76	Ugh. Removed.
80	I'm following the change Eric made to other headers in r260647 so that all arguments use __.
82	That's due to the way vector types are their base types with an attribute. Without the explicit dim3 variant, the compiler can't disambiguate between dim3(int=1,int=1,int=1) and dim3(uint3), since uint3 is int w/ an attribute.
84	It looks that way, but it does not need c++11. Vector literals allow brace initializers w/o requiring c++11.

jlebar added inline comments. Mar 10 2016, 1:44 PM

lib/Headers/__clang_cuda_runtime_wrapper.h
72	The compiler driver is responsible for enabling/disabling language extensions, and for choosing exactly which dialect we accept. It's also responsible for deciding which optimizations to use. This fits in all of those ways. Moreover, again, -Dfoo won't appear in --help, so, from a user's perspective, is undiscoverable. In the event that they do discover it somehow, there's no documentation attached to the flag. I am not aware of any switches built into clang that rely on -D. If you really want to do it this way, can you point me to prior art?
lib/Headers/__clang_cuda_vector_types.h
82	Huh, apparently we do want to use the reserved namespace? If so, this logic applies very strongly to a -D, which is going to be far more user-visible than the arg names here.
84	If I'm understanding correctly, you're saying that if we have struct dim3 { dim3(unsigned, unsigned, unsigned); dim3(uint3); }; void foo(dim3); that the call uint3 x; foo(x); is ambiguous, because it could call either dim3 constructor overload? That is bizarre, but if so, do we need the dim3(uint3) constructor at all?
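
For reference, the two-constructor shape discussed in this thread can be sketched with a plain struct standing in for uint3. This is an illustration under stated assumptions, not the patch or NVIDIA's header: demo_uint3, demo_dim3, volume, and demo_dim3_usage are hypothetical names. Because demo_uint3 here is a distinct class type rather than an attributed scalar, the converting constructor can stay non-explicit and a call with a demo_uint3 argument resolves without ambiguity.

```cpp
#include <cassert>

// Hypothetical struct-based stand-in for uint3 (a distinct class type).
struct demo_uint3 {
  unsigned x, y, z;
};

// Hypothetical stand-in for dim3 with both constructors from the discussion.
struct demo_dim3 {
  unsigned x, y, z;

  // Defaulted dimensions, mirroring dim3(unsigned = 1, unsigned = 1, unsigned = 1).
  demo_dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
      : x(x_), y(y_), z(z_) {}

  // Non-explicit converting constructor; only viable for demo_uint3 arguments.
  demo_dim3(demo_uint3 a) : x(a.x), y(a.y), z(a.z) {}

  // Conversion back to the three-component type.
  operator demo_uint3() const {
    demo_uint3 r = {x, y, z};
    return r;
  }
};

inline unsigned volume(demo_dim3 d) { return d.x * d.y * d.z; }

inline void demo_dim3_usage() {
  demo_uint3 u = {2, 3, 4};
  assert(volume(u) == 24);           // uses demo_dim3(demo_uint3)
  assert(volume(5) == 5);            // uses demo_dim3(unsigned, 1, 1)
  assert(volume(demo_dim3()) == 1);  // all defaults
  demo_uint3 back = demo_dim3(7, 8, 9);  // uses operator demo_uint3
  assert(back.x == 7 && back.y == 8 && back.z == 9);
}
```

With the vector-extension typedefs, the thread reports that the scalar and vector constructors could no longer be told apart this way, which is why the patch marked the uint3 constructor explicit.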

Ugh. Found more problems with using vector types in C++. Abandoning the idea.

In D18051#372490, @tra wrote:

Ugh. Found more problems with using vector types in C++. Abandoning the idea.

I'm curious, what problems?

There were ambiguities in overload resolution between vector types and
their base types. I.e. if I had

void foo(int);
void foo(int3);

then the call foo(3) was ambiguous.
It wasn't clear whether this extension is supposed to work in C++ at all.
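
For contrast, here is a minimal sketch of the same overload set when the three-component type is a plain struct, as CUDA's own non-vectorized headers define it. The names demo_int3, foo, and demo_overload_usage are hypothetical. A struct has no implicit conversion from int, so only foo(int) is viable for foo(3); the ambiguity reported above is specific to the vector-extension typedefs as they behaved at the time of this review.

```cpp
#include <cassert>

// Hypothetical struct-based stand-in for int3.
struct demo_int3 {
  int x, y, z;
};

// Each overload returns a tag so the chosen overload can be observed.
inline int foo(int) { return 1; }
inline int foo(demo_int3) { return 3; }

inline void demo_overload_usage() {
  demo_int3 v = {1, 2, 3};
  assert(foo(3) == 1);  // exact match on int; demo_int3 is not viable
  assert(foo(v) == 3);  // exact match on demo_int3
}
```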

Revision Contents

Path                                          Size
lib/Headers/CMakeLists.txt                    1 line
lib/Headers/__clang_cuda_runtime_wrapper.h    9 lines
lib/Headers/__clang_cuda_vector_types.h       87 lines
lib/Headers/cuda_builtin_vars.h               2 lines

Diff 50301

lib/Headers/CMakeLists.txt

set(files
avx512vbmiintrin.h		avx512vbmiintrin.h
avx512vbmivlintrin.h		avx512vbmivlintrin.h
pkuintrin.h		pkuintrin.h
avxintrin.h		avxintrin.h
bmi2intrin.h		bmi2intrin.h
bmiintrin.h		bmiintrin.h
__clang_cuda_cmath.h		__clang_cuda_cmath.h
__clang_cuda_runtime_wrapper.h		__clang_cuda_runtime_wrapper.h
		__clang_cuda_vector_types.h
cpuid.h		cpuid.h
cuda_builtin_vars.h		cuda_builtin_vars.h
emmintrin.h		emmintrin.h
f16cintrin.h		f16cintrin.h
float.h		float.h
fma4intrin.h		fma4intrin.h
fmaintrin.h		fmaintrin.h
fxsrintrin.h		fxsrintrin.h

lib/Headers/__clang_cuda_runtime_wrapper.h

	#endif			#endif

	// Make largest subset of device functions available during host			// Make largest subset of device functions available during host
	// compilation -- SM_35 for the time being.			// compilation -- SM_35 for the time being.
	#ifndef __CUDA_ARCH__			#ifndef __CUDA_ARCH__
	#define __CUDA_ARCH__ 350			#define __CUDA_ARCH__ 350
	#endif			#endif

				#if defined(CUDA_VECTOR_TYPES)
				// Prevent inclusion of CUDA's vector_types.h
				#define __VECTOR_TYPES_H__
				// .. and include clang's own types for them instead.
				#include "__clang_cuda_vector_types.h"
				#endif

	#include "cuda_builtin_vars.h"			#include "cuda_builtin_vars.h"

	// No need for device_launch_parameters.h as cuda_builtin_vars.h above			// No need for device_launch_parameters.h as cuda_builtin_vars.h above
	// has taken care of builtin variables declared in the file.			// has taken care of builtin variables declared in the file.
	#define __DEVICE_LAUNCH_PARAMETERS_H__			#define __DEVICE_LAUNCH_PARAMETERS_H__

	// {math,device}_functions.h only have declarations of the			// {math,device}_functions.h only have declarations of the
	// functions. We don't need them as we're going to pull in their			// functions. We don't need them as we're going to pull in their
	...			...
	// sm_30_intrinsics.h has declarations that use default argument, so			// sm_30_intrinsics.h has declarations that use default argument, so
	// we have to include it and it will in turn include .hpp			// we have to include it and it will in turn include .hpp
	#include "sm_30_intrinsics.h"			#include "sm_30_intrinsics.h"
	#include "sm_32_intrinsics.hpp"			#include "sm_32_intrinsics.hpp"
	#undef __MATH_FUNCTIONS_HPP__			#undef __MATH_FUNCTIONS_HPP__
	#include "math_functions.hpp"			#include "math_functions.hpp"
	#pragma pop_macro("__host__")			#pragma pop_macro("__host__")

				#if !defined(CUDA_VECTOR_TYPES)
	#include "texture_indirect_functions.h"			#include "texture_indirect_functions.h"
				#endif

	// Restore state of __CUDA_ARCH__ and __THROW we had on entry.			// Restore state of __CUDA_ARCH__ and __THROW we had on entry.
	#pragma pop_macro("__CUDA_ARCH__")			#pragma pop_macro("__CUDA_ARCH__")
	#pragma pop_macro("__THROW")			#pragma pop_macro("__THROW")

	// Set up compiler macros expected to be seen during compilation.			// Set up compiler macros expected to be seen during compilation.
	#undef __CUDABE__			#undef __CUDABE__
	#define __CUDACC__			#define __CUDACC__

lib/Headers/__clang_cuda_vector_types.h

This file was added.

				/*===---- __clang_cuda_vector_types.h - CUDA vector types -----------------===
				*
				* Permission is hereby granted, free of charge, to any person obtaining a copy
				* of this software and associated documentation files (the "Software"), to deal
				* in the Software without restriction, including without limitation the rights
				* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
				* copies of the Software, and to permit persons to whom the Software is
				* furnished to do so, subject to the following conditions:
				*
				* The above copyright notice and this permission notice shall be included in
				* all copies or substantial portions of the Software.
				*
				* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
				* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
				* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
				* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
				* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
				* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
				* THE SOFTWARE.
				*
				*===-----------------------------------------------------------------------===
				*/

				#ifndef __CLANG_CUDA_VECTOR_TYPES_H__
				#define __CLANG_CUDA_VECTOR_TYPES_H__

				typedef char char1 __attribute__((ext_vector_type(1)));
				typedef char char2 __attribute__((ext_vector_type(2)));
				typedef char char3 __attribute__((ext_vector_type(3)));
				typedef char char4 __attribute__((ext_vector_type(4)));
				typedef unsigned char uchar1 __attribute__((ext_vector_type(1)));
				typedef unsigned char uchar2 __attribute__((ext_vector_type(2)));
				typedef unsigned char uchar3 __attribute__((ext_vector_type(3)));
				typedef unsigned char uchar4 __attribute__((ext_vector_type(4)));
				typedef short short1 __attribute__((ext_vector_type(1)));
				typedef short short2 __attribute__((ext_vector_type(2)));
				typedef short short3 __attribute__((ext_vector_type(3)));
				typedef short short4 __attribute__((ext_vector_type(4)));
				typedef unsigned short ushort1 __attribute__((ext_vector_type(1)));
				typedef unsigned short ushort2 __attribute__((ext_vector_type(2)));
				typedef unsigned short ushort3 __attribute__((ext_vector_type(3)));
				typedef unsigned short ushort4 __attribute__((ext_vector_type(4)));
				typedef int int1 __attribute__((ext_vector_type(1)));
				typedef int int2 __attribute__((ext_vector_type(2)));
				typedef int int3 __attribute__((ext_vector_type(3)));
				typedef int int4 __attribute__((ext_vector_type(4)));
				typedef unsigned int uint1 __attribute__((ext_vector_type(1)));
				typedef unsigned int uint2 __attribute__((ext_vector_type(2)));
				typedef unsigned int uint3 __attribute__((ext_vector_type(3)));
				typedef unsigned int uint4 __attribute__((ext_vector_type(4)));
				typedef long long1 __attribute__((ext_vector_type(1)));
				typedef long long2 __attribute__((ext_vector_type(2)));
				typedef long long3 __attribute__((ext_vector_type(3)));
				typedef long long4 __attribute__((ext_vector_type(4)));
				typedef unsigned long ulong1 __attribute__((ext_vector_type(1)));
				typedef unsigned long ulong2 __attribute__((ext_vector_type(2)));
				typedef unsigned long ulong3 __attribute__((ext_vector_type(3)));
				typedef unsigned long ulong4 __attribute__((ext_vector_type(4)));
				typedef long long longlong1 __attribute__((ext_vector_type(1)));
				typedef long long longlong2 __attribute__((ext_vector_type(2)));
				typedef long long longlong3 __attribute__((ext_vector_type(3)));
				typedef long long longlong4 __attribute__((ext_vector_type(4)));
				typedef unsigned long long ulonglong1 __attribute__((ext_vector_type(1)));
				typedef unsigned long long ulonglong2 __attribute__((ext_vector_type(2)));
				typedef unsigned long long ulonglong3 __attribute__((ext_vector_type(3)));
				typedef unsigned long long ulonglong4 __attribute__((ext_vector_type(4)));
				typedef float float1 __attribute__((ext_vector_type(1)));
				typedef float float2 __attribute__((ext_vector_type(2)));
				typedef float float3 __attribute__((ext_vector_type(3)));
				typedef float float4 __attribute__((ext_vector_type(4)));
				typedef double double1 __attribute__((ext_vector_type(1)));
				typedef double double2 __attribute__((ext_vector_type(2)));
				typedef double double3 __attribute__((ext_vector_type(3)));
				typedef double double4 __attribute__((ext_vector_type(4)));

				__attribute__((host,device))
				struct dim3 {
				uint x, y, z;
				__attribute__((host, device))
				dim3(unsigned __x = 1, unsigned __y = 1, unsigned __z = 1)
				: x(__x), y(__y), z(__z) {}
				__attribute__((host, device)) explicit dim3(uint3 __a)
				: x(__a.x), y(__a.y), z(__a.z) {}
				__attribute__((host, device)) operator uint3(void) { return {x, y, z}; }
				};

				#endif

lib/Headers/cuda_builtin_vars.h

	* THE SOFTWARE.			* THE SOFTWARE.
	*			*
	*===-----------------------------------------------------------------------===			*===-----------------------------------------------------------------------===
	*/			*/

	#ifndef __CUDA_BUILTIN_VARS_H			#ifndef __CUDA_BUILTIN_VARS_H
	#define __CUDA_BUILTIN_VARS_H			#define __CUDA_BUILTIN_VARS_H

				#if !defined(CUDA_VECTOR_TYPES)
	// Forward declares from vector_types.h.			// Forward declares from vector_types.h.
	struct uint3;			struct uint3;
	struct dim3;			struct dim3;
				#endif

	// The file implements built-in CUDA variables using __declspec(property).			// The file implements built-in CUDA variables using __declspec(property).
	// https://msdn.microsoft.com/en-us/library/yhfk0thd.aspx			// https://msdn.microsoft.com/en-us/library/yhfk0thd.aspx
	// All read accesses of built-in variable fields get converted into calls to a			// All read accesses of built-in variable fields get converted into calls to a
	// getter function which in turn calls the appropriate builtin to fetch the			// getter function which in turn calls the appropriate builtin to fetch the
	// value.			// value.
	//			//
	// Example:			// Example: