This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Implement experimental support for texture lookups.
ClosedPublic

Authored by tra on Sep 20 2021, 11:27 AM.

Details

Summary

The patch implements support for texture lookups (mostly) in a header file.

The patch has been tested on a source file with all possible combinations of argument types supported by the CUDA headers;
the file was compiled and the generated instructions and their parameters were verified to match the code generated by NVCC.
Unfortunately, compiling texture code requires the CUDA headers, so it can't be tested in clang itself.
The test will need to be added to the test-suite later.

While the generated code compiles and appears to match NVCC's output, I do not have any texture-using code with which I could test the correctness of the implementation.

The gory details of the implementation follow.


The user-facing texture lookup API relies on NVCC's __nv_tex_surf_handler builtin, which is actually a set of overloads.
The catch is that it's overloaded not only by the argument types, but also by the value of the first argument.

Implementing it in the compiler itself would be rather messy as there are a lot of texture lookup variants.

Implementing texture lookups in C++ is somewhat more maintainable.
If we could use a string literal as a template parameter, the implementation could be done completely in the headers.
Unfortunately, literal classes as template parameters are only available in C++20.

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.
As an alternative, this patch introduces the __nvvm_texture_op builtin, which maps known texture operations to an integer that is then used to parametrize texture operations.
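
A minimal illustration of the dispatch idea (hypothetical names and tag value; cudaTextureObject_t and float4 come from the CUDA headers): once the op string has been reduced to an integer at compile time, that integer can select a template specialization, so the dispatch costs nothing at run time.

// No generic definition: an unknown op tag fails to compile.
template <int __op> struct __tex_fetch_v4;

// Hypothetical specialization for one texture operation.
template <> struct __tex_fetch_v4<42 /* tag of some 2D lookup op */> {
  __device__ static float4 __run(cudaTextureObject_t __obj, float __x, float __y);
};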

A lot of texture operations are fairly uniform, with the differences only in the instruction suffix.
Unfortunately, inline assembly requires its input to be a string literal, so we cannot rely on templates to generate it and have to resort to the preprocessor to do the job.
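
To make that concrete, here is a rough sketch of the kind of macro involved (the macro name, operand signature, and constraints are illustrative, not the actual __IMPL* macros from the patch). The PTX mnemonic has to end up as a single string literal, which is why the suffix is pasted in by the preprocessor rather than supplied by a template parameter:

#define __TEX_FETCH_2D(__suffix, __ctype, __reg)                              \
  __device__ void __fetch_2d_##__suffix(cudaTextureObject_t __obj, float __x, \
                                        float __y, __ctype *__out) {          \
    asm("tex.2d.v4." #__suffix ".f32 {%0, %1, %2, %3}, [%4, {%5, %6}];"       \
        : "=" __reg(__out[0]), "=" __reg(__out[1]), "=" __reg(__out[2]),      \
          "=" __reg(__out[3])                                                 \
        : "l"(__obj), "f"(__x), "f"(__y));                                    \
  }

__TEX_FETCH_2D(f32, float, "f")        // tex.2d.v4.f32.f32
__TEX_FETCH_2D(s32, int, "r")          // tex.2d.v4.s32.f32
__TEX_FETCH_2D(u32, unsigned int, "r") // tex.2d.v4.u32.f32

Each invocation stamps out one variant whose only differences are the instruction suffix and the register constraint, which is exactly the repetition the __IMPL* macros factor out.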

Another quirk is that historically there were two ways to refer to a texture.
The newer API uses cudaTextureObject_t, which is an opaque scalar value.
Older APIs used an object of texture<> type which was magically converted to an opaque texture handle (essentially a cudaTextureObject_t).
There's no good way to do this conversion explicitly, so without some trickery each texture lookup would have to be implemented twice, once for each way of referring to a texture.
However, we can cheat a bit by introducing a dummy inline assembly.
Nominally it accepts texture<> as input, but the compiler converts it to a cudaTextureObject_t, so the generated assembly just returns the correct handle.
This allows both reference styles to use the same implementation.
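
A rough sketch of that trick, with hypothetical names (the helper in the actual header differs):

// The asm body is empty; its only job is to make the compiler apply its
// implicit texture<> -> cudaTextureObject_t conversion for the "l" operand.
template <class __TexRefT>
__device__ cudaTextureObject_t __tex_handle(__TexRefT __ref) {
  cudaTextureObject_t __handle;
  asm("" : "=l"(__handle) : "l"(__ref));
  return __handle;
}

// Lookups written against the newer API already carry the opaque handle.
__device__ inline cudaTextureObject_t __tex_handle(cudaTextureObject_t __obj) {
  return __obj;
}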

Overall code structure (a rough sketch follows the list):

  • struct __FT; // maps a texture data type to the 4-element texture fetch result type.
  • class __tex_fetch_v4<__op>; // implements run methods for specific texture data types.
  • class __convert<DstT,SrcT>; // converts the result of __tex_fetch_v4 into the expected return type (usually a smaller slice of the 4-element fetch result).
  • __tex_fetch<__op,...>(); // calls the appropriate __convert(__tex_fetch_v4()) variants.
  • #define __nv_tex_surf_handler(__op, __ptr, ...) // calls the appropriate __tex_fetch<>.
  • __IMPL* macros do the boilerplate generation of __tex_fetch_v4 variants.
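
A rough end-to-end sketch of how these pieces chain together (simplified, hypothetical signatures; the actual header's dispatch key and argument handling differ):

template <class __DataT> struct __FT;                    // e.g. __FT<float>::type == float4
template <int __op> struct __tex_fetch_v4;               // per-op inline-asm __run() methods
template <class __DstT, class __SrcT> struct __convert;  // slices the 4-element fetch result

template <int __op, class __DataT, class... __Args>
__device__ void __tex_fetch(__DataT *__ptr, cudaTextureObject_t __obj,
                            __Args... __args) {
  typedef typename __FT<__DataT>::type __FetchT;
  *__ptr = __convert<__DataT, __FetchT>::__run(
      __tex_fetch_v4<__op>::template __run<__FetchT>(__obj, __args...));
}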

Diff Detail

Event Timeline

tra created this revision.Sep 20 2021, 11:27 AM
tra requested review of this revision.Sep 20 2021, 11:27 AM
Herald added a project: Restricted Project.Sep 20 2021, 11:27 AM
tra added a comment.Sep 20 2021, 4:54 PM

Here's an expanded and formatted version of the header: https://gist.github.com/Artem-B/ec4290809650f5092d61d6dafa6b0131
It may help to see what's going on.

tra updated this revision to Diff 373747.Sep 20 2021, 5:00 PM
tra edited the summary of this revision. (Show Details)

cosmetic cleanups.

hliao accepted this revision.Sep 21 2021, 1:16 PM

Cool! I like the idea of *compile-time* dispatch. LGTM except minor warnings from clang-tidy. Could you fix them before committing this change?

This revision is now accepted and ready to land.Sep 21 2021, 1:16 PM
tra updated this revision to Diff 374028.Sep 21 2021, 2:00 PM

Minor cleanups

tra updated this revision to Diff 374034.Sep 21 2021, 2:31 PM

Undo useless NOLINT

tra added a comment.Sep 21 2021, 2:31 PM

Most of the clang-tidy warnings are irrelevant -- it tries to parse the header all by itself, without the CUDA headers.
It also ignores NOLINTNEXTLINE(clang-diagnostic-error) which was intended to suppress the warning triggered by #error.

The only useful one was in SemaChecking.cpp -- fixed now.

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.

I guess I'm confused... Is the parameter value that we're "overloading" on usually/always a constant? In that case, there's no overhead with runtime dispatch. Or, is it not a constant? In which case, how does nvcc generate a single instruction for this idiom at all?

But then I see switch statements in the code, so now I'm extra confused. :)

Overall, I am unsure of why we need all of this magic. We can rely on LLVM to optimize away constant integer comparisons, and also even comparisons between string literals.

What specifically would be inefficient if this were a series of "real" overloaded functions, with none of the macros, templates, or builtins? (Assuming efficiency is the concern here?)

clang/lib/AST/ExprConstant.cpp
11097 ↗(On Diff #374034)

Write a comment explaining what this function does?

(It seems to...translate a string into an integer? If so, to me, it's strange that it uses a sorted list for this because...what if I add another function? Won't that mess up all the numbers? Anyway, to be clarified in the comment.)

Now that I read more, I see that you don't care about this being a stable mapping etc etc...

I don't really get why this has to be a builtin at all, though. If it's always a string literal, a simple strcmp will do the job, LLVM can optimize this? And I'm almost sure you can assert that the char* is always a string literal, so you can guarantee that it's always optimized away.

11098 ↗(On Diff #374034)

stuuported

11209 ↗(On Diff #374034)

how do we know the arg is a string constant? Looking at the builtin def it doesn't seem that we enforce it there.

clang/lib/Headers/__clang_cuda_texture_intrinsics.h
13

is __compilation intentional? (Maybe search-and-replace bug?)

42

what are you trying to accomplish with an anon ns inside a header?

42

I know you wrote it in the commit message, but this file could really use comments, otherwise I'm afraid you are going to be the only human being on the planet who can edit this...

For starters, it seems that the purpose of this file is to define the __nv_tex_surf_handler "function" -- is that right?

58–59

I have no idea what bt and ft are supposed to stand for. "fetch type" and ...? But __FT stands for "fundamental type" per the comment?

Oh, I found it later, "base type".

I'm all for brevity, but would __base_ty and __fetch_ty be too long?

91

There are only a limited number of these. Could we assert that __T is one of the expected vector types, just for readability and maybe to help the next person who tries to edit this?

92

this is c++11-only. Which, you know what, fine by me. But might be worth an explicit #error at least?

93

I think this is also C++11 syntax

tra added a comment.Sep 22 2021, 10:30 AM

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.

I guess I'm confused... Is the parameter value that we're "overloading" on usually/always a constant? In that case, there's no overhead with runtime dispatch. Or, is it not a constant? In which case, how does nvcc generate a single instruction for this idiom at all?

It's a string literal. And you're actually right, clang does manage to optimize strcmp with a known value. https://godbolt.org/z/h351hfsMf
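
Roughly the kind of thing that godbolt link demonstrates (an illustrative snippet, not code from the patch; the op strings are made up):

#include <string.h>

int dispatch(const char *op) {
  if (!strcmp(op, "__tex2D_v2"))
    return 1;
  if (!strcmp(op, "__tex3D_v2"))
    return 2;
  return 0;
}

// With a literal argument -- dispatch("__tex2D_v2") -- clang folds the strcmp
// calls away and the whole function reduces to a constant.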

However, it's only part of the problem. Depending on which particular operation is used, the arguments vary, too. I still need templates that are effectively parameterized by that string-literal argument, and I can't easily do that until C++20.
I'd need to push strcmp-based runtime dispatch down to the implementation of the texture lookups with the same operand signature. That's harder to generalize, as I'd have to implement string-based dispatch for quite a few subsets of the operations -- basically for each variant of the Cartesian product of {dimensionality, Lod, Level, Sparse}.

Another downside is that the string comparison code will result in functions being much larger than necessary. Probably not a big thing overall, but why add overhead that would be paid for by all users and which does not buy us anything? Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

But then I see switch statements in the code, so now I'm extra confused. :)

That switch is for a special case of texture lookup which may result in one of four texture instruction variants. All others map 1:1.

Overall, I am unsure of why we need all of this magic. We can rely on LLVM to optimize away constant integer comparisons, and also even comparisons between string literals.

It makes it possible to use a string literal to parameterize templates, which allows generating variants of __nv_tex_surf_handler in a relatively concise way.

What specifically would be inefficient if this were a series of "real" overloaded functions, with none of the macros, templates, or builtins? (Assuming efficiency is the concern here?)

It's both efficiency and avoidance of typos in repetitive, nearly identical code.
There are ~2500 high-level texture lookup variants. They end up calling about 600 different __nv_tex_surf_handler overloads which, in turn, generate ~70 unique inline assembly variants.
The current code structure reflects that hierarchy. This is essentially why the parameterization by operation name happens early, instead of the name being used as a key for runtime dispatch at the end.

clang/lib/AST/ExprConstant.cpp
11097 ↗(On Diff #374034)

Yes, it's just a 1:1 map. We do not care about specific values as they only matter within one TU. I'll document that.

I can't easily use a string literal to parameterize a template.

Hmm. Perhaps I can implement a constexpr perfect_hash(literal) in a header. This would eliminate the need for the builtin.
E.g. https://godbolt.org/z/bzzMbaKhe

Let me give it a try.
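
Something along these lines, perhaps (a minimal sketch; the toy hash and the op string are illustrative, and a gperf-generated perfect hash would take the toy hash's place):

// C++11-compatible constexpr: a single return expression, recursing over the string.
constexpr int __tex_op_hash(const char *__s, int __h = 0) {
  return *__s ? __tex_op_hash(__s + 1, __h * 131 + *__s) : __h;
}

template <int __op> struct __Tag {};  // op-specific specializations hang off this

// The whole computation is a constant expression, so it can parameterize a template:
typedef __Tag<__tex_op_hash("__tex2D_v2")> __tex2D_tag;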

Depending on which particular operation is used, the arguments vary, too.

So something like

T __nv_tex_surf_handler(name, arg1) {
  switch (name) {
    ...
    default:
      panic();
  }
}

T __nv_tex_surf_handler(name, arg1, arg2) {
  switch(...) { ... }
}

and so on?

I'd need to push strcmp-based runtime dispatch down to the implementation of the texture lookups with the same operand signature.

Agree.

That's harder to generalize, as I'd have to implement string-based dispatch for quite a few subsets of the operations -- basically for each variant of cartesian product of {dimensionality, Lod, Level, Sparse}.

Another downside is that the string comparison code will result in functions being much larger than necessary. Probably not a big thing overall, but why add overhead that would be paid for by all users and which does not buy us anything?

If it didn't buy us anything, I'd agree. The thing I'm concerned about is readability of this code. Which, if we want to tie it back to users, affects our ability to catch bugs in this implementation.

Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

Ah, maybe I wasn't clear then. I'm not actually super-concerned with the compiler builtin. It'd be nice to get rid of it if there's a clean way to do so, but if we don't, that's ok. Basically, the builtin is just for changing strcmp(x, "foo") into builtin(x) == builtin("foo"). Fine.

What I'm more concerned with is the spaghetti of macros here to do something as simple as a series of overloaded functions. It seems like a premature optimization, and I don't feel confident I can check it for bugs.

tra added a comment.Sep 22 2021, 12:38 PM

Depending on which particular operation is used, the arguments vary, too.

So something like

T __nv_tex_surf_handler(name, arg1) {
  switch (name) {
    ...
    default:
      panic();
  }
}

T __nv_tex_surf_handler(name, arg1, arg2) {
  switch(...) { ... }
}

and so on?

Yes, and there will be multiple such overloads for each name. So the switch will have to be replicated/adjusted in each overload.

If it didn't buy us anything, I'd agree. The thing I'm concerned about is readability of this code. Which, if we want to tie it back to users, affects our ability to catch bugs in this implementation.

Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

Ah, maybe I wasn't clear then. I'm not actually super-concerned with the compiler builtin. It'd be nice to get rid of it if there's a clean way to do so, but if we don't, that's ok. Basically, the builtin is just for changing strcmp(x, "foo") into builtin(x) == builtin("foo"). Fine.

What I'm more concerned with is the spaghetti of macros here to do something as simple as a series of overloaded functions. It seems like a premature optimization, and I don't feel confident I can check it for bugs.

The choice is between using macros to generate the boilerplate vs. replicating things manually. If we could get templates to generate inline asm operands, that would be great. Unfortunately that requires string literals, so the macros are the only way to construct the right instruction.
If I do not do that with macros, I'll have to manually write each instruction variant and get it right every time.

I can preprocess the macros and commit the results, but that would be about an order of magnitude more code than what we have now, and harder to change en masse without errors. E.g. try spotting the differences between any two neighboring functions here: https://gist.github.com/Artem-B/ec4290809650f5092d61d6dafa6b0131

tra updated this revision to Diff 374360.Sep 22 2021, 1:52 PM

Switched to purely in-header implementation based on constexpr perfect hash.

tra updated this revision to Diff 374362.Sep 22 2021, 2:00 PM

Require C++11 for texture support.

tra updated this revision to Diff 374391.Sep 22 2021, 4:00 PM
tra marked 2 inline comments as done.

Added better C++11 guards.

tra added inline comments.Sep 22 2021, 4:02 PM
clang/lib/Headers/__clang_cuda_texture_intrinsics.h
42

what are you trying to accomplish with an anon ns inside a header?

I wanted to give all functions internal linkage, so they do not pollute visible symbols. Without that and with numeric tag IDs not being stable, we could end up having ODR issues in code compiled with -fgpu-rdc by different clang versions.

I've moved all defined functions into namespace __cuda_tex, so I don't have to use an extra prefix on all the types the header creates.
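
Rough shape of the layout being described (contents elided, declarations illustrative):

namespace __cuda_tex {
namespace {  // internal linkage: tags and functions stay private to each TU, so
             // hash values that differ between header versions can't collide
             // at link time under -fgpu-rdc.
constexpr int __tex_op_hash(const char *__s);  // perfect hash over op names
template <int> struct __Tag;                   // op tags derived from the hash
// ... __tex_fetch_v4 specializations, __convert, __tex_fetch ...
}  // namespace
}  // namespace __cuda_tex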

For starters, it seems that the purpose of this file is to define the __nv_tex_surf_handler "function" -- is that right?

Yes. I've added a comment at the top of the doc.

92

I've added an include guard instead. This header is included from the CUDA runtime wrapper for all compilations. We don't want to break folks who use C++98 but don't need textures. If they do try to use them, they will get a static assert.

tra updated this revision to Diff 374404.Sep 22 2021, 5:12 PM

Cleanups. Added more comments.

tra updated this revision to Diff 374405.Sep 22 2021, 5:13 PM

Removed a test file committed by mistake.

tra updated this revision to Diff 374604.Sep 23 2021, 10:17 AM

Added a test.

tra updated this revision to Diff 374642.Sep 23 2021, 12:08 PM

Disable sparse ops for pre-sm_60 GPUs.

tra updated this revision to Diff 374644.Sep 23 2021, 12:21 PM

Sort push/pop_macro.

jlebar accepted this revision.Sep 24 2021, 1:26 PM

Okay, I give up on the phab interface. It's unreadable with all the existing
comments and lint errors.

Hope you don't mind comments this way. I'm just going to put it all in a giant
code block so it doesn't get wrapped or whatever.

+// __nv_tex_surf_handler() provided by this header as a macro.
+#define __nv_tex_surf_handler(__op, __ptr, ...)                                \
+  __cuda_tex::__tex_fetch<__cuda_tex::__Tag<__cuda_tex::__tex_op_hash(__op)>>( \
+      __ptr, __VA_ARGS__)

::__cuda_tex

+// Put all functions into anonymous namespace so they have internal linkage.

Say a little more?  Specifically, you want anon ns because this is device code
and it has to work even without being linked.

(Also, are you sure that plain `inline` doesn't do the right thing?  Like, we
have lots of CUDA headers that are `inline`'ed without all being in an anon
ns.)

+// First, we need a perfect hash function and a few constexpr helper functions
+// for converting a string literal into a numeric value which can be used to
+// parametrize a template. We can not use string literals for that as that would
+// require C++20.
+//
+// The hash function was generated with 'gperf' and then manually converted into
+// its constexpr equivalent.
+//
+// NOTE: the perfect hashing scheme comes with inherent self-test. If the hash
+// function has a collision for any of the texture operations, the compilation
+// will fail due to an attempt to redefine a tag with the same value. If the
+// header compiles, then the hash function is good enough for the job.

I guess if it has a self-test then that's fine.  Though is this really better
than a series of `if` statements with strcmp?  I guess I am scared of this kind
of thing because I did it once in ccache.  I thought I was very clever and got
a good speedup.  1 year later I found out I'd broken handling of __DATE__ and
__TIME__.  o.O
clang/lib/Headers/__clang_cuda_texture_intrinsics.h
27

::__cuda_tex (appears twice)

54

Write a little more? This looks super-suspicious, but you need it specifically because these are *device* functions.

Presumably as a separate commit we should add tests to the test_suite repository to ensure that this at least still compiles with different versions of CUDA?

tra updated this revision to Diff 374979.Sep 24 2021, 4:01 PM

More comments.

tra added a comment.Sep 24 2021, 4:12 PM

Okay, I give up on the phab interface. It's unreadable with all the existing
comments and lint errors.

Yeah. Phabricator experience is not great.

+// Put all functions into anonymous namespace so they have internal linkage.

Say a little more? Specifically, you want anon ns because this is device code
and it has to work even without being linked.

(Also, are you sure that plain inline doesn't do the right thing? Like, we
have lots of CUDA headers that are inline'ed without all being in an anon
ns.)

We do want inlining, but the main purpose here is to avoid potential ODR issues with -fgpu-rdc, where multiple TUs may be compiled with different versions of this header. Because the hash may change, we could end up with the same Tag types (and fetch functions) with the same names meaning different things in different TUs.

+// NOTE: the perfect hashing scheme comes with inherent self-test. If the hash
+// function has a collision for any of the texture operations, the compilation
+// will fail due to an attempt to redefine a tag with the same value. If the
+// header compiles, then the hash function is good enough for the job.

I guess if it has a self-test then that's fine. Though is this really better
than a series of if statements with strcmp?

Yes, I think the somewhat obfuscated metaprogramming here wins on points, IMO.

  • it's fairly well-structured, even if macros make it a bit of a pain to dig through.
  • assumptions about the perfect hash are minimal -- it's just a 1:1 string->integer map. If that assumption is violated we're guaranteed to get a compilation error when we instantiate the templates that map to the same value. I did test that by changing the hash function.
  • strcmp() will result in 100+ comparisons. That alone will be a pain to write manually. In my experience, having more than a handful of nearly-identical but critically different chunks of code makes the whole thing very error-prone. I tried that early on, before I figured out how to parameterize templates by a string.
  • We'll also need to use it in a function template, so the code will get replicated over all the instances of the signatures we'll need to implement. While it's probably not a showstopper, it's still additional IR we'd have to deal with. Adding an incremental burden on all CUDA users is worse than an additional mental burden on whoever may need to read this code (most likely me).

I guess I am scared of this kind of thing because I did it once in ccache. I thought I was very clever and got
a good speedup. 1 year later I found out I'd broken handling of __DATE__ and __TIME__. o.O

Being wary of making easy-to-miss errors here, I literally did exhaustive testing of all variants (2972 of them) of the high-level API calls provided by the NVIDIA headers and verified that we end up generating identical instructions and parameters.
I will add that test to the test-suite, as it needs the actual CUDA headers.

Presumably as a separate commit we should add tests to the test_suite repository to ensure that this at least still compiles with different versions of CUDA?

That's the plan. I've tested the patch manually down to CUDA-9. It will not work with CUDA-8 or older, as they have a completely different under-the-hood implementation in the CUDA headers. I'll add an include guard for the old CUDA versions.

tra updated this revision to Diff 377687.Oct 6 2021, 2:30 PM

Removed obsolete comment.

tra updated this revision to Diff 377703.Oct 6 2021, 3:07 PM

Use int for string hash calculations to avoid dealing with char signedness.

This revision was landed with ongoing or failed builds.Oct 6 2021, 3:16 PM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Oct 6 2021, 5:13 PM

This breaks tests on Mac: http://45.33.8.238/macm1/19372/step_7.txt

Please take a look :)

kgk added a subscriber: kgk.Oct 6 2021, 10:12 PM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

tra added a comment.Oct 7 2021, 9:59 AM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

Only textures are supported at the moment, but adding support for surface operations would indeed be very similar.

Basically we just need to add specializations for the surface operations. It's fairly tedious, but straightforward in principle.

clang/lib/Headers/__clang_cuda_texture_intrinsics.h
62–65

An alternative would be to use something like this: https://github.com/gelldur/gcpp/blob/master/src/gcpp/string/ConstexprString.hpp

That would be a bit too complicated for this limited use case.

kgk added a comment.Oct 7 2021, 4:29 PM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

Only textures are supported at the moment, but adding support for surface operations would indeed be very similar.

Basically we just need to add specializations for the surface operations. It's fairly tedious, but straightforward in principle.

Very cool! I am selfishly curious if support for surface operations is something you plan to add. I had a go at implementing it myself today based on this patch, and found it a bit harder than I was expecting 😅

I appreciate your work on this; it's great to see cuda texture support being added to clang!

tra added a comment.Oct 7 2021, 5:16 PM

Very cool! I am selfishly curious if support for surface operations is something you plan to add. I had a go at implementing it myself today based on this patch, and found it a bit harder than I was expecting 😅

I don't have immediate plans to do it. It's pretty far down my todo list, so I don't know when/if I'll get to it. So, you'll have plenty of time to beat me to it. :-]