This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Implement experimental support for texture lookups.
ClosedPublic

Authored by tra on Sep 20 2021, 11:27 AM.

Details

Summary

The patch implements support for texture lookups (mostly) in a header file.

The patch has been tested on a source file with all possible combinations of argument types supported by the CUDA headers;
the file was compiled and the generated instructions and their parameters were verified to match the code generated by NVCC.
Unfortunately, compiling texture code requires the CUDA headers, so it can't be tested in clang itself.
The test will need to be added to the test-suite later.

While the generated code compiles and appears to match NVCC's output, I do not have any texture-using code with which I could test the correctness of the implementation.

The gory details of the implementation follow.


The user-facing texture lookup API relies on NVCC's __nv_tex_surf_handler builtin, which is actually a set of overloads.
The catch is that it's overloaded not only by the argument types, but also by the value of the first argument.

Implementing it in the compiler itself would be rather messy as there are a lot of texture lookup variants.

Implementing texture lookups in C++ is somewhat more maintainable.
If we could use a string literal as a template parameter, the implementation could be done completely in the headers.
Unfortunately, literal classes as template parameters are only available in C++20.

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.
As an alternative, this patch introduces the __nvvm_texture_op builtin, which maps known texture operations to an integer that is then used to parametrize texture operations.
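
A minimal illustration of the dispatch idea (hypothetical names and tag value; cudaTextureObject_t and float4 come from the CUDA headers): once the op string has been reduced to an integer at compile time, that integer can select a template specialization, so the dispatch costs nothing at run time.

// No generic definition: an unknown op tag fails to compile.
template <int __op> struct __tex_fetch_v4;

// Hypothetical specialization for one texture operation.
template <> struct __tex_fetch_v4<42 /* tag of some 2D lookup op */> {
  __device__ static float4 __run(cudaTextureObject_t __obj, float __x, float __y);
};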

A lot of texture operations are fairly uniform, with the differences only in the instruction suffix.
Unfortunately, inline assembly requires its input to be a string literal, so we cannot rely on templates to generate it and have to resort to the preprocessor to do the job.
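
To make that concrete, here is a rough sketch of the kind of macro involved (the macro name, operand signature, and constraints are illustrative, not the actual __IMPL* macros from the patch). The PTX mnemonic has to end up as a single string literal, which is why the suffix is pasted in by the preprocessor rather than supplied by a template parameter:

#define __TEX_FETCH_2D(__suffix, __ctype, __reg)                              \
  __device__ void __fetch_2d_##__suffix(cudaTextureObject_t __obj, float __x, \
                                        float __y, __ctype *__out) {          \
    asm("tex.2d.v4." #__suffix ".f32 {%0, %1, %2, %3}, [%4, {%5, %6}];"       \
        : "=" __reg(__out[0]), "=" __reg(__out[1]), "=" __reg(__out[2]),      \
          "=" __reg(__out[3])                                                 \
        : "l"(__obj), "f"(__x), "f"(__y));                                    \
  }

__TEX_FETCH_2D(f32, float, "f")        // tex.2d.v4.f32.f32
__TEX_FETCH_2D(s32, int, "r")          // tex.2d.v4.s32.f32
__TEX_FETCH_2D(u32, unsigned int, "r") // tex.2d.v4.u32.f32

Each invocation stamps out one variant whose only differences are the instruction suffix and the register constraint, which is exactly the repetition the __IMPL* macros factor out.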

Another quirk is that historically there were two ways to refer to a texture.
The newer API uses cudaTextureObject_t, which is an opaque scalar value.
Older APIs used an object of texture<> type which was magically converted to an opaque texture handle (essentially a cudaTextureObject_t).
There's no good way to do this conversion explicitly, so without some trickery each texture lookup would have to be implemented twice, once for each way of referring to a texture.
However, we can cheat a bit by introducing a dummy inline assembly.
Nominally it accepts texture<> as input, but the compiler converts it to a cudaTextureObject_t, so the generated assembly just returns the correct handle.
This allows both reference styles to use the same implementation.
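
A rough sketch of that trick, with hypothetical names (the helper in the actual header differs):

// The asm body is empty; its only job is to make the compiler apply its
// implicit texture<> -> cudaTextureObject_t conversion for the "l" operand.
template <class __TexRefT>
__device__ cudaTextureObject_t __tex_handle(__TexRefT __ref) {
  cudaTextureObject_t __handle;
  asm("" : "=l"(__handle) : "l"(__ref));
  return __handle;
}

// Lookups written against the newer API already carry the opaque handle.
__device__ inline cudaTextureObject_t __tex_handle(cudaTextureObject_t __obj) {
  return __obj;
}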

Overall code structure (a rough sketch follows the list):

  • struct __FT; // maps a texture data type to the 4-element texture fetch result type.
  • class __tex_fetch_v4<__op>; // implements run methods for specific texture data types.
  • class __convert<DstT,SrcT>; // converts the result of __tex_fetch_v4 into the expected return type (usually a smaller slice of the 4-element fetch result).
  • __tex_fetch<__op,...>(); // calls the appropriate __convert(__tex_fetch_v4()) variants.
  • #define __nv_tex_surf_handler(__op, __ptr, ...) // calls the appropriate __tex_fetch<>.
  • __IMPL* macros do the boilerplate generation of __tex_fetch_v4 variants.
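
A rough end-to-end sketch of how these pieces chain together (simplified, hypothetical signatures; the actual header's dispatch key and argument handling differ):

template <class __DataT> struct __FT;                    // e.g. __FT<float>::type == float4
template <int __op> struct __tex_fetch_v4;               // per-op inline-asm __run() methods
template <class __DstT, class __SrcT> struct __convert;  // slices the 4-element fetch result

template <int __op, class __DataT, class... __Args>
__device__ void __tex_fetch(__DataT *__ptr, cudaTextureObject_t __obj,
                            __Args... __args) {
  typedef typename __FT<__DataT>::type __FetchT;
  *__ptr = __convert<__DataT, __FetchT>::__run(
      __tex_fetch_v4<__op>::template __run<__FetchT>(__obj, __args...));
}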

Diff Detail

Event Timeline

tra created this revision.Sep 20 2021, 11:27 AM
tra requested review of this revision.Sep 20 2021, 11:27 AM
Herald added a project: Restricted Project.Sep 20 2021, 11:27 AM
tra added a comment.Sep 20 2021, 4:54 PM

Here's an expanded and formatted version of the header: https://gist.github.com/Artem-B/ec4290809650f5092d61d6dafa6b0131
It may help to see what's going on.

tra updated this revision to Diff 373747.Sep 20 2021, 5:00 PM
tra edited the summary of this revision. (Show Details)

cosmetic cleanups.

hliao accepted this revision.Sep 21 2021, 1:16 PM

Cool! I like the idea of *compile-time* dispatch. LGTM except minor warnings from clang-tidy. Could you fix them before committing this change?

This revision is now accepted and ready to land.Sep 21 2021, 1:16 PM
tra updated this revision to Diff 374028.Sep 21 2021, 2:00 PM

Minor cleanups

tra updated this revision to Diff 374034.Sep 21 2021, 2:31 PM

Undo useless NOLINT

tra added a comment.Sep 21 2021, 2:31 PM

Most of the clang-tidy warnings are irrelevant -- it tries to parse the header all by itself, without the CUDA headers.
It also ignores NOLINTNEXTLINE(clang-diagnostic-error) which was intended to suppress the warning triggered by #error.

The only useful one was in SemaChecking.cpp -- fixed now.

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.

I guess I'm confused... Is the parameter value that we're "overloading" on usually/always a constant? In that case, there's no overhead with runtime dispatch. Or, is it not a constant? In which case, how does nvcc generate a single instruction for this idiom at all?

But then I see switch statements in the code, so now I'm extra confused. :)

Overall, I am unsure of why we need all of this magic. We can rely on LLVM to optimize away constant integer comparisons, and also even comparisons between string literals.

What specifically would be inefficient if this were a series of "real" overloaded functions, with none of the macros, templates, or builtins? (Assuming efficiency is the concern here?)

clang/lib/AST/ExprConstant.cpp
11097 ↗(On Diff #374034)

Write a comment explaining what this function does?

(It seems to...translate a string into an integer? If so, to me, it's strange that it uses a sorted list for this because...what if I add another function? Won't that mess up all the numbers? Anyway, to be clarified in the comment.)

Now that I read more, I see that you don't care about this being a stable mapping etc etc...

I don't really get why this has to be a builtin at all, though. If it's always a string literal, a simple strcmp will do the job, LLVM can optimize this? And I'm almost sure you can assert that the char* is always a string literal, so you can guarantee that it's always optimized away.

11098 ↗(On Diff #374034)

stuuported

11209 ↗(On Diff #374034)

how do we know the arg is a string constant? Looking at the builtin def it doesn't seem that we enforce it there.

clang/lib/Headers/__clang_cuda_texture_intrinsics.h
13

is __compilation intentional? (Maybe search-and-replace bug?)

42

what are you trying to accomplish with an anon ns inside a header?

42

I know you wrote it in the commit message, but this file could really use comments, otherwise I'm afraid you are going to be the only human being on the planet who can edit this...

For starters, it seems that the purpose of this file is to define the __nv_tex_surf_handler "function" -- is that right?

58–59

I have no idea what bt and ft are supposed to stand for. "fetch type" and ...? But __FT stands for "fundamental type" per the comment?

Oh, I found it later, "base type".

I'm all for brevity, but would __base_ty and __fetch_ty be too long?

91

There are only a limited number of these. Could we assert that __T is one of the expected vector types, just for readability and maybe to help the next person who tries to edit this?

92

this is c++11-only. Which, you know what, fine by me. But might be worth an explicit #error at least?

93

I think this is also C++11 syntax

tra added a comment.Sep 22 2021, 10:30 AM

One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.

I guess I'm confused... Is the parameter value that we're "overloading" on usually/always a constant? In that case, there's no overhead with runtime dispatch. Or, is it not a constant? In which case, how does nvcc generate a single instruction for this idiom at all?

It's a string literal. And you're actually right, clang does manage to optimize strcmp with a known value. https://godbolt.org/z/h351hfsMf
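
Roughly the kind of thing that godbolt link demonstrates (an illustrative snippet, not code from the patch; the op strings are made up):

#include <string.h>

int dispatch(const char *op) {
  if (!strcmp(op, "__tex2D_v2"))
    return 1;
  if (!strcmp(op, "__tex3D_v2"))
    return 2;
  return 0;
}

// With a literal argument -- dispatch("__tex2D_v2") -- clang folds the strcmp
// calls away and the whole function reduces to a constant.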

However, it's only part of the problem. Depending on which particular operation is used, the arguments vary, too. I still need templates that are effectively parameterized by that string-literal argument, and I can't easily do that until C++20.
I'd need to push strcmp-based runtime dispatch down to the implementation of the texture lookups with the same operand signature. That's harder to generalize, as I'd have to implement string-based dispatch for quite a few subsets of the operations -- basically for each variant of the Cartesian product of {dimensionality, Lod, Level, Sparse}.

Another downside is that the string comparison code will result in functions being much larger than necessary. Probably not a big thing overall, but why add overhead that would be paid for by all users and which does not buy us anything? Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

But then I see switch statements in the code, so now I'm extra confused. :)

That switch is for a special case of texture lookup which may result in one of four texture instruction variants. All others map 1:1.

Overall, I am unsure of why we need all of this magic. We can rely on LLVM to optimize away constant integer comparisons, and also even comparisons between string literals.

It makes it possible to use a string literal to parameterize templates, which allows generating variants of __nv_tex_surf_handler in a relatively concise way.

What specifically would be inefficient if this were a series of "real" overloaded functions, with none of the macros, templates, or builtins? (Assuming efficiency is the concern here?)

It's both efficiency and avoidance of typos in repetitive, nearly identical code.
There are ~2500 high-level texture lookup variants. They end up calling about 600 different __nv_tex_surf_handler overloads which, in turn, generate ~70 unique inline assembly variants.
The current code structure reflects that hierarchy. This is essentially why the parameterization by operation name happens early, instead of the name being used as a key for runtime dispatch at the end.

clang/lib/AST/ExprConstant.cpp
11097 ↗(On Diff #374034)

Yes, it's just a 1:1 map. We do not care about specific values as they only matter within one TU. I'll document that.

I can't easily use a string literal to parameterize a template.

Hmm. Perhaps I can implement a constexpr perfect_hash(literal) in a header. This would eliminate the need for the builtin.
E.g. https://godbolt.org/z/bzzMbaKhe

Let me give it a try.
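
Something along these lines, perhaps (a minimal sketch; the toy hash and the op string are illustrative, and a gperf-generated perfect hash would take the toy hash's place):

// C++11-compatible constexpr: a single return expression, recursing over the string.
constexpr int __tex_op_hash(const char *__s, int __h = 0) {
  return *__s ? __tex_op_hash(__s + 1, __h * 131 + *__s) : __h;
}

template <int __op> struct __Tag {};  // op-specific specializations hang off this

// The whole computation is a constant expression, so it can parameterize a template:
typedef __Tag<__tex_op_hash("__tex2D_v2")> __tex2D_tag;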

Depending on which particular operation is used, the arguments vary, too.

So something like

T __nv_tex_surf_handler(name, arg1) {
  switch (name) {
    ...
    default:
      panic();
  }
}

T __nv_tex_surf_handler(name, arg1, arg2) {
  switch(...) { ... }
}

and so on?

I'd need to push strcmp-based runtime dispatch down to the implementation of the texture lookups with the same operand signature.

Agree.

That's harder to generalize, as I'd have to implement string-based dispatch for quite a few subsets of the operations -- basically for each variant of cartesian product of {dimensionality, Lod, Level, Sparse}.

Another downside is that the string comparison code will result in functions being much larger than necessary. Probably not a big thing overall, but why add overhead that would be paid for by all users and which does not buy us anything?

If it didn't buy us anything, I'd agree. The thing I'm concerned about is readability of this code. Which, if we want to tie it back to users, affects our ability to catch bugs in this implementation.

Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

Ah, maybe I wasn't clear then. I'm not actually super-concerned with the compiler builtin. It'd be nice to get rid of it if there's a clean way to do so, but if we don't, that's ok. Basically, the builtin is just for changing strcmp(x, "foo") into builtin(x) == builtin("foo"). Fine.

What I'm more concerned with is the spaghetti of macros here to do something as simple as a series of overloaded functions. It seems like a premature optimization, and I don't feel confident I can check it for bugs.

tra added a comment.Sep 22 2021, 12:38 PM

Depending on which particular operation is used, the arguments vary, too.

So something like

T __nv_tex_surf_handler(name, arg1) {
  switch (name) {
    ...
    default:
      panic();
  }
}

T __nv_tex_surf_handler(name, arg1, arg2) {
  switch(...) { ... }
}

and so on?

Yes, and there will be multiple such overloads for each name. So the switch will have to be replicated/adjusted in each overload.

If it didn't buy us anything, I'd agree. The thing I'm concerned about is readability of this code. Which, if we want to tie it back to users, affects our ability to catch bugs in this implementation.

Having one trivial compiler builtin that simplifies things a lot is a better trade-off, IMO.

Ah, maybe I wasn't clear then. I'm not actually super-concerned with the compiler builtin. It'd be nice to get rid of it if there's a clean way to do so, but if we don't, that's ok. Basically, the builtin is just for changing strcmp(x, "foo") into builtin(x) == builtin("foo"). Fine.

What I'm more concerned with is the spaghetti of macros here to do something as simple as a series of overloaded functions. It seems like a premature optimization, and I don't feel confident I can check it for bugs.

The choice is between using macros to generate the boilerplate vs. replicating things manually. If we could get templates to generate inline asm operands, that would be great. Unfortunately that requires string literals, so the macros are the only way to construct the right instruction.
If I do not do that with macros, I'll have to manually write each instruction variant and get it right every time.

I can preprocess the macros and commit the results, but that would be about an order of magnitude more code than what we have now, and harder to change en masse without errors. E.g. try spotting the differences between any two neighboring functions here: https://gist.github.com/Artem-B/ec4290809650f5092d61d6dafa6b0131

tra updated this revision to Diff 374360.Sep 22 2021, 1:52 PM

Switched to purely in-header implementation based on constexpr perfect hash.

tra updated this revision to Diff 374362.Sep 22 2021, 2:00 PM

Require C++11 for texture support.

tra updated this revision to Diff 374391.Sep 22 2021, 4:00 PM
tra marked 2 inline comments as done.

Added better C++11 guards.

tra added inline comments.Sep 22 2021, 4:02 PM
clang/lib/Headers/__clang_cuda_texture_intrinsics.h
42

what are you trying to accomplish with an anon ns inside a header?

I wanted to give all functions internal linkage, so they do not pollute visible symbols. Without that and with numeric tag IDs not being stable, we could end up having ODR issues in code compiled with -fgpu-rdc by different clang versions.

I've moved all defined functions into namespace __cuda_tex, so I don't have to use an extra prefix on all the types the header creates.
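
Rough shape of the layout being described (contents elided, declarations illustrative):

namespace __cuda_tex {
namespace {  // internal linkage: tags and functions stay private to each TU, so
             // hash values that differ between header versions can't collide
             // at link time under -fgpu-rdc.
constexpr int __tex_op_hash(const char *__s);  // perfect hash over op names
template <int> struct __Tag;                   // op tags derived from the hash
// ... __tex_fetch_v4 specializations, __convert, __tex_fetch ...
}  // namespace
}  // namespace __cuda_tex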

For starters, it seems that the purpose of this file is to define the __nv_tex_surf_handler "function" -- is that right?

Yes. I've added a comment at the top of the doc.

92

I've added an include guard instead. This header is included from the CUDA runtime wrapper for all compilations. We don't want to break folks who use C++98 but don't need textures. If they do try to use them, they will get a static assert.

tra updated this revision to Diff 374404.Sep 22 2021, 5:12 PM

Cleanups. Added more comments.

tra updated this revision to Diff 374405.Sep 22 2021, 5:13 PM

Removed a test file committed by mistake.

tra updated this revision to Diff 374604.Sep 23 2021, 10:17 AM

Added a test.

tra updated this revision to Diff 374642.Sep 23 2021, 12:08 PM

Disable sparse ops for pre-sm_60 GPUs.

tra updated this revision to Diff 374644.Sep 23 2021, 12:21 PM

Sort push/pop_macro.

jlebar accepted this revision.Sep 24 2021, 1:26 PM

Okay, I give up on the phab interface. It's unreadable with all the existing
comments and lint errors.

Hope you don't mind comments this way. I'm just going to put it all in a giant
code block so it doesn't get wrapped or whatever.

+// __nv_tex_surf_handler() provided by this header as a macro.
+#define __nv_tex_surf_handler(__op, __ptr, ...)                                \
+  __cuda_tex::__tex_fetch<__cuda_tex::__Tag<__cuda_tex::__tex_op_hash(__op)>>( \
+      __ptr, __VA_ARGS__)

::__cuda_tex

+// Put all functions into anonymous namespace so they have internal linkage.

Say a little more?  Specifically, you want anon ns because this is device code
and it has to work even without being linked.

(Also, are you sure that plain `inline` doesn't do the right thing?  Like, we
have lots of CUDA headers that are `inline`'ed without all being in an anon
ns.)

+// First, we need a perfect hash function and a few constexpr helper functions
+// for converting a string literal into a numeric value which can be used to
+// parametrize a template. We can not use string literals for that as that would
+// require C++20.
+//
+// The hash function was generated with 'gperf' and then manually converted into
+// its constexpr equivalent.
+//
+// NOTE: the perfect hashing scheme comes with inherent self-test. If the hash
+// function has a collision for any of the texture operations, the compilation
+// will fail due to an attempt to redefine a tag with the same value. If the
+// header compiles, then the hash function is good enough for the job.

I guess if it has a self-test then that's fine.  Though is this really better
than a series of `if` statements with strcmp?  I guess I am scared of this kind
of thing because I did it once in ccache.  I thought I was very clever and got
a good speedup.  1 year later I found out I'd broken handling of __DATE__ and
__TIME__.  o.O
clang/lib/Headers/__clang_cuda_texture_intrinsics.h
27

::__cuda_tex (appears twice)

54

Write a little more? This looks super-suspicious, but you need it specifically because these are *device* functions.

Presumably as a separate commit we should add tests to the test_suite repository to ensure that this at least still compiles with different versions of CUDA?

tra updated this revision to Diff 374979.Sep 24 2021, 4:01 PM

More comments.

tra added a comment.Sep 24 2021, 4:12 PM

Okay, I give up on the phab interface. It's unreadable with all the existing
comments and lint errors.

Yeah. Phabricator experience is not great.

+// Put all functions into anonymous namespace so they have internal linkage.

Say a little more? Specifically, you want anon ns because this is device code
and it has to work even without being linked.

(Also, are you sure that plain inline doesn't do the right thing? Like, we
have lots of CUDA headers that are inline'ed without all being in an anon
ns.)

We do want inlining, but the main purpose here is to avoid potential ODR issues with -fgpu-rdc, where multiple TUs may be compiled with different versions of this header. Because the hash may change, we could end up with the same Tag types (and fetch functions) with the same names meaning different things in different TUs.

+// NOTE: the perfect hashing scheme comes with inherent self-test. If the hash
+// function has a collision for any of the texture operations, the compilation
+// will fail due to an attempt to redefine a tag with the same value. If the
+// header compiles, then the hash function is good enough for the job.

I guess if it has a self-test then that's fine. Though is this really better
than a series of if statements with strcmp?

Yes, I think the somewhat obfuscated metaprogramming here wins on points, IMO.

  • it's fairly well-structured, even if macros make it a bit of a pain to dig through.
  • assumptions about the perfect hash are minimal -- it's just a 1:1 string->integer map. If that assumption is violated we're guaranteed to get a compilation error when we instantiate the templates that map to the same value. I did test that by changing the hash function.
  • strcmp() will result in 100+ comparisons. That alone will be a pain to write manually. In my experience, having more than a handful of nearly-identical but critically different chunks of code makes the whole thing very error-prone. I tried that early on, before I figured out how to parameterize templates by a string.
  • We'll also need to use it in a function template, so the code will get replicated over all the instances of the signatures we'll need to implement. While it's probably not a showstopper, it's still additional IR we'd have to deal with. Adding an incremental burden on all CUDA users is worse than an additional mental burden on whoever may need to read this code (most likely me).

I guess I am scared of this kind of thing because I did it once in ccache. I thought I was very clever and got
a good speedup. 1 year later I found out I'd broken handling of __DATE__ and __TIME__. o.O

Being wary of making easy-to-miss errors here, I literally did exhaustive testing of all variants (2972 of them) of the high-level API calls provided by the NVIDIA headers and verified that we end up generating identical instructions and parameters.
I will add that test to the test-suite, as it needs the actual CUDA headers.

Presumably as a separate commit we should add tests to the test_suite repository to ensure that this at least still compiles with different versions of CUDA?

That's the plan. I've tested the patch manually down to CUDA-9. It will not work with CUDA-8 or older, as they have a completely different under-the-hood implementation in the CUDA headers. I'll add an include guard for the old CUDA versions.

tra updated this revision to Diff 377687.Oct 6 2021, 2:30 PM

Removed obsolete comment.

tra updated this revision to Diff 377703.Oct 6 2021, 3:07 PM

Use int for string hash calculations to avoid dealing with char signedness.

This revision was landed with ongoing or failed builds.Oct 6 2021, 3:16 PM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Oct 6 2021, 5:13 PM

This breaks tests on Mac: http://45.33.8.238/macm1/19372/step_7.txt

Please take a look :)

kgk added a subscriber: kgk.Oct 6 2021, 10:12 PM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

tra added a comment.Oct 7 2021, 9:59 AM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

Only textures are supported at the moment, but adding support for surface operations would indeed be very similar.

Basically we just need to add specializations for the surface operations. It's fairly tedious, but straightforward in principle.

clang/lib/Headers/__clang_cuda_texture_intrinsics.h
62–65

An alternative would be to use something like this: https://github.com/gelldur/gcpp/blob/master/src/gcpp/string/ConstexprString.hpp

That would be a bit too complicated for this limited use case.

kgk added a comment.Oct 7 2021, 4:29 PM

Will the new macros in this patch also be useful for supporting the surface-related methods that also use __nv_tex_surf_handler (from surface_indirect_functions.h)?

I gave this new code a try with surf2Dread and surf2Dwrite, and based on the errors, it looks like it may just be a matter of creating the right mappings from Tag to the correct asm using these new macros (e.g. __isurf2Dwrite_v2 and isurf2Dread).

Only textures are supported at the moment, but adding support for surface operations would indeed be very similar.

Basically we just need to add specializations for the surface operations. It's fairly tedious, but straightforward in principle.

Very cool! I am selfishly curious if support for surface operations is something you plan to add. I had a go at implementing it myself today based on this patch, and found it a bit harder than I was expecting 😅

I appreciate your work on this; it's great to see cuda texture support being added to clang!

tra added a comment.Oct 7 2021, 5:16 PM

Very cool! I am selfishly curious if support for surface operations is something you plan to add. I had a go at implementing it myself today based on this patch, and found it a bit harder than I was expecting 😅

I don't have immediate plans to do it. It's pretty far down my todo list, so I don't know when/if I'll get to it. So, you'll have plenty of time to beat me to it. :-]