The patch implements support for texture lookups (mostly) in a header file.
The patch has been tested on a source file with all possible combinations of argument types supported by CUDA headers:
the file was compiled, and the generated instructions and their parameters were verified to match the code generated by NVCC.
Unfortunately, compiling texture code requires CUDA headers and can't be tested in clang itself.
The test will need to be added to the test-suite later.
While the generated code compiles and seems to match NVCC, I do not have any code that uses textures that I could use to test the correctness of the implementation.
The gory details of the implementation follow.
The user-facing texture lookup API relies on NVCC's __nv_tex_surf_handler builtin, which is actually a set of overloads.
The catch is that it's overloaded not only by the argument types, but also by the value of the first argument.
Implementing it in the compiler itself would be rather messy as there are a lot of texture lookup variants.
Implementing texture lookups in C++ is somewhat more maintainable.
If we could use string literals as a template parameter, the implementation could be done completely in the headers.
Unfortunately, literal classes as template parameters are only available in C++20.
One alternative would be to use run-time dispatch, but, given that texture lookup is a single instruction, the overhead would be substantial-to-prohibitive.
Instead, this patch introduces the __nvvm_texture_op builtin, which maps known texture operations to an integer that is then used to parametrize texture operations.
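As a rough host-side illustration of the idea, the sketch below maps an operation name to an integer at compile time and uses that integer as a template parameter. Everything in it is hypothetical: texture_op_id is a plain constexpr function standing in for the builtin, and the operation names, the IDs, and tex_fetch_impl are made up for the example.

```cpp
// Compare two C strings in a constant expression (single-return, C++11 style).
constexpr bool str_eq(const char *a, const char *b) {
  return *a == *b && (*a == '\0' || str_eq(a + 1, b + 1));
}

// Hypothetical stand-in for the builtin: known operation name -> integer ID.
// The names and numbering here are illustrative assumptions only.
constexpr int texture_op_id(const char *name) {
  return str_eq(name, "__tex1D_v2")   ? 1
         : str_eq(name, "__tex2D_v2") ? 2
         : str_eq(name, "__tex3D_v2") ? 3
                                      : 0; // 0 means "unknown operation"
}

// The integer selects a specialization; unknown IDs have no definition,
// so using them is a compile-time error rather than a run-time branch.
template <int Op> struct tex_fetch_impl;
template <> struct tex_fetch_impl<1> { static constexpr int dims = 1; };
template <> struct tex_fetch_impl<2> { static constexpr int dims = 2; };
template <> struct tex_fetch_impl<3> { static constexpr int dims = 3; };

// The macro keeps the string literal at the call site while dispatch
// happens entirely at compile time.
#define TEX_DIMS(op) (tex_fetch_impl<texture_op_id(op)>::dims)

static_assert(TEX_DIMS("__tex2D_v2") == 2, "dispatch resolved at compile time");
```

Because the selection is a template argument, no dispatch code survives into the generated output, which is the property the run-time alternative lacks.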
A lot of texture operations are fairly uniform, with the differences only in the instruction suffix.
Unfortunately, inline assembly requires its input to be a string literal, so we cannot rely on templates to generate it and have to resort to the preprocessor to do the job.
Another quirk is that historically there were two ways to refer to a texture.
The newer API uses cudaTextureObject_t, which is an opaque scalar value.
Older APIs used an object of texture<> type, which was magically converted to an opaque texture handle (essentially the cudaTextureObject_t).
There's no good way to do this conversion explicitly, so we would otherwise have to implement each texture lookup twice, once for each way to refer to a texture.
However, we can cheat a bit by introducing a dummy inline assembly.
Nominally it accepts texture<> as input, but the compiler will convert it to cudaTextureObject_t, so the generated assembly will just return the correct handle.
This allows both reference styles to use the same implementation.
Overall code structure:
- struct __FT; // maps texture data type to the 4-element texture fetch result type.
- class __tex_fetch_v4<__op>; // implements run methods for specific texture data types.
- class __convert<DstT,SrcT>; // converts result of __tex_fetch_v4 into expected return type (usually a smaller slice of 4-element fetch result)
- __tex_fetch<__op,...>(); // calls appropriate __convert(__tex_fetch_v4()) variants.
- #define __nv_tex_surf_handler(__op, __ptr, ...) // calls appropriate __tex_fetch<>
- __IMPL* macros do the boilerplate generation of __tex_fetch_v4 variants.
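To show how these pieces might fit together, here is a host-only sketch of the same layering. All names are hypothetical stand-ins (fetch_type for __FT, and so on), the vector structs replace CUDA's float4/int4, and the fetch returns fake data instead of issuing a texture instruction.

```cpp
// Host-only stand-ins for CUDA vector types (assumption for this sketch).
struct f4 { float x, y, z, w; };
struct i4 { int x, y, z, w; };
struct f2 { float x, y; };

// Role of __FT: map a texel data type to the 4-element fetch result type.
template <class T> struct fetch_type;
template <> struct fetch_type<float> { using type = f4; };
template <> struct fetch_type<f2>    { using type = f4; };
template <> struct fetch_type<int>   { using type = i4; };

// Role of __tex_fetch_v4<op>: one class per operation with run methods.
// A real version would execute the texture instruction; this fakes the data.
template <int Op> struct tex_fetch_v4;
template <> struct tex_fetch_v4<1> {
  template <class V> static V run(float coord) {
    using E = decltype(V{}.x);          // element type of the 4-vector
    E e = static_cast<E>(coord);
    return V{e, e + 1, e + 2, e + 3};
  }
};

// Role of __convert<DstT,SrcT>: slice the 4-element result down to DstT.
template <class Dst, class Src> struct convert;
template <> struct convert<float, f4> { static float run(f4 v) { return v.x; } };
template <> struct convert<f2, f4>    { static f2 run(f4 v) { return f2{v.x, v.y}; } };
template <> struct convert<int, i4>   { static int run(i4 v) { return v.x; } };

// Role of __tex_fetch<op,...>: pick the fetch type, fetch, then convert.
template <int Op, class T> T tex_fetch(float coord) {
  using V = typename fetch_type<T>::type;
  return convert<T, V>::run(tex_fetch_v4<Op>::template run<V>(coord));
}
```

With this fake fetch, tex_fetch<1, float>(2.0f) returns the first lane of the 4-element result, and tex_fetch<1, f2>(2.0f) returns the first two lanes.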
is __compilation intentional? (Maybe search-and-replace bug?)