This adds a cuda header type with file extension "cuh". The output type file
extension is "cuhi" - not sure if this is a good choice. This allows
language servers to properly handle cuh files without additional arguments.
I haven't added any tests for this change. This is my first commit, so I am not sure how to write a test or what tests are appropriate.
Just to be clear, this is currently a WIP - it would still be good if someone took a look at this because I don't know if
this approach makes any sense.
I made this change with the aim of allowing language servers and other tools to cleanly handle
.cuh files, but in theory it should also be possible to actually generate and use precompiled cuda headers.
Perhaps unsurprisingly, precompiled CUDA headers don't actually work (they crash) with the changes made so far.
However, everything does *seem* to be working in the language server/"-fsyntax-only" context
(aside from incorrectly issuing a warning for #pragma once - I will fix this later).
I don't have any idea what would be required to actually get cuda precompiled headers working.
I will spend some time looking into this when I get a chance. Alternatively, we could just disable
generating precompiled cuda headers but allow for syntax checking (-fsyntax-only and associated front end functions).
Hi, welcome! Thank you for the careful and well-motivated first commit. (I also see https://github.com/ccache/ccache/issues/772, hooray for noticing that...)
I am also not 100% sure how to write a test, but I think you may be able to observe the effect of your driver changes by running clang -###. I'd expect you'd then see -x cuda in the output?
Same for the .cuhi change; I presume that with -### you'll be able to see that the driver is building a PCH with output name .cuhi even if that would crash if it actually went through.
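For what it's worth, a driver test along these lines might be a reasonable starting point (just a sketch; the test file name, the RUN flags, and the exact -x spelling the patch ends up passing to cc1 are assumptions I haven't verified):

// Hypothetical clang/test/Driver/cuda-header.cuh (name and flags are guesses)
// RUN: %clang -### -fsyntax-only -nocudainc -nocudalib %s 2>&1 | FileCheck %s
// The unterminated quote below is intentional so this matches either "cuda" or
// "cuda-header", whichever spelling the patch ends up passing to cc1.
// CHECK: "-x" "cuda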
This adds a cuda header type with file extension "cuh". The output type file
extension is "cuhi" - not sure if this is a good choice. This allows
language servers to properly handle cuh files without additional arguments.
CUDA compilation is ... odd. While I don't have strong objections to this change, I'm also not convinced that it's particularly useful, either.
The problem is that just telling the compiler that the header is a CUDA source is not sufficient. Clang relies on a lot of CUDA SDK headers in order to make the standard CUDA things like __host__, __device__, threadIdx, etc. work. For CUDA source files clang pre-includes a lot of headers. I'm not quite sure what clang would be supposed to do when it's given a .cuh as an input. If we pre-include the CUDA headers as we do for .cu compilation, we'll get way more preprocessed output than the user would likely expect. If we do not pre-include CUDA headers, CUDA macros will not get expanded and the output will not be suitable to pass on to further compilation.
On one hand this does not make things worse. On the other, it does not make them better, either. If anything, clang not recognizing CUDA headers on their own is probably better than accepting them and producing invalid output.
Perhaps I'm missing something. Do you have a specific use case in mind for this change?
Also, AFAICT, there's no such thing as a canonical CUDA header file extension. Some projects use .cuh, some use .cu.h, some use just .h and other C++ extensions, so this change would only help with a subset of the headers.
Then there's the fact that for CUDA compilation clang needs the headers from the CUDA SDK. You can't just take a CUDA header and preprocess it with the default flags. If you want the preprocessed file content to match what would be seen in a real compilation when the header is included from a CUDA source file, you must ensure that the header was preprocessed with exactly the same CUDA SDK as the source file it would be included from. In general, --cuda-path should probably be treated as a required argument. I do not see a practical way to make "properly handle cuh files without additional arguments" work reliably. For that we'd need Clang to carry all required CUDA headers, and that's not going to happen any time soon.
That said, it may be worth considering implementing a stand-alone CUDA compilation mode which would provide the bare minimum of CUDA headers without having to include CUDA SDK headers. That would be useful for various tooling scenarios where we may need to deal with CUDA sources/headers but where we do not have the CUDA SDK available and/or do not have access to the correct --cuda-path. That would probably need to be implemented first. Once it's in place, preprocessing a CUDA header may make more sense, as we'll be able to produce preprocessed output that is closer to what would be expected from a preprocessor -- something compilable without too much extra stuff.
My primary goal for this change was to allow language servers and other tooling to properly handle CUDA header files. From my understanding, the way that language servers handle C++ header files is by compiling them with -xc++-header and -fsyntax-only. This is certainly true for ccls and it seems to be true for clangd.
So this can be accomplished without actually being able to produce preprocessed output for CUDA headers - it only requires handling the -fsyntax-only use case.
A secondary goal was to make it so that header tab completion recognizes .cuh files.
This change doesn't depend on the other changes - it only requires a minor edit to clang/lib/Sema/SemaCodeComplete.cpp.
So, as I mentioned before, one approach would be to avoid all the issues with preprocessed output for now by only allowing for compiling with -xcuda-header if -fsyntax-only is enabled.
Actually generating precompiled header output could be implemented later or not at all.
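Roughly, the kind of driver-side guard I have in mind would look like this (purely a sketch; the diagnostic used and the exact location in Driver.cpp are placeholders, not what the patch currently does):

// Hypothetical guard somewhere in the driver (sketch only; the diagnostic
// and the exact spot in Driver.cpp are placeholders).
if (InputType == types::TY_CUDAHeader &&
    !Args.hasArg(options::OPT_fsyntax_only)) {
  // Reject anything beyond syntax checking until PCH generation for CUDA
  // headers actually works.
  Diag(diag::err_drv_unsupported_opt) << "-x cuda-header";
}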
From my limited testing, the changes made so far are sufficient to allow for language servers to handle cuda headers.
clangd seems to just work (but I haven't tested it much), if --cuda-gpu-arch is supplied (and --cuda-path if needed) via compile_flags.txt.
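To be concrete, by compile_flags.txt I mean something like the following next to the sources, one argument per line as clangd expects (the arch and the path are just example values):

--cuda-gpu-arch=sm_75
--cuda-path=/usr/local/cuda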
I have a branch of ccls with a few minor changes working as well: https://github.com/rgreenblatt/ccls/tree/cuh_support.
Also, AFAICT, there's no such thing as a canonical CUDA header file extension. Some projects use .cuh, some use .cu.h, some use just .h and other C++ extensions, so this change would only help with a subset of the headers.
The way things are set up in Types.def, defining a new file type requires an extension. I think cuh is the best choice for this extension.
Not having a canonical CUDA header file extension is unfortunate, but this could be addressed at the tooling level if desired. For example, one could imagine designating a subset of headers to be built with -xcuda-header via a regex or whitelist.
As far as I know, ccls and clangd don't currently have a nice way of doing additional header specific compile commands, but I can't imagine this would be particularly difficult to implement. Another option would be to add -xcuda-header as a compiler flag for all headers as c++ headers should generally be valid cuda.
In general, --cuda-path should probably be treated as a required argument. I do not see a practical way to make "properly handle cuh files without additional arguments" work reliably. For that we'd need Clang to carry all required CUDA headers, and that's not going to happen any time soon. That said, it may be worth considering implementing a stand-alone CUDA compilation mode which would provide the bare minimum of CUDA headers without having to include CUDA SDK headers. That would be useful for various tooling scenarios where we may need to deal with CUDA sources/headers but where we do not have the CUDA SDK available and/or do not have access to the correct --cuda-path.
For the language server use case this isn't necessarily much of a problem.
ccls allows for configuring cuda specific arguments.
If a compile_commands.json is used, figuring out the correct compiler flags for any type of header file is in general non-trivial.
I think ccls tries to find the closest match in filename using some sort of metric.
One could imagine trying to match cuda headers with cuda source file to get the correct values for --cuda-path and --cuda-gpu-arch.
That would still require properly defined CUDA macros. CUDA in clang relies on various CUDA attributes, currently wrapped in __{host|device|global|etc.}__ macros, in order to compile the code. Without them, -fsyntax-only will not give you correct results on most of the CUDA code.
Most likely you'll see tons of errors when compiler sees __device__ and has no idea what to do with it. Hence my suggestion that clang needs at least a minimum subset of CUDA headers to provide the critical subset of macros sufficient to convey critical semantics of CUDA code.
A secondary goal was to make it so that header tab completion recognizes .cuh files.
This change doesn't depend on the other changes - it only requires a minor edit to clang/lib/Sema/SemaCodeComplete.cpp.
Maybe. It depends on how well clang can recover from the errors induced by the unexpanded CUDA macros. This could range from OK on simple code to rather badly if we fail to instantiate templated code.
From my limited testing, the changes made so far are sufficient to allow for language servers to handle cuda headers.
Trivial CUDA headers -- maybe. I have doubts that it would work on something more interesting. E.g. try using it on https://github.com/NVIDIA/cutlass.
clangd seems to just work (but I haven't tested it much), if --cuda-gpu-arch is supplied (and --cuda-path if needed) via compile_flags.txt.
Yes, if you provide the flags and use it on a complete .cu file, then the clang driver takes care of the CUDA magic and pre-includes the right headers so the file parses correctly.
If the flags are supplied, then the same process will work for header files, too. You should be able to process them with -x cuda, which will apply the same magic pre-include.
However, applying the same magic to everything that has a CUDA header extension as the input is not the right thing to do, IMO, as that would not be what the end user expects.
In other words, the magic would be OK for tooling, but not for the general use by default.
Also, AFAICT, there's no such thing as a canonical CUDA header file extension. Some projects use .cuh, some use .cu.h, some use just .h and other C++ extensions, so this change would only help with a subset of the headers.
The way things are set up in Types.def, defining a new file type requires an extension. I think cuh is the best choice for this extension.
I'm OK with this part, if it's necessary in the end.
Not having a canonical CUDA header file extension is unfortunate, but this could be addressed at the tooling level if desired. For example, one could imagine designating a subset of headers to be built with -xcuda-header via a regex or whitelist.
As far as I know, ccls and clangd don't currently have a nice way of doing additional header specific compile commands, but I can't imagine this would be particularly difficult to implement. Another option would be to add -xcuda-header as a compiler flag for all headers as c++ headers should generally be valid cuda.
Addressing it at the tooling level seems like the right place to do it. Applying CUDA magic to all C++ sources by default is probably not a good idea, as all the extra stuff will be observable, and some C++ headers that have checks for CUDA-specific things may get confused. I can't think of a good way to tell if a particular source uses CUDA extensions or not, other than by trying to compile it. But then again, what if it works in both CUDA and C++ mode? We would not know which mode the user intended without them explicitly telling us. We may need some sort of knob for what to assume if we can't figure it out some other way.
In general, --cuda-path should probably be treated as a required argument. I do not see a practical way to make "properly handle cuh files without additional arguments" work reliably. For that we'd need Clang to carry all required CUDA headers, and that's not going to happen any time soon. That said, it may be worth considering implementing a stand-alone CUDA compilation mode which would provide the bare minimum of CUDA headers without having to include CUDA SDK headers. That would be useful for various tooling scenarios where we may need to deal with CUDA sources/headers but where we do not have the CUDA SDK available and/or do not have access to the correct --cuda-path.
For the language server use case this isn't necessarily much of a problem.
ccls allows for configuring cuda specific arguments.
If a compile_commands.json is used, figuring out the correct compiler flags for any type of header file is in general non-trivial.
Agreed. There may not be a definitive single answer in some cases, i.e. the same header could be used from both C++ and CUDA compilations. I guess ultimately we may need to make tooling aware that the same file may have multiple compiled forms. E.g. CUDA sources have multiple instances -- one for the host and one per GPU we compile for. The AST for each instance is not necessarily identical. Usually it's close, so for now we're getting by using only the host-side compilation, but that may give us an incomplete picture. This is somewhat orthogonal to figuring out how to handle CUDA sources in principle, so for now let's still stick with a 1:1 source-to-AST model. For the CUDA/C++ choice, CUDA may indeed be the better default, as a CUDA-mode AST would usually be a superset of a C++ compilation of the same file.
Hmm. Looks like we'll need some sort of heuristic (or user knob) to tell us how to parse a header file in a project with mixed C++/CUDA code.
I think ccls tries to find the closest match in filename using some sort of metric.
One could imagine trying to match cuda headers with cuda source file to get the correct values for --cuda-path and --cuda-gpu-arch.
This would work in some cases, but not in others. E.g. a header used from both C++ and CUDA sources.
All that said, I'm all for improving CUDA handling within the tooling. This patch will play its role, but I still believe it's a bit premature. It will make some use cases work, but I think this may be a good opportunity to make tooling work with CUDA sources in a more consistent manner.
Here's my straw-man proposal:
- Create a new lib/Headers/__clang_cuda_standalone_defs.h and populate it with the bare minimum of macros and declarations: __host__, __device__, other attributes, builtin variables, malloc/free/new/delete, a handful of basic types like dim3/uint3, math functions, etc. Basically as much as we can w/o having to use actual CUDA headers (a rough sketch follows below).
- Teach tooling to pre-include that stand-alone header whenever it wants to parse a CUDA source but has no explicit flags pointing the compiler to the CUDA SDK. Pre-include the standard runtime wrapper header otherwise. This may be done in the driver itself as a fallback mechanism in case the CUDA SDK is not found. Not sure yet.
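A very rough sketch of what such a header could start from, loosely modeled on clang/test/SemaCUDA/Inputs/cuda.h (everything below is illustrative only, not a settled list):

// __clang_cuda_standalone_defs.h -- sketch only, contents not settled.
// Bare minimum so that stand-alone parsing of CUDA code does not require
// a CUDA SDK installation.
#define __host__ __attribute__((host))
#define __device__ __attribute__((device))
#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))
#define __constant__ __attribute__((constant))
#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))

// Minimal builtin types. The builtin variables (threadIdx, blockIdx,
// blockDim, gridDim) could be provided the same way
// __clang_cuda_builtin_vars.h already does for normal CUDA compilation.
struct uint3 { unsigned x, y, z; };
struct dim3 {
  unsigned x, y, z;
  __host__ __device__ dim3(unsigned x = 1, unsigned y = 1, unsigned z = 1)
      : x(x), y(y), z(z) {}
};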
We may not need the special CUDA header type after that. The tooling heuristic should be able to deal with detecting whether it deals with CUDA sources better than clang -- compilation database + user input should help. If/when tooling knows it deals with CUDA, it can tell so to clang with -x cuda.
Does it make sense?
Yep, most of what you're saying makes sense to me. Thanks for taking the time to review this.
A few notes:
That would still require properly defined CUDA macros. CUDA in clang relies on various CUDA attributes, currently wrapped in __{host|device|global|etc.}__ macros, in order to compile the code. Without them, -fsyntax-only will not give you correct results on most of the CUDA code.
Most likely you'll see tons of errors when compiler sees __device__ and has no idea what to do with it. Hence my suggestion that clang needs at least a minimum subset of CUDA headers to provide the critical subset of macros sufficient to convey critical semantics of CUDA code.
Actually this already works (roughly) with the changes made so far.
For example consider the following header:
#pragma once

__global__ void a() {
  unsigned block_idx = blockIdx.x;
  unsigned thread_idx = threadIdx.x;
  __shfl_down_sync(1, 2, 1);
}
When saved as a .cuh and compiled as clang++ file.cuh -fsyntax-only --cuda-gpu-arch=sm_75 (using clang++ built from this commit), this works fine other than an invalid diagnostic for the #pragma once.
Obviously there are several errors when building this with -x c++-header.
The reason why this "works" is because of the change to Driver.cpp. This change makes it so that -x cuda-header is handled similarly (identically?) to -x cuda.
Further changes to Driver.cpp will need to happen to avoid warning about #pragma once and (if desired) to actually output a precompiled header.
A secondary goal was to make it so that header tab completion recognizes .cuh files.
This change doesn't depend on the other changes - it only requires a minor edit to clang/lib/Sema/SemaCodeComplete.cpp.
Maybe. It depends on how well clang can recover from the errors induced by the unexpanded CUDA macros. This could range from OK on simple code to rather badly if we fail to instantiate templated code.
I am pretty sure that header tab completion is totally unrelated to the syntactic validity of headers; it's just finding the list of files in the include path which match the text entered so far and then filtering out files without an accepted extension. This change just adds ".cuh" to the list of acceptable extensions.
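For illustration, the check is conceptually along these lines (this is not the literal code in SemaCodeComplete.cpp, just the shape of the filter):

// Hypothetical sketch of the extension filter used for header completion;
// the real logic lives in clang/lib/Sema/SemaCodeComplete.cpp.
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"

static bool looksLikeHeader(llvm::StringRef Ext) {
  return llvm::StringSwitch<bool>(Ext)
      .Cases("h", "hh", "hpp", "hxx", true)
      .Case("cuh", true) // the addition: accept CUDA headers too
      .Default(false);
}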
Here's my straw-man proposal:
- Create a new lib/Headers/__clang_cuda_standalone_defs.h and populate it with the bare minimum of macros and declarations: __host__, __device__, other attributes, builtin vars, malloc/free/new/delete, handful of basic types like dim3/uint3, math functions etc.... Basically as much as we can w/o having to use actual CUDA headers.
- teach tooling to pre-include that stand-alone header whenever it wants to parse a CUDA source, but has no explicit flags pointing compiler to the CUDA SDK. Pre-include the standard runtime wrapper header otherwise. This may be done in the driver itself as a fallback mechanism in case CUDA SDK is not found. Not sure yet.
Makes sense to me. Maybe this approach should always be used when building with "-fsyntax-only" regardless of whether or not the file is a header?
We may not need the special CUDA header type after that. The tooling heuristic should be able to deal with detecting whether it deals with CUDA sources better than clang -- compilation database + user input should help. If/when tooling knows it deals with CUDA, it can tell so to clang with -x cuda.
This seems like a decent approach to me, but this will result in incorrectly issuing a diagnostic for #pragma once.
This can of course be fixed by directly disabling the warning, but this does seem a bit hacky. There may also be other header specific behavior, but I can't think of any.
On the whole, it does seem a bit gross for tooling to have to compile headers as though they are main files.
'Works' is not exactly the same as 'works correctly'. This example makes a() look like a regular host function, instead of the kernel, and that affects how the rest of the TU gets parsed.
I.e. you'll have further errors if somewhere down below the file has a<<<1,1>>>(). Similar story with ignoring __host__ and __device__ attributes -- that may lead to various overload resolution errors, or reporting conflicting redeclarations/redefinitions for the perfectly valid host/device function overloads. The list goes on.
Obviously there are several errors when building this with -x c++-header.
The reason why this "works" is because of the change to Driver.cpp. This change makes it so that -x cuda-header is handled similarly (identically?) to -x cuda.
Further changes to Driver.cpp will need to happen to avoid warning about #pragma once and (if desired) to actually output a precompiled header.
CUDA compilation in clang rather heavily depends on the compiler understanding CUDA-related attributes, and those come from the header files. You can make some things work without those attributes, but you will have more cases where you'll get wrong results, ranging from slightly wrong to mostly wrong. :-/ If you're curious, search for CUDA in lib/Sema/ and you'll see plenty of places where CUDA attributes affect how the compiler interprets the sources.
A secondary goal was to make it so that header tab completion recognizes .cuh files.
This change doesn't depend on the other changes - it only requires a minor edit to clang/lib/Sema/SemaCodeComplete.cpp.
Maybe. It depends on how well clang can recover from the errors induced by the unexpanded CUDA macros. This could range from OK on simple code to rather badly if we fail to instantiate templated code.
I am pretty sure that header tab completion is totally unrelated to the syntactic validity of headers; it's just finding the list of files in the include path which match the text entered so far and then filtering out files without an accepted extension. This change just adds ".cuh" to the list of acceptable extensions.
SGTM. I'm not against the change. It's just that the patch does too much (automatically inferring that .cuh is a CUDA file) and not enough (it does not really handle stand-alone CUDA headers in a consistent manner).
I'm fine with adding a CUDA header kind, but without automatic inference of the type via the extension. E.g. assigning cuda-headers-are-not-really-supported-yet as the extension would address my concern, while allowing tools or interested users to use -x cuda-header if they need it.
We may not need the special CUDA header type after that. The tooling heuristic should be able to deal with detecting whether it deals with CUDA sources better than clang -- compilation database + user input should help. If/when tooling knows it deals with CUDA, it can tell so to clang with -x cuda.
This seems like a decent approach to me, but this will result in incorrectly issuing a diagnostic for #pragma once.
#pragma once has been biting me with tooling used on C++ headers, too, so I suspect it's not a CUDA-specific issue.
This can of course be fixed by directly disabling the warning, but this does seem a bit hacky. There may also be other header specific behavior, but I can't think of any.
On the whole, it does seem a bit gross for tooling to have to compile headers as though they are main files.
I'm not sure if it should matter much if we only care about -fsyntax-only. The warning for #pragma once is a rather unusual case where the distinction does matter. Introducing a whole new file kind just to work around the warning looks even more hacky than silencing the warning to me. :-/
In any case, just adding a file kind is simple enough. I'm OK with it as long as we have a way not to expose it to the end-users by default until we can make clang's behavior for that data type sensible.
No, this part is definitely working; the full set of SDK headers is included. I have confirmed this by compiling a .cu and a .cuh file with -E and checking that the output is identical. Further, I have confirmed that global<<<_, _>>>() builds, that __host__-only functions can't be used on the device, and that __device__ functions can't be used on the host.
I decided to go ahead and work out exactly what is going on when building with -x cuda-header.
Prior to the last update I made, this was roughly what was happening when compiling a CUDA header:
- Preprocess the header in host mode with type TY_CUDAHeader
- For each CUDA arch, build with type TY_CUDA_DEVICE. This builds the header as though it were a .cu source file, and this is where the #pragma once warnings were occurring.
- If -fsyntax-only was used, exit here.
- Try to construct a fat binary and crash, because the host side was compiled with type TY_CUDAHeader instead of TY_CUDA; TY_CUDAHeader attempts to generate a precompiled header rather than a binary. (Edit: actually the crash was happening earlier without -fsyntax-only.)
To fix this, I added a new type TY_CUDAHeader_DEVICE. This type precompiles a header for a specific device architecture.
This type is used in place of TY_CUDA_DEVICE in the appropriate case.
I think having a header device type is the correct approach if a separate header type is used.
Now compilation looks like this:
- Preprocess the header in host mode with type TY_CUDAHeader
- For each CUDA arch, build with type TY_CUDAHeader_DEVICE. This doesn't issue a warning for #pragma once - it is correctly (I think) treating the file as a header.
- If -fsyntax-only was used, exit here.
- Output precompiled headers for the host and each cuda arch. For example: file.cuh-cuda-nvptx64-nvidia-cuda-sm_60.gch, file.cuh-cuda-nvptx64-nvidia-cuda-sm_75.gch, file.cuh.gch
As far as I can tell this process is now working as expected.
Of course, there is no way to use these precompiled headers right now, so I have no idea if they are at all valid.
Also, I haven't run this with assertions enabled yet (waiting on a build), so it might trip something.
I have tested that everything works as expected using the following header file:
#pragma once

__device__ int device_only() {
  __syncthreads();
  return 0;
}

__host__ int host_only() { return 1; }

__host__ __device__ void check_all_archs() {
#ifdef __CUDA_ARCH__
#if __CUDA_ARCH__ == 750
#pragma message "sm_75"
  // host_only();
#elif __CUDA_ARCH__ == 600
#pragma message "sm_60"
  // host_only();
#else
#pragma message "other sm"
  // host_only();
#endif
#else
#pragma message "host"
  // device_only();
#endif
}

__global__ void global() {
  int out_device = device_only();
  // int host_error = host_only();
}

void f() {
  check_all_archs();
  global<<<1, 1>>>();
}
@tra I am guessing you have already thought about this, but one thing which is worth noting is that language servers only maintain a single AST per file (and this probably won't change). This is the host AST of course. For example, in the above program language servers will only have a diagnostic for #pragma message "host". So, the values of --cuda-gpu-arch aren't relevant for language servers beyond determining what the preprocessor includes. However, the fact that the preprocessor includes depend on arch means that the __clang_cuda_standalone_defs.h approach won't always be perfectly correct.
We were talking about the case where clang can't expand CUDA-specific macros.
It works in your specific case because you do have CUDA installed and the compiler did find it. That's not necessarily the case for everyone everywhere. We want clang to work for tooling even when the CUDA SDK is not installed on the machine at all. My argument was that without a sufficient set of CUDA macros clang can't parse CUDA sources correctly. The CUDA SDK is currently the only source of those macros. We need clang to provide its own for the cases when the CUDA SDK is not available.
Try compiling that source with -nocudainc which will illustrate what happens when CUDA SDK is not found. E.g. https://godbolt.org/z/MoKvfq
To fix this, I added a new type TY_CUDAHeader_DEVICE. This type precompiles a header for a specific device architecture.
This type is used in place of TY_CUDA_DEVICE in the appropriate case.
I think having a header device type is the correct approach if a separate header type is used.
I think this kind of compilation pipeline restructuring is a bit premature. We need to address being able to parse CUDA headers reliably first.
There's also a question of whether it's necessary. It's not very useful for the driver functionality itself, and it can be implemented in the tooling by explicitly telling the driver what we want. Your patch appears to implement it the other way around -- the tooling relies on the driver to implement the CUDA header handling magic. I think the right approach is to teach tooling how to handle CUDA and keep the clang driver changes to a minimum. Based on our previous conversation, we seem to need only -x cuda-header. Everything else can be controlled by the tooling with --cuda-host-only/--cuda-device-only and --cuda-gpu-arch=XXX. Considering that tooling can't deal with more than one sub-compilation, using these flags is going to be necessary anyways.
@tra I am guessing you have already thought about this, but one thing which is worth noting is that language servers only maintain a single AST per file (and this probably won't change).
Using AST from the host-side compilation was an easy-to-do trade-off. It works reasonably well most of the time, at least for the sources we have exact compilation flags for. Re-engineering tooling to deal with multiple ASTs per TU just for CUDA may not be worth it.
This is the host AST of course. For example, in the above program language servers will only have a diagnostic for #pragma message "host". So, the values of --cuda-gpu-arch aren't relevant for language servers beyond determining what the preprocessor includes. However, the fact that the preprocessor includes depend on arch means that the __clang_cuda_standalone_defs.h approach won't always be perfectly correct.
Yes, that's exactly why I mentioned before that ideally we may need multiple ASTs as each device compilation will have a slightly different one. The goal of __clang_cuda_standalone_defs.h is to make it possible to parse CUDA sources at all w/o having to rely on CUDA SDK. It is not intended to solve the multiple-AST issue.
Somehow I totally forgot about or missed this context.
I think this kind of compilation pipeline restructuring is a bit premature.
Yeah, for sure. I was trying to make clang++ file.cuh ... do something sane - this isn't at all needed for tooling. Also, I was mostly doing this to improve my understanding of what is going on in Driver.cpp wrt. CUDA compilation.
Sorry about missing or misunderstanding a lot of what you're saying and saying obvious things.
However, the fact that the preprocessor includes depend on arch means that the __clang_cuda_standalone_defs.h approach won't always be perfectly correct. ... The goal of __clang_cuda_standalone_defs.h is to make it possible to parse CUDA sources at all w/o having to rely on CUDA SDK. ...
Should __clang_cuda_standalone_defs.h depend on the arch? For example, __match_all_sync doesn't exist in sm_35 but does exist in sm_75.
For tooling this might not matter because the host cuda includes don't depend on the arch.
It's open for discussion.
At the very minimum we need the standard CUDA macros, builtin variables, and, maybe, a few decls for things the compiler expects from the CUDA runtime.
This would be a good starting point: https://github.com/llvm/llvm-project/blob/master/clang/test/SemaCUDA/Inputs/cuda.h
It will allow stand-alone CUDA parsing in general.
Arch-dependent things in CUDA headers are largely dealing with compiler builtins, so if they are missing, it should not be a very big deal. I believe tooling can deal with missing function declarations.
Missing types, if there are any, may cause more issues, but I can't think of any interesting types in CUDA headers that are arch-dependent. Maybe bfloat16/tfloat32 for sm_8x+. It should not be a showstopper either.
In any case, stand-alone mode is unlikely to be very useful for a real CUDA compilation, so I'm not too worried about some missing pieces.
Tooling can then pass -x cuda[-header] -nocudainc --cuda-host-only -include __clang_cuda_standalone_defs.h when it processes CUDA files.
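E.g. something along these lines (the stand-alone header is the one proposed above and doesn't exist yet; the input name is just a placeholder):

clang++ -x cuda-header -nocudainc --cuda-host-only -fsyntax-only -include __clang_cuda_standalone_defs.h some_header.cuh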
Then we'll need to test this stand-alone header and see if we need anything else to extract a reasonably complete and correct AST from CUDA sources & headers. Augment the stand-alone wrapper as necessary and repeat.
If standalone mode turns out to be useful beyond tooling, we may consider how to expose it to the end-users. Maybe a --cuda-path=none or some other flag to tell it to pre-include the stand-alone variant of the wrapper.
Inline comment on clang/lib/Driver/Driver.cpp, line 2466:
From my understanding this basically makes the treatment of TY_CUDAHeader (-xcuda-header) identical to TY_CUDA (-xcuda).
Some changes will need to happen below this for correct handling of header files (for example, not warning about using #pragma once).