CUDA support can be enabled in clang-repl with the `--cuda` flag. Device code linking is not yet supported, so `inline` must be used with all `__device__` functions.
Event Timeline
Thanks for working on this!
clang/lib/Interpreter/Offload.cpp | ||
---|---|---|
1 ↗ | (On Diff #510808) | How about DeviceOffload.cpp? |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
155 | To cover the case where platforms have no /tmp we could use fs::createTemporaryFile. However, some platforms have read-only file systems. What do we do there? |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
27 | Where will clang-repl find CUDA headers? Generally speaking, --cuda-path is essential for CUDA compilation, as it's fairly common for users to have more than one CUDA SDK version installed, or to have them installed in a non-default location. | |
157 | Is there any doc describing the big-picture approach to the CUDA REPL implementation and how all the pieces tie together? From the patch I see that we will compile the GPU side of the code to PTX and pack it into a fatbinary, but it's not clear how we get from there to actually launching the kernels. Loading libcudart.so here also does not appear to be tied to anything else. I do not see any direct API calls, and the host-side compilation appears to be done without passing the GPU binary to it, which would normally trigger generation of the glue code to register the kernels with the CUDA runtime. I may be missing something, too. I assume the gaps will be filled in in future patches, but I'm still curious about the overall plan. |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
157 | Hi @tra, thanks for asking. Our reference implementation was done in Cling a while ago by @SimeonEhrig. One of his talks, which I think describes the big picture well, can be found here: https://compiler-research.org/meetings/#caas_04Mar2021 |
Initial interactive CUDA support for clang-repl
What should a user expect to be supported, functionality-wise? I assume it should cover parsing and compilation. I'm not so sure about the execution. Should it be expected to actually launch kernels, or will that come in a future patch?
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
157 | Cling does ring a bell. The slides from the link above look OK. |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
157 | There is also a video. |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
155 | Actually, we can avoid temporary files completely. The reason the fatbinary code is written to a file is the following code in the code generator of the CUDA runtime functions: In the past, I avoided changing that code because it would have meant an extra Clang patch for Cling. Maybe we can use the LLVM virtual file system: https://llvm.org/doxygen/classllvm_1_1vfs_1_1InMemoryFileSystem.html |
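A minimal sketch of what that in-memory route could look like; this is an assumption, not code from the patch, and the virtual path plus the `FatbinContents` parameter are illustrative placeholders:

```cpp
#include "llvm/ADT/IntrusiveRefCntPtr.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/VirtualFileSystem.h"

// Register the generated fatbinary under a virtual path so that CodeGen
// can "open" it without anything ever touching disk.
llvm::IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem>
makeFatbinFS(llvm::StringRef FatbinContents) {
  auto InMemFS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
  InMemFS->addFile("/incr_module.fatbin", /*ModificationTime=*/0,
                   llvm::MemoryBuffer::getMemBufferCopy(FatbinContents));
  // A CompilerInstance's FileManager can then be created on top of an
  // overlay file system that includes InMemFS.
  return InMemFS;
}
```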
clang/include/clang/Interpreter/Interpreter.h | ||
---|---|---|
64–72 | If I understand the change correctly, the "old" create function on its own is not sufficient anymore to create a fully working CompilerInstance - should we make it private? (and unify the two Builder classes into one?) Alternatively, we could keep the current function working and move the bulk of the work to a createInternal function. What do you think? | |
clang/lib/CodeGen/CodeGenAction.cpp | ||
297 | This looks like a change that has implications beyond support for CUDA. Would it make sense to break this out into a separate review, i.e. does this change something in the current setup? | |
clang/lib/Interpreter/IncrementalParser.cpp | ||
197 | Is it possible to move the static function here instead of forward declaring? Or does it depend on other functions in this file? | |
224–237 | Should this be done as part of the "generic" IncrementalParser constructor, or only in the CUDA case by something else? Maybe Interpreter::createWithCUDA? | |
clang/lib/Interpreter/Interpreter.cpp | ||
348–352 | Just wondering about the order here: Do we have to parse the device side first? Does it make a difference for diagnostics? Maybe you can add a comment about the choice here... | |
clang/lib/Interpreter/Offload.cpp | ||
1 ↗ | (On Diff #510808) | Or IncrementalCUDADeviceParser.cpp for the moment - not sure what other classes will be added in the future, and if they should be in the same TU. |
90–91 ↗ | (On Diff #510808) | Is padding to 8 bytes a requirement for PTX? Maybe add a comment... |
clang/lib/Interpreter/Offload.h | ||
20 ↗ | (On Diff #510808) | unused |
With this patch alone, we can launch kernels with the usual syntax. The __device__ functions need to be inline for now. We plan to automate that in the future.
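For illustration, a session might look like the sketch below. This is not output from the patch's tests; the kernel, names, and values are made up, and each line is entered at the prompt after starting `clang-repl --cuda`:

```cpp
// Hypothetical clang-repl --cuda session.
#include <cstdio>
__device__ inline int square(int x) { return x * x; }
__global__ void kernel(int *out) { *out = square(4); }
int *p; cudaMallocManaged((void **)&p, sizeof(int));
kernel<<<1, 1>>>(p);
cudaDeviceSynchronize();
printf("%d\n", *p); // expected to print 16
```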
clang/include/clang/Interpreter/Interpreter.h | ||
---|---|---|
64–72 | Yes, the old create should probably be private. I was also thinking we could merge IncrementalCudaCompilerBuilder with IncrementalCompilerBuilder and make it stateful, with the CUDA SDK path for example. Then we could do something like the sketch below. | |
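The proposed usage, reflowed from the comment above (`CudaPath` is an illustrative placeholder; the method names are the proposal, not necessarily the final API):

```cpp
IncrementalCompilerBuilder Builder;
Builder.setCudaPath(CudaPath);              // CUDA SDK location
auto DeviceCI = Builder.createCudaDevice(); // device-side CompilerInstance
auto HostCI = Builder.createCudaHost();     // host-side CompilerInstance
```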
clang/lib/CodeGen/CodeGenAction.cpp | ||
297 | It actually was a separate patch: D146388. Should I submit that for review? | |
clang/lib/Interpreter/IncrementalParser.cpp | ||
197 | We can probably move this. I just wanted to preserve the history. | |
224–237 | That seems safer, although I did not notice any side effects. | |
clang/lib/Interpreter/Interpreter.cpp | ||
348–352 | The fatbinary from the device side is used in the host pipeline. | |
clang/lib/Interpreter/Offload.cpp | ||
1 ↗ | (On Diff #510808) | I wanted to avoid "CUDA" in case we use it later for HIP. |
clang/tools/clang-repl/ClangRepl.cpp | ||
---|---|---|
157 | We do pass the generated fatbinary to the host side. The device-side compilation happens before the host side. |
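Putting these answers together, the per-input flow would be roughly the following. This is a hedged sketch only; member names such as `DeviceParser` and `IncrParser` are modeled on the discussion, not quoted from the patch:

```cpp
llvm::Expected<PartialTranslationUnit &>
Interpreter::Parse(llvm::StringRef Code) {
  // Device side first: its PTX is packed into a fatbinary, which the
  // host-side CodeGen consumes (cf. -fcuda-include-gpubinary) and which
  // triggers emission of the glue that registers kernels with the runtime.
  if (DeviceParser) {
    auto DevicePTU = DeviceParser->Parse(Code);
    if (!DevicePTU)
      return DevicePTU.takeError();
  }
  // Host side second, now that the fatbinary exists.
  return IncrParser->Parse(Code);
}
```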
clang/lib/CodeGen/CodeGenAction.cpp | ||
---|---|---|
297 | The problem with going for a separate change is that we cannot test it. Landing it without a test makes the commit history unclear. This patch (and the tests we add here) will at least indirectly test that change. |
clang/lib/CodeGen/CodeGenAction.cpp | ||
---|---|---|
297 | Ok, that's why I was asking if it changes something in the current setup (i.e. can be tested). Thanks for clarifying. |
Except for using an in-memory solution for the generated fatbin code, the code looks good to me.
- Combined IncrementalCompilerBuilder and IncrementalCudaCompilerBuilder
- Added --cuda-path support
- Use sys::fs::createTemporaryFile() instead of hardcoding the path
- Other minor refactoring
I am planning to make the in-memory fatbinary handling a separate patch.
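With these changes, an invocation could look like this (the install path is illustrative):

```
clang-repl --cuda --cuda-path=/usr/local/cuda
```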
clang/lib/Interpreter/Offload.cpp | ||
---|---|---|
90–91 ↗ | (On Diff #510808) | This was actually in the original Cling implementation, but it does not seem to be required. |
- Use the virtual file system to store CUDA fatbinaries in memory
- Adapted the Interpreter tests to use the CompilerBuilder
Generally, looks good to me. I'd like to wait for @Hahnfeld and @tra's feedback at least for another week before merging.
@dblaikie, I know that generally we do not want to run tests on the bots, and that makes testing quite hard for this patch. Do you have a suggestion on how to move forward here? In principle, we could have another option where we ask the JIT whether it can execute code on the device, if one is available.
clang/lib/Interpreter/Offload.cpp | ||
---|---|---|
1 ↗ | (On Diff #510808) | Was DeviceOffload.cpp not a better name for the file and its intent? |
clang/lib/Interpreter/Offload.cpp | ||
---|---|---|
1 ↗ | (On Diff #510808) | Yeah, that seems alright; I will change it in the next revision. |
lib/CodeGen changes look OK to me.
clang/lib/CodeGen/CodeGenModule.cpp | ||
---|---|---|
6276 | Could you give me an example of what exactly we'll be skipping here? |
I can confirm the code change in CodeGen works as expected: clang-repl no longer generates temporary files when a CUDA kernel is compiled.
Compiling a simple CUDA application, and saving the generated PTX and fatbin code via `clang++ ../helloWorld.cu -o helloWorld -L/usr/local/cuda/lib64 -lcudart_static --save-temps`, also still works.
Some comments, but otherwise LGTM
clang/include/clang/Interpreter/Interpreter.h | ||
---|---|---|
45 | and this should probably be run through clang-format... | |
47 | and this should probably be run through clang-format... | |
clang/lib/CodeGen/CodeGenModule.cpp | ||
6276 | This is about statements at the global scope that only concern the REPL; see https://reviews.llvm.org/D127284 for the original revision. Global variables, on the other hand, are passed via EmitTopLevelDecl -> EmitGlobal. | |
clang/lib/Interpreter/Interpreter.cpp | ||
150–151 | This comment should move as well | |
151–153 | This doesn't do what the comments claim: it appends at the end, it doesn't prepend. For that it would need to be `ClangArgv.insert(ClangArgv.begin(), "-c")`. @v.g.vassilev what do we want here? (probably doesn't block this revision, but it's odd nevertheless) | |
clang/lib/Interpreter/Offload.h | ||
36–37 ↗ | (On Diff #516061) | unused |
clang/lib/Interpreter/Interpreter.cpp | ||
---|---|---|
151–153 | Yeah, this forces the clang::Driver to have some sort of action. In turn, this helps produce diagnostics from the driver before failing. That's a known bug since the early days of clang that nobody addressed... |
Added a check to run CUDA tests only on systems with CUDA. We need some ideas for the actual tests.
- Rename Offload.cpp to DeviceOffload.cpp
- Other syntax/style fixes
Generally lgtm, let's extend the test coverage.
clang/lib/Interpreter/DeviceOffload.cpp | ||
---|---|---|
2 | Likewise. | |
clang/lib/Interpreter/DeviceOffload.h | ||
2 | We should probably update the name here as well and maybe drop CUDA? | |
clang/test/Interpreter/CUDA/sanity.cu | ||
11 | Let's extend the coverage with some more standard hello world examples. We can draw some inspiration from https://github.com/root-project/cling/tree/master/test/CUDADeviceCode |
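A sketch of what such a test could look like; the RUN line, the gating on CUDA availability, and the CHECK patterns are all illustrative, not the actual test content:

```cpp
// RUN: cat %s | clang-repl --cuda | FileCheck %s
extern "C" int printf(const char *, ...);
__global__ void set(int *p) { *p = 42; }
int *p; cudaMallocManaged((void **)&p, sizeof(int));
set<<<1, 1>>>(p);
cudaDeviceSynchronize();
printf("value: %d\n", *p);
// CHECK: value: 42
%quit
```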
- Add some basic CUDA functionality tests.
- Disallow undoing of the initial PTU. This should fix the undo command test.
clang/lib/CodeGen/ModuleBuilder.cpp | ||
---|---|---|
39 | If I understand the history correctly, the intentional copy here was to prevent some layering violation for what was called CompileOpts in 2009. I believe that is no longer the case; can you check if we can take a const reference here? |
clang/lib/CodeGen/ModuleBuilder.cpp | ||
---|---|---|
39 | I don't understand how the reference causes a layering violation, but if I change it to a const reference instead, the option-modification code becomes slightly less awkward and all tests seem to be fine. |
clang/lib/CodeGen/ModuleBuilder.cpp | ||
---|---|---|
39 | Let's try that then. |
Added some std::move fixes for Error -> Expected conversions
We need to figure out a solution for when the NVPTX backend is not enabled; clang-repl probably should not depend on that. Example: https://lab.llvm.org/buildbot#builders/175/builds/29764
We can do Triple::isNVPTX and then initialize the asm printer. @lhames could have a better idea.
Workaround for depending on NVPTX symbols: initialize all available targets instead. If NVPTX is not available, clang-repl will complain when we try to actually execute anything in CUDA mode.
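A minimal sketch of that workaround, assuming it runs during interpreter setup (the function name is illustrative):

```cpp
#include "llvm/Support/TargetSelect.h"

// Initialize every target this LLVM build provides instead of referring
// to NVPTX symbols directly, so clang-repl links even without NVPTX.
static void initializeAvailableTargets() {
  llvm::InitializeAllTargetInfos();
  llvm::InitializeAllTargets();
  llvm::InitializeAllTargetMCs();
  llvm::InitializeAllAsmPrinters();
  // If NVPTX was not built in, CUDA mode fails later with a diagnostic
  // when device code generation is attempted.
}
```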
Rebased and fixed conflicts on recent value printing related patches.