This is an archive of the discontinued LLVM Phabricator instance.

[clang-repl][CUDA] Initial interactive CUDA support for clang-repl
ClosedPublic

Authored by argentite on Mar 19 2023, 12:15 PM.

Details

Summary

CUDA support can be enabled in clang-repl with --cuda flag.
Device code linking is not yet supported. inline must be used with all
device functions.

Diff Detail

Event Timeline

argentite created this revision.Mar 19 2023, 12:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2023, 12:16 PM

Use full name of CUDA library

argentite updated this revision to Diff 510808.Apr 4 2023, 8:00 AM
argentite edited the summary of this revision. (Show Details)

Clear LinkModules on every interpreter iteration

argentite published this revision for review.Apr 4 2023, 8:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 8:09 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Thanks for working on this!

clang/lib/Interpreter/Offload.cpp
1 ↗(On Diff #510808)

How about DeviceOffload.cpp?

clang/tools/clang-repl/ClangRepl.cpp
155

To cover the case where platforms have no /tmp we could use fs::createTemporaryFile. However, some platforms have read-only file systems. What do we do there?

tra added a subscriber: tra.Apr 4 2023, 10:38 AM
tra added inline comments.
clang/tools/clang-repl/ClangRepl.cpp
27

Where will clang-repl find CUDA headers? Generally speaking --cuda-path is essential for CUDA compilation as it's fairly common for users to have more than one CUDA SDK versions installed, or to have them installed in a non-default location.

157

Is there any doc describing the big picture approach to CUDA REPL implementation and how all the pieces tie together?

From the patch I see that we will compile GPU side of the code to PTX, pack it into fatbinary, but it's not clear now do we get from there to actually launching the kernels. Loading libcudart.so here also does not appear to be tied to anything else. I do not see any direct API calls, and the host-side compilation appears to be done w.o passing the GPU binary to it, which would normally trigger generation of the glue code to register the kernels with CUDA runtime. I may be missing something, too.

I assume the gaps will be filled in in future patches, but I'm still curious about the overall plan.

v.g.vassilev added inline comments.Apr 4 2023, 10:47 AM
clang/tools/clang-repl/ClangRepl.cpp
157

Hi @tra, thanks for asking. Our reference implementation was done in Cling a while ago by @SimeonEhrig. One of his talks which I think describes well the big picture could be found here: https://compiler-research.org/meetings/#caas_04Mar2021

tra added a comment.Apr 4 2023, 11:19 AM

Initial interactive CUDA support for clang-repl

What should a user expect to be supported, functionality wise? I assume it should cover parsing and compilation. I'm not so sure about the execution. Should it be expected to actually launch kernels, or will that come in a future patch?

clang/tools/clang-repl/ClangRepl.cpp
157

Cling does ring the bell. The slides from the link above do look OK.

v.g.vassilev added inline comments.Apr 4 2023, 11:20 AM
clang/tools/clang-repl/ClangRepl.cpp
157

There is also a video.

SimeonEhrig added inline comments.Apr 4 2023, 12:10 PM
clang/tools/clang-repl/ClangRepl.cpp
155

Actual, we can avoid temporary files completely. The reason, why the fatbinary code is written to a file is the following code in the code generator of the CUDA runtime functions:

https://github.com/llvm/llvm-project/blob/d9d840cdaf51a9795930750d1b91d614a3849137/clang/lib/CodeGen/CGCUDANV.cpp#L722-L732

In the past, I avoided to change the code, because this was an extra Clang patch for Cling.

Maybe we can use the llvm virtualFileSystem: https://llvm.org/doxygen/classllvm_1_1vfs_1_1InMemoryFileSystem.html
But this is just an idea. I have no experience, if this is working for us.

Hahnfeld added inline comments.Apr 5 2023, 1:12 AM
clang/include/clang/Interpreter/Interpreter.h
64–72

If I understand the change correctly, the "old" create function on its own is not sufficient anymore to create a fully working CompilerInstance - should we make it private? (and unify the two Builder classes into one?)

Alternatively, we could keep the current function working and move the bulk of the work to an createInternal function. What do you think?

clang/lib/CodeGen/CodeGenAction.cpp
297

This looks like a change that has implications beyond support for CUDA. Would it make sense to break this out into a separate review, ie does this change something in the current setup?

clang/lib/Interpreter/IncrementalParser.cpp
197

Is it possible to move the static function here instead of forward declaring? Or does it depend on other functions in this file?

224–237

Should this be done as part of the "generic" IncrementalParser constructor, or only in the CUDA case by something else? Maybe Interpreter::createWithCUDA?

clang/lib/Interpreter/Interpreter.cpp
348–352

Just wondering about the order here: Do we have to parse the device side first? Does it make a difference for diagnostics? Maybe you can add a comment about the choice here...

clang/lib/Interpreter/Offload.cpp
1 ↗(On Diff #510808)

Or IncrementalCUDADeviceParser.cpp for the moment - not sure what other classes will be added in the future, and if they should be in the same TU.

90–91 ↗(On Diff #510808)

Is padding to 8 bytes a requirement for PTX? Maybe add a coment...

clang/lib/Interpreter/Offload.h
20 ↗(On Diff #510808)

unused

Initial interactive CUDA support for clang-repl

What should a user expect to be supported, functionality wise? I assume it should cover parsing and compilation. I'm not so sure about the execution. Should it be expected to actually launch kernels, or will that come in a future patch?

With this patch alone, we can launch kernels with the usual syntax. The __device__ functions need to be inline for now. We plan to automate that in the future.

clang/include/clang/Interpreter/Interpreter.h
64–72

Yes the old create should probably be private. I was also thinking we could merge IncrementalCudaCompilerBuilder with IncrementalCompilerBuilder and make it stateful with CUDA SDK path for example. Then we could do something like:

IncrementalCompilerBuilder Builder;
Builder.setCudaPath(...);
auto DeviceCI = Builder.createCudaDevice();
auto HostCI = Builder.createCudaHost();
clang/lib/CodeGen/CodeGenAction.cpp
297

It actually was a separate patch: D146388 Should I submit that for review?
It seems to be required because we call LinkInModules() once for every interpreter iteration.

clang/lib/Interpreter/IncrementalParser.cpp
197

We can probably move this. I just wanted to preserve the history,

224–237

That seems safer. Although I did not notice any side effects.

clang/lib/Interpreter/Interpreter.cpp
348–352

The fatbinary from the device side is used in the host pipeline.

clang/lib/Interpreter/Offload.cpp
1 ↗(On Diff #510808)

I wanted to avoid "CUDA" in case we use it later for HIP.

clang/tools/clang-repl/ClangRepl.cpp
157

I do not see any direct API calls, and the host-side compilation appears to be done w.o passing the GPU binary to it, which would normally trigger generation of the glue code to register the kernels with CUDA runtime.

We do pass the generated fatbinary to the host side. The device code compilation happens before host side.

v.g.vassilev added inline comments.Apr 5 2023, 1:49 AM
clang/lib/CodeGen/CodeGenAction.cpp
297

The problem with going for a separate change is that we cannot test it. Landing it without a test makes the history commit unclear. This patch (and the tests we add here) will at least indirectly test that change.

Hahnfeld added inline comments.Apr 5 2023, 1:52 AM
clang/lib/CodeGen/CodeGenAction.cpp
297

Ok, that's why I was asking if it changes something in the current setup (ie can be tested). Thanks for clarifying.

Except using an in-memory solution for generated fatbin code, the code looks good to me.

argentite marked 3 inline comments as done.

Combined IncrementalCompilerBuilder and IncrementalCudaCompilerBuilder
Added --cuda-path support
Use sys::fs::createTemporaryFile() instead of hardcoding the path
Other minor refactoring

I am planning to have the in-memory fat binary file as a separate patch

clang/lib/Interpreter/Offload.cpp
90–91 ↗(On Diff #510808)

This was actually in original Cling implementation but it does not seem to be required.

v.g.vassilev added a subscriber: dblaikie.

I am adding @dblaikie as he might have ideas how to test this patch.

argentite updated this revision to Diff 516061.Apr 22 2023, 5:49 AM

Use virtual file system to store CUDA fatbinaries in memory
Adapted Interpreter tests to use the CompilerBuilder

Generally, looks good to me. I'd like to wait for @Hahnfeld and @tra's feedback at least for another week before merging.

@dblaikie, I know that generally we do not want to run tests on the bots and that makes testing quite hard for this patch. Do you have a suggestion how to move forward here? In principle, we could have another option where we might has the JIT if it can execute code on the device if available.

clang/lib/Interpreter/Offload.cpp
1 ↗(On Diff #510808)

Was DeviceOffload.cpp not a better name for the file and its intent?

argentite added inline comments.Apr 23 2023, 3:33 AM
clang/lib/Interpreter/Offload.cpp
1 ↗(On Diff #510808)

Yeah that seems alright, I will change in the next revision.

tra added a comment.Apr 24 2023, 10:36 AM

lib/CodeGen changes look OK to me.

clang/lib/CodeGen/CodeGenModule.cpp
6276

Could you give me an example of what exactly we'll be skipping here?
Will it affect __device__ variables?

lib/CodeGen changes look OK to me.

I can confirm the code change in CodeGen works as expected. clang-repl does not generate temporary files anymore, if a CUDA kernel is compiled.

Compiling a simple CUDA application still working and saving the generated PTX and fatbin code via clang++ ../helloWorld.cu -o helloWorld -L/usr/local/cuda/lib64 -lcudart_static --save-temps is also still working.

Some comments, but otherwise LGTM

clang/include/clang/Interpreter/Interpreter.h
45

and this should probably be run through clang-format...

47
clang/lib/CodeGen/CodeGenModule.cpp
6276

This concerns statements at the global scope that only concern the REPL; see https://reviews.llvm.org/D127284 for the original revision. Global variables on the other hand are passed via EmitTopLevelDecl -> EmitGlobal.

clang/lib/Interpreter/Interpreter.cpp
150–151

This comment should move as well

151–153

This doesn't do what the comments claim - it appends at the end, not prepends. For that it would need to be ClangArgv.insert(ClangArgv.begin(), "-c"). @v.g.vassilev what do we want here? (probably doesn't block this revision, but it's odd nevertheless)

clang/lib/Interpreter/Offload.h
36–37 ↗(On Diff #516061)

unused

v.g.vassilev added inline comments.May 9 2023, 2:19 AM
clang/lib/Interpreter/Interpreter.cpp
151–153

Yeah, this forces the clang::Driver to have some sort of action. In turn, this helps produce diagnostics from the driver before failing. That's a known bug since the early days of clang that nobody addressed...

argentite updated this revision to Diff 520636.May 9 2023, 2:46 AM
argentite marked 3 inline comments as done.

Added a check to run CUDA tests only on systems with CUDA. We need some ideas for the actual tests.
Rename Offload.cpp to DeviceOffload.cpp
Other syntax/style fixes

Generally lgtm, let's extend the test coverage.

clang/lib/Interpreter/DeviceOffload.cpp
2

Likewise.

clang/lib/Interpreter/DeviceOffload.h
2

We should probably update the name here as well and maybe drop CUDA?

clang/test/Interpreter/CUDA/sanity.cu
11

Let's extend the coverage with some more standard hello world examples. We can draw some inspiration from https://github.com/root-project/cling/tree/master/test/CUDADeviceCode

argentite marked an inline comment as done.

Add some CUDA basic functionality tests.
Disallow undo-ing of the initial PTU. This should fix the undo command test.

v.g.vassilev added inline comments.May 12 2023, 1:54 PM
clang/lib/CodeGen/ModuleBuilder.cpp
39

IIUC history correctly, here the intentional copy was to prevent some layering violation for what was called in 2009 CompileOpts. I believe that is not the case, can you check if we can take a const reference here?

argentite added inline comments.May 13 2023, 11:09 AM
clang/lib/CodeGen/ModuleBuilder.cpp
39

I don't understand how the reference causes layering violation but if I change it to a const reference instead, the option modification code becomes slightly less awkward and all tests seem to be fine.

v.g.vassilev added inline comments.May 13 2023, 11:34 AM
clang/lib/CodeGen/ModuleBuilder.cpp
39

Let's try that then.

argentite updated this revision to Diff 522041.May 14 2023, 9:56 PM

Remove the copy of CodeGenOpts in CodeGeneratorImpl

This is looking good. Can you address my minor comments and run clang-format?

clang/lib/Interpreter/DeviceOffload.cpp
2

ping.

clang/lib/Interpreter/DeviceOffload.h
2

We should rename Offload.h to DeviceOffload.h.

Update the filenames

This revision is now accepted and ready to land.May 16 2023, 11:08 AM
This revision was landed with ongoing or failed builds.May 20 2023, 1:57 AM
This revision was automatically updated to reflect the committed changes.
argentite reopened this revision.May 20 2023, 2:40 AM
This revision is now accepted and ready to land.May 20 2023, 2:40 AM
argentite updated this revision to Diff 524006.May 20 2023, 2:41 AM

Added some std::move fixes for Error -> Expected conversions

We need to figure out a solution when NVPTX backend is not enabled. clang-repl probably should not depends on that. Example: https://lab.llvm.org/buildbot#builders/175/builds/29764

Added some std::move fixes for Error -> Expected conversions

We need to figure out a solution when NVPTX backend is not enabled. clang-repl probably should not depends on that. Example: https://lab.llvm.org/buildbot#builders/175/builds/29764

We can do Triple::isNVPTX and then initialize the asm printer. @lhames could have better idea.

argentite updated this revision to Diff 525572.May 25 2023, 6:16 AM

Workaround for depending on NVPTX symbols: initialize all available targets instead. If NVPTX is not available, it will complain when we try to actually execute anything in CUDA mode.
Rebased and fixed conflicts on recent value printing related patches.