Page MenuHomePhabricator

CUDA ctor/dtor Module-Unique Symbol Name
AbandonedPublic

Authored by SimeonEhrig on Mar 13 2018, 8:55 AM.

Details

Summary

This allows multi-module / incremental compilation environments to have unique global CUDA constructor and destructor function names.

This feature is necessary for the cling (https://github.com/root-project/cling), which based on the clang. Cling is a C++-Interpreter (technically, it is a JIT with an interactive frontend – the using is really similar to the python interpreter) , which is developed by a team of the CERN. I want to add a new feature, which allows to interpreter CUDA-code, which is written with the Runtime-API.

This request address the follow problem. Compiling a cuda program with clang generates one llvm module per TU. Every llvm module has a cuda ctor and dtor (if a cuda fatbinary exist), with at least a function call to register the fatbinary. The ctor/dtor can also include function calls to register global functions and variables at runtime, depending on user's code.

In cling, we do not have finalized TU and instead it can be extended. The TU is extended with llvm modules as long as the governing cling instance is running. As we type a new line of code we generate a new llvm module which is added to the TU. Cling detects functions by the name. If the name (symbol) already exists it uses the existing translation. Otherwise it translates the function on first use (but it never translates twice).

If we iteratively (and iteractively) add new CUDA code to a governing cling instance, we have more than one cuda *module* ctor/dtor per TU. But the problem is that the *content* of every ctor/dtor function can be different. Unfurtunately we can not differenciate them by symbol name yet, since they all get the same name. So Cling will always use the translation of the first module.

In order to solve this problem, I added the module name (which is unique in cling) as a suffix to the cuda ctor/dtor function name as “_<ModuleName>” . For clang the ModuleName is by default the name of the input file – as in D34059 we escape its name for sanity. This means symbols will change with this patch (are ABI incompatible with previous releases). This solution is identical to the patch in https://reviews.llvm.org/D34059 – I just removed the file ending for brevity from the symbols.

Just for reference, a prototype of our CUDA JIT is available under:

https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter

In addition to that prototype, we added the functionality to cling itself and this is the only additional clang patch we need.

https://github.com/SimeonEhrig/cling/tree/cudaDeviceSide

Diff Detail

Event Timeline

SimeonEhrig created this revision.Mar 13 2018, 8:55 AM
jlebar added a reviewer: tra.Mar 13 2018, 9:00 AM
rjmccall added inline comments.Mar 13 2018, 9:55 AM
lib/CodeGen/CGCUDANV.cpp
281

Please explain in the comment *why* you're doing this. It's just for debugging, right? So that it's known which object file the constructor function comes from.

370

This doesn't actually seem more useful than the empty string.

unittests/CodeGen/IncrementalProcessingTest.cpp
178

"In CUDA incremental processing, a CUDA ctor or dtor will be generated for every statement if a fatbinary file exists."

change comment of the example function for TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc)

tra added inline comments.Mar 13 2018, 11:28 AM
lib/CodeGen/CGCUDANV.cpp
281

I'm also interested in in the motivation for this change.

Also, if the goal is to have an unique module identifier, would compiling two different files with the same name be a problem? If the goal is to help identifying a module, this may be OK, if not ideal. If you really need to have unique name, then you may need to do something more elaborate. NVCC appears to use some random number (or hash of something?) for that.

301–302

I'd rather not use '-' in a symbol. It's likely to end up being escaped in some way. '_' is a safer bet.

unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

I don't understand the comment. What is 'CUDA incremental processing' and what exactly is meant by 'statement' here? I'd appreciate if you could give me more details. My understanding is that ctor/dtor are generated once per TU. I suspect "incremental processing" may change that, but I have no idea what exactly does it do.

SimeonEhrig marked an inline comment as done.Mar 14 2018, 3:38 AM
SimeonEhrig added inline comments.
lib/CodeGen/CGCUDANV.cpp
281

The motivation is the same at this review: https://reviews.llvm.org/D34059
We try to enable incremental compiling of cuda runtime code, so we need unique ctor/dtor names, to handle the cuda device code over different modules.

281

We need this modification for our C++-interpreter Cling, which we want to expand to interpret CUDA runtime code. Effective, it's a jit, which read in line by line the program code. Every line get his own llvm::Module. The Interpreter works with incremental and lazy compilation. Because the lazy compilation, we needs this modification. In the CUDA mode, clang generates for every module an _ _cuda_module_ctor and _ _cuda_module_dtor, if the compiler was started with a path to a fatbinary file. But the ctor is also depend on the source code, which will translate to llvm IR in the module. For Example, if a _ _global_ _ kernel will defined, the CodeGen add the function call __cuda_register_globals() to the ctor. But the lazy compilations prevents, that we can translate a function, which is already translate. Without the modification, the interpreter things, that the ctor is always same and use the first translation of the function, which was generate. Therefore, it is impossible to add new kernels.

unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

A CUDA ctor/dtor will generates for every llvm::module. The TU can also composed of many modules. In our interpreter, we add new code to our AST with new modules at runtime.
The ctor/dtor generation is depend on the fatbinary code. The CodeGen checks, if a path to a fatbinary file is set. If it is, it generates an ctor with at least a __cudaRegisterFatBinary() function call. So, the generation is independent of the source code in the module and we can use every statement. A statement can be an expression, a declaration, a definition and so one.

rjmccall added inline comments.Mar 14 2018, 10:40 AM
lib/CodeGen/CGCUDANV.cpp
281

I'm not asking you to explain to *me* why you're doing this, I'm asking you to explain *in the comment* why you're doing this.

That said, we should discuss this. It sounds like you need the function to have a unique name because otherwise you're seeing inter-module conflicts between incremental slices. Since the function is emitted with internal linkage, I assume that those conflicts must be because you're promoting internal linkage to external in order to make incremental processing able to link to declarations from an earlier slice of the translation unit. I really think that a better solution would be to change how we assign LLVM linkage to static global declarations in IRGen — basically, recognizing the difference between internal linkage (where different parts of the translation unit can still refer to the same entity) and no linkage at all (where they cannot). We could then continue to emit truly private entities, like global ctors/dtors, lambda bodies, block functions, and so on, with internal/private linkage without worrying about how your pass will mess up the linkage later.

tra added inline comments.Mar 14 2018, 11:19 AM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

I still don't understand how it's going to work. Do you have some sort of design document outlining how the interpreter is going to work with CUDA?

The purpose of the ctor/dtor is to stitch together host-side kernel launch with the GPU-side kernel binary which resides in the GPU binary created by device-side compilation.

So, the question #1 -- if you pass GPU-side binary to the compiler, where did you get it? Normally it's the result of device-side compilation of the same TU. In your case it's not quite clear what exactly would that be, if you feed the source to the compiler incrementally. I.e. do you somehow recompile everything we've seen on device side so far for each new chunk of host-side source you feed to the compiler?

Next question is -- assuming that device side does have correct GPU-side binary, when do you call those ctors/dtors? JIT model does not quite fit the assumptions that drive regular CUDA compilation.

Let's consider this:

__global__ void foo();
__global__ void bar();

// If that's all we've  fed to compiler so far, we have no GPU code yet, so there 
// should be no fatbin file. If we do have it, what's in it?

void launch() {
  foo<<<1,1>>>();
  bar<<<1,1>>>();
}
// If you've generated ctors/dtors at this point they would be 
// useless as no GPU code exists in the preceding code.

__global__ void foo() {}
// Now we'd have some GPU code, but how can we need to retrofit it into 
// all the ctors/dtors we've generated before. 
__global__ void bar() {}
// Does bar end up in its own fatbinary? Or is it combined into a new 
// fatbin which contains both boo and bar?
// If it's a new fatbin, you somehow need to update existing ctors/dtors, 
// unless you want to leak CUDA resources fast.
// If it's a separate fatbin, then you will need to at the very least change the way 
// ctors/dtors are generated by the 'launch' function, because now they need to 
// tie each kernel launch to a different fatbin.

It looks to me that if you want to JIT CUDA code you will need to take over GPU-side kernel management.
ctors/dtors do that for full-TU compilation, but they rely on device-side code being compiled and available during host-side compilation. For JIT, the interpreter should be in charge of registering new kernels with the CUDA runtime and unregistering/unloading them when a kernel goes away. This makes ctors/dtors completely irrelevant.

v.g.vassilev added inline comments.Mar 14 2018, 2:35 PM
lib/CodeGen/CGCUDANV.cpp
281

@rjmccall, I agree. What's the best way to discuss this? My irc handle is vvassilev and I am in CET timezone. I will be online for approx. 2 hours from now on.

rjmccall added inline comments.Mar 14 2018, 3:39 PM
lib/CodeGen/CGCUDANV.cpp
281

Sorry, I seem to have missed you for today. I think for the next day it would be best to just trade e-mail, because I have errands to run in the morning and early afternoon tomorrow.

I think the major piece of the plan would be to make things like the computation of GVALinkage in ASTContext.cpp consider your incremental mode. Currently, basicGVALinkageForFunction and basicGVALinkageForVariable only consider isExternallyVisible(), which conflates no-linkage and internal-linkage; you would need to map internal linkage to GVA_StrongExternal when processing in incremental mode.

That alone might not be sufficient because there are things with no formal linkage that still do need to be shared across incremental slices; for example, anonymous structures at global scope. To get those things right, we will need to split NoLinkage by adding an InternalNoLinkage for declarations that formally have no linkage but in reality are visible throughout the translation unit; but that seems quite feasible.

We should get Richard's thoughts on that plan first, though.

rsmith added inline comments.Mar 14 2018, 4:13 PM
lib/CodeGen/CGCUDANV.cpp
281

I'll be busy with the C++ committee meeting for at least the rest of the week, sorry for any latency here.

The case where a type has no linkage and no stable mangling seems to also require persisting some kind of Decl -> IR module mangling (a mangle numbering context for the whole TU maybe?) in addition to linkage promotion. The linkage promotion itself seems reasonable to me, and doing it at the GVALinkage level seems consistent with how we handle some related cases (eg, Modules TS linkage promotion).

rjmccall added inline comments.Mar 14 2018, 6:33 PM
lib/CodeGen/CGCUDANV.cpp
281

I think we can probably punt on that issue for now, since it's, what, the linkage of an anonymous struct defined in a declaration that is not a typedef of the struct? In practice, as long as we still give the declared entity internal linkage, things will continue to work unless the user pulls that type out with typeof / decltype and starts declaring other entities or instantiating templates with it — which is to say, it's a problem only in corner cases of corner cases.

SimeonEhrig added inline comments.Mar 15 2018, 3:22 AM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

At the moment, there is no documentation, because we still develop the feature. I try to describe how it works.

The device side compilation works with a second compiler (a normal clang), which we start via syscall. In the interpreter, we check if the input line is a kernel definition or a kernel launch. Then we write the source code to a file and compile it with the clang to a PCH-file. Then the PCH-file will be compiled to PTX and then to a fatbin. If we add a new kernel, we will send the source code with the existing PCH-file to clang compiler. So we easy extend the AST and generate a PTX-file with all defined kernels.

An implementation of this feature can you see at my prototype: https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter

Running the ctor/dtor isn't hard. I search after the JITSymbol and generate an function pointer. Than I can simply run it. This feature can you also see in my prototype. So, we can run the ctor, if new fatbin code is generated and the dtor before, if code was already registered. The CUDA runtime also provide the possibility to run the (un)register functions many times.

__global__ void foo();
__global__ void bar();

//At this point, there is no fatbin file and it will no generated. 

void launch() {
  foo<<<1,1>>>();
  bar<<<1,1>>>();
}

// The definition of launch() is not possible at the direct input mode (type in line by line) in cling. 
// At this point, we need a definition of foo() and bar(). But there is a exception. 
// We have a function to read in a piece of code from file. This piece of code will translate in a single module. 


__global__ void foo() {}
__global__ void bar() {}

// In our case, we will compile this 8 lines of code in a single module in cling and send it  to the CUDA device JIT, too. 

// We have on file fatbinary file, which will extend with new kernels. The file have to unregistered and registered every time, if it will changed.
// When and which ctor/dtor have to run is managed by the interpreter.

I don't know, if I understand it right. Do you mean, we should implement the content of the ctor/dtor direct in our cling source code? For example, we call direct the __cudaRegisterFatBinary() function in the source code of cling after the generating of a new fatbin-file as opposed of calling __cuda_module_ctor, which we generated with JIT-backend of our interpreter.

tra added inline comments.Mar 15 2018, 10:24 AM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

Do I understand it correctly that every time you see new kernel definition, you'll recompile everything you've seen until this point? In the example above you'll do compilation twice first time after foo() and then again after bar(). Compilation after bar() will have GPU code for both foo() and bar(), at which point you'll call dtor, which will unregister foo() from the first compilation and will then call ctor from the new compilation, which will register both foo() and bar() from the new fatbin. If that's the case, it may work OK with a few caveats.

  • it's not clear whether that would leak resources. CUDA runtime API is undocumented, so I can't tell whether unregistering old kernels will release everything (e.g. I have no idea whether it unloads old kernel). Similarly with ctor, I don't know whether it allocates some resources every time it's called. All of that is not an issue when ctor/dtor is called once during app runtime, but I'd be very cautious about using that repeatedly.
  • recompiling/loading/unloading everything every time you parse new kernel is ~quadratically expensive. You probably not going to get too many kernels during any given session, so it may be OK for an interpreter. Still, it could be avoided. All you need (in theory) is to compile and register one new kernel. That's why I suggested for it to be done by the interpreter.

Just my $.02, mostly beyond the scope of this review.
BTW, XLA/GPU in TensorFlow, does have an implementation of jit-for-GPU. While it's not directly comparable with your project, it may have some useful ideas on dealing with GPU-side code.
Most of the interesting JIT bits are in cuda_driver.cc and gpu_compiler.cc

SimeonEhrig added inline comments.Mar 16 2018, 4:39 AM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

Thats the way, we do it with one exception. We do not recompile everything. We hold the AST of the device code in PCH-files. That should significant speed up the compilation of a new PTX file. We have to look, if this way is fast enough for our use cases. I think, we can also implement a lazy compilation for the device code, without great effort. But at the moment, we use the way, which you describe.

  • That's a good point. It's really a problem, that I've to solve it. At the moment, I've two Ideas. The first is the simple version. We run example programs on cling and look if they worked right and check the resources. The second idea is to reproduce the behavior of cling with static compiled programs, which are generated with the nvcc. For example:
// func.hpp
#ifndef FUNC_H
#define FUNC_H

void launch_func();

#endif


// main.cu
#include <iostream>
#include "func.hpp"

__global__ void kernel_main(){}

int main(int argc, char const *argv[])
{
      kernel<<<1,1>>>();
      std::cout << "main: " << cudaGetLastError() << std::endl;
      kernel_main();
      return 0;
}


#include <iostream>
#include "func.hpp"

__global__ void kernel_func(){}

void launch_func(){
       kernel_func<<<1,1>>>();
       std::cout << "func: " << cudaGetLastError() << std::endl;
}

If compile each .cu file to a own object file and then link it together, you can see, that __cudaRegisterFatBinary will runs twice at gdb. Also, nvcc use unique ctor names (use a hash function).
But, I have to discuss it with my colleagues, if it is a good solution.

  • Can you explain, why just one new kernel is enough?
// In this case, I understand it. We can put foo() and bar() together in 
// one compiling process.
__global__ void foo();
__global__ void bar();
foo<<<1,1>>>();
bar<<1,1>>>();


// But in case, I see no solution, to use just one new kernel.
// The process is interactive, so we have no knowledge of the follow lines.
__global__ void foo();
foo<<<1,1>>>();
__global__ void bar();
bar<<1,1>>>();

I will have a look at the TensorFlow but this needs some time.

tra added inline comments.Mar 19 2018, 1:13 PM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

Can you explain, why just one new kernel is enough?

If you can compile one kernel, you can load/unload it independently. You will have some redundancy if kernels share code, but, generally speaking, every kernel is a complete GPU-side program. If you codegen the kernel (and everything it needs), it's all you need to run it. Granted, if there's substantial code reuse, then compiling each kernel separately will have a lot of overhead. I don't know what will work best in your scenario, just pointing that compiling each kernel individually is an option, which could be useful in some cases.

SimeonEhrig added inline comments.Mar 20 2018, 10:24 AM
unittests/CodeGen/IncrementalProcessingTest.cpp
176–178

At first, thanks for the discussion about the concept of my interpreter and the ideas.

Yes, you are right. This optimization could speed up the interpreter and improve the stability of the cuda runtime. But I think it's really depend on the use case. Adding a new kernel to the PCH is also fast. Possibly, it's faster to add new source code to a fat PCH file and generate an PTX from it, than build a PTX from an small CUDA C++ source code file, every time.

But at the moment, I have to implement all features. To do this, it is just easier to implement my idea. Later, if all is working, I can measure the performance and find the bottlenecks and if it necessary, I want to implement your idea.

It's also possible, that I need the code selection for a function of the cling. At the moment, it is possible to load C++ code from a file and interpret it. After that, you can unload the code, modify and reload or define a function with the console input, which has the same name as a function of the file. It you would be great, if we also can do it with CUDA C++ code. Then their solution might be necessary.

SimeonEhrig retitled this revision from Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names to CUDA ctor/dtor Module-Unique Symbol Name.Apr 12 2018, 7:40 AM
SimeonEhrig edited the summary of this revision. (Show Details)
SimeonEhrig edited the summary of this revision. (Show Details)Apr 12 2018, 8:02 AM
SimeonEhrig edited the summary of this revision. (Show Details)Apr 18 2018, 6:00 AM

Thank you everyone for your review comments!

We addressed the inline comments and improved the description of the change set for clarity and context.
Tests are updated as well.

This now implements the same fix as previously received in D34059 but just for CUDA.

SimeonEhrig marked 2 inline comments as done.Apr 18 2018, 6:12 AM
SimeonEhrig added inline comments.
lib/CodeGen/CGCUDANV.cpp
370

We improved the implementation. If there is no module name, it will not append any suffix and the symbol is just '__cuda_module_ctor'.

SimeonEhrig marked an inline comment as done.Apr 19 2018, 5:55 AM
tra added a comment.EditedApr 19 2018, 10:21 AM

Could you please resubmit your latest patch with complete context?
https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

tra added inline comments.Apr 19 2018, 11:45 AM
lib/CodeGen/CGCUDANV.cpp
287

There is a general problem with this approach. File name can contain the characters that PTX does not allow.
We currently only deal with '.' and '@', but that's not enough here.
You may want to either mangle the name somehow to avoid/convert illegal characters or use some other way to provide unique suffix. Hex-encoded hash of the file name would avoid this problem, for example.

Add full context with -U999999 to diff.

SimeonEhrig added inline comments.Apr 20 2018, 1:31 AM
lib/CodeGen/CGCUDANV.cpp
287

Maybe I'm wrong but I think, that should be no problem, because the generating of a cuda ctor/dtor have nothing to do with the PTX generation.

The function 'makeModuleCtorFunction' should just generate llvm ir code for the host (e.g. x86_64).

If I'm wrong, could you tell me please, where in the source code the 'makeModuleCtorFunction' affect the PTX generation.

This does not address my review. My review is suggesting that we avoid this issue completely by fixing IRGen to use an external linkage for internal declarations in your emission mode. That would allow you to just emit the module ctors as truly internal in the first place, removing any need to mangle them.

Add a comment, which declares the need of a unique ctor/dotr name.

SimeonEhrig marked an inline comment as done.Apr 24 2018, 3:55 AM
tra added inline comments.Apr 24 2018, 10:41 AM
lib/CodeGen/CGCUDANV.cpp
287

You are correct that PTX is irrelevant here. I've completely missed that this will be generated for the host, which is more forgiving.

That said, I'm still not completely sure whether we're guaranteed that using arbitrary characters in a symbol name is OK on x86 and, potentially, other host platforms. As an experiment, try using a module which has a space in its name.

SimeonEhrig added inline comments.Apr 25 2018, 4:36 AM
lib/CodeGen/CGCUDANV.cpp
287

At line 295 and 380 in CGCUDANV.cpp I use a sanitizer function, which replace all symbols without [a-zA-Z0-9._] with a '_'. It's the same solution like in D34059. So I think, it would works in general.

Only for information. I tested it with a module name, which includes a whitespace and without the sanitizer. It works on Linux x86 and the ELF format. There was an whitespace in the symbol of the cuda module ctor (I checked it with readelf).

In general, do you think my solution approach is technically okay? Your answer will be really helpful for internal usage in our cling project. At the moment I developed the cling-cuda-interpreter based on this patch and it would helps a lot of, if I can say, that the patch doesn't cause any problem with the CUDA-environment.

tra added a comment.May 4 2018, 10:23 AM

Perhaps we should take a step back and consider whether this is the right approach to solve your problem.

If I understand it correctly, the real issue is that you repeatedly recompile the same module and cling will only use the function from the first module it's seen it in. Unlike regular functions that presumably remain the same in all the modules they are present in, CUDA constructors do change and you need cling to grab the one from the most recent module.

This patch deals with the issue by attempting to add a unique sufix. Presumably cling will then generate some sort of unique module name and will get unique constructor name in return. The down side of this approach is that module name is something that is derived from the file name and the functionality you're changing is in the shared code, so you need to make sure that whatever you implement makes sense for LLVM in general and that it does what it claims it does. AFAICT, LLVM has no pressing need for the unique constructor name -- it's a function with internal linkage and, if we ever need to generate more than one, LLVM is capable of generating unique names within the module all by itself. The patch currently does not fulfill the "unique" part either.

Perhaps you should consider a different approach which could handle the issue completely in cling. E.g. You could rename the constructor in the module's IR before passing it to JIT. Or you could rename it in PTX (it's just text after all) before passing it to driver or PTXAS.

lib/CodeGen/CGCUDANV.cpp
287

This still does not give you unique suffix which was stated at the main goal of this patch. E.g files "foo$bar" and "foo_bar" will produce identical names. See my previous comment regarding using a hash.

In D44435#1088019, @tra wrote:

Perhaps we should take a step back and consider whether this is the right approach to solve your problem.

If I understand it correctly, the real issue is that you repeatedly recompile the same module and cling will only use the function from the first module it's seen it in. Unlike regular functions that presumably remain the same in all the modules they are present in, CUDA constructors do change and you need cling to grab the one from the most recent module.

This patch deals with the issue by attempting to add a unique sufix. Presumably cling will then generate some sort of unique module name and will get unique constructor name in return. The down side of this approach is that module name is something that is derived from the file name and the functionality you're changing is in the shared code, so you need to make sure that whatever you implement makes sense for LLVM in general and that it does what it claims it does. AFAICT, LLVM has no pressing need for the unique constructor name -- it's a function with internal linkage and, if we ever need to generate more than one, LLVM is capable of generating unique names within the module all by itself. The patch currently does not fulfill the "unique" part either.

Perhaps you should consider a different approach which could handle the issue completely in cling. E.g. You could rename the constructor in the module's IR before passing it to JIT. Or you could rename it in PTX (it's just text after all) before passing it to driver or PTXAS.

You are right. The clang commit is not the best solution. So, we searched for another solution and found one. The solution is similar to your suggestion. We found a possibility to integrate a llvm module pass, which detects the symbols __cuda_module_ctor and __cuda_module_dtor and append the module name to the symbol, before the llvm IR will be generated. So, we were able to move the solution from clang to cling, which is better for both projects.

tra added a comment.May 8 2018, 10:29 AM

Great! Let's close this review then.
And good luck with cling.

Okay, I will close the request and thank you very much for your help and your hints.

SimeonEhrig abandoned this revision.May 9 2018, 12:36 AM