This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Fix device stub name
ClosedPublic

Authored by yaxunl on Oct 7 2019, 7:45 AM.

Details

Summary

HIP emits a device stub function for each kernel in host code.

The HIP debugger requires device stub function to have a different unmangled name as the kernel.

Currently the name of the device stub function is the mangled name with a postfix .stub. However,
this does not work with the HIP debugger since the unmangled name is the same as the kernel.

This patch adds prefix __device__stub__ to the unmangled name of the device stub before mangling,
therefore the device stub function has a valid mangled name which is different than the device kernel
name. The device side kernel name is kept unchanged. kernels with extern "C" also gets the prefix added
to the corresponding device stub function.

Diff Detail

Event Timeline

yaxunl created this revision.Oct 7 2019, 7:45 AM
tra added a reviewer: rsmith.Oct 7 2019, 8:56 AM
tra added a comment.Oct 7 2019, 9:12 AM

Could you elaborate on how exactly current implementation does not work?

I would expect the kernel and the stub to be two distinct entities, as far as debugger is concerned. It does have enough information to track each independently (different address, .stub suffix, perhaps knowledge whether it's device or host code). Without the details, it looks to me that this is something that can and should be dealt with in the debugger. I've asked the same question in D63335 but I don't think I've got a good answer.

yaxunl added a comment.Oct 7 2019, 9:44 AM
In D68578#1697822, @tra wrote:

Could you elaborate on how exactly current implementation does not work?

I would expect the kernel and the stub to be two distinct entities, as far as debugger is concerned. It does have enough information to track each independently (different address, .stub suffix, perhaps knowledge whether it's device or host code). Without the details, it looks to me that this is something that can and should be dealt with in the debugger. I've asked the same question in D63335 but I don't think I've got a good answer.

HIP debugger is a branch of gdb and the changes to support HIP will be upstreamed. When users set break point on a kernel, they intend to set a break point on the real kernel, not the device stub function. The device stub function is only a compiler generated helper function to help launch the kernel. Therefore it should have a different name so that it does not interfere with the symbol resolution of the real kernel.

hliao added inline comments.Oct 7 2019, 9:48 AM
lib/CodeGen/CGCUDANV.cpp
235 ↗(On Diff #223598)

keeping the original assertion in HIP is still valuable to capture naming mismatch issue for unnamed types

tra added a comment.Oct 7 2019, 10:30 AM
In D68578#1697822, @tra wrote:

Could you elaborate on how exactly current implementation does not work?

I would expect the kernel and the stub to be two distinct entities, as far as debugger is concerned. It does have enough information to track each independently (different address, .stub suffix, perhaps knowledge whether it's device or host code). Without the details, it looks to me that this is something that can and should be dealt with in the debugger. I've asked the same question in D63335 but I don't think I've got a good answer.

HIP debugger is a branch of gdb and the changes to support HIP will be upstreamed. When users set break point on a kernel, they intend to set a break point on the real kernel, not the device stub function. The device stub function is only a compiler generated helper function to help launch the kernel. Therefore it should have a different name so that it does not interfere with the symbol resolution of the real kernel.

I would agree that having distinct names for the device-side kernel and it's host-side stub would probably make things easier for debugger.
However, debugger does have access to mangled names and does see the '.stub' suffix in the mangled name. I don't understand why it can't be considered to disambiguate between the kernel and the stub?
I'm clearly missing something here. Is there a chance to get someone from the debugger team to chime in on this review directly?

Also, I would not agree that they intend to set a break point on the real kernel is the only scenario. E.g. quite often when I debug CUDA stuff, I do only care about host-side things and I do want to set breakpoint on the stub, so I can check kernel call parameters as they are passed to the kernel. It would be great if there were a way to explicitly tell debugger whether we want host-side stub or the kernel without having user to know how particular compiler transforms the name. For the user both entities have the same name, but distinct location and there should be a way to express that in the debugger.

t-tye added a subscriber: t-tye.Oct 7 2019, 8:48 PM
In D68578#1697898, @tra wrote:
In D68578#1697822, @tra wrote:

Could you elaborate on how exactly current implementation does not work?

I would expect the kernel and the stub to be two distinct entities, as far as debugger is concerned. It does have enough information to track each independently (different address, .stub suffix, perhaps knowledge whether it's device or host code). Without the details, it looks to me that this is something that can and should be dealt with in the debugger. I've asked the same question in D63335 but I don't think I've got a good answer.

HIP debugger is a branch of gdb and the changes to support HIP will be upstreamed. When users set break point on a kernel, they intend to set a break point on the real kernel, not the device stub function. The device stub function is only a compiler generated helper function to help launch the kernel. Therefore it should have a different name so that it does not interfere with the symbol resolution of the real kernel.

I would agree that having distinct names for the device-side kernel and it's host-side stub would probably make things easier for debugger.
However, debugger does have access to mangled names and does see the '.stub' suffix in the mangled name. I don't understand why it can't be considered to disambiguate between the kernel and the stub?
I'm clearly missing something here. Is there a chance to get someone from the debugger team to chime in on this review directly?

Also, I would not agree that they intend to set a break point on the real kernel is the only scenario. E.g. quite often when I debug CUDA stuff, I do only care about host-side things and I do want to set breakpoint on the stub, so I can check kernel call parameters as they are passed to the kernel. It would be great if there were a way to explicitly tell debugger whether we want host-side stub or the kernel without having user to know how particular compiler transforms the name. For the user both entities have the same name, but distinct location and there should be a way to express that in the debugger.

From a source language point of view, the device function comprises the code that is launched as a grid. We need this fact to be present in the symbols used. Only the device function should have a symbol name matching the mangled name of the device function. It the device function has both a host and device implementation then both can have the source language function name for the symbol since both actually implement the device function. If the user asks to set a breakpoint in the device function then the debugger would set in both implementations so the user is notified when the source program executes the device function, regardless of which implementation is invoked. This is similar to the debugger setting a breakpoint in a function that is inlined into multiple places: the debugger sets breeakpoints in all the inlined places so the user can tstill think of the program debugging in terms of the source language semantics.

In contrast, the stub is effectively part of the implementation of actually launching the device function. It should have a distinct name. The debugger can still be used to set a breakpoint in it, or to step into it. But that should be done in terms of the stub name. If the debugger wants to support source language specific intelligence it can provide a helper library that understands the stub names. This helper library (similar to the thread helper library) can be used by the debugger to present a cleaner language view to the user. In fact OpenMP has also done this and provides a helper library called OMPD that can be used by tools such as a debugger to hide OpenMP trampoline functions etc.

I am a little unclear what this patch is doing as it is mentioned that the mangled name has a _stub in it. My understanding is that the intention was to create a distinct unmangled name for the stub, and then mangle it so that the resulting symbol was a legal mangled name. It sounded like this was the preferred approach, and makes sense to me based on my current understanding. Am I understanding this correctly?

yaxunl added a comment.Oct 8 2019, 8:48 AM

I am a little unclear what this patch is doing as it is mentioned that the mangled name has a _stub in it. My understanding is that the intention was to create a distinct unmangled name for the stub, and then mangle it so that the resulting symbol was a legal mangled name. It sounded like this was the preferred approach, and makes sense to me based on my current understanding. Am I understanding this correctly?

Yes this patch does this.

tra added a comment.Oct 8 2019, 4:33 PM

From a source language point of view, the device function comprises the code that is launched as a grid. We need this fact to be present in the symbols used. Only the device function should have a symbol name matching the mangled name of the device function.

What do you have in mind when you use 'symbol name' here? Is that a symbol as seen by linker? If that's the case, do host and device share this name space on AMD GPUs? In case of CUDA, linker symbols are per-target (i.e. host and each GPU have their own spaces), so they never clash, but the kernel names must have identical mangled name on host and all devices, so the host can refer to the device-side kernel when it needs to launch it.

It the device function has both a host and device implementation then both can have the source language function name for the symbol since both actually implement the device function. If the user asks to set a breakpoint in the device function then the debugger would set in both implementations so the user is notified when the source program executes the device function, regardless of which implementation is invoked. This is similar to the debugger setting a breakpoint in a function that is inlined into multiple places: the debugger sets breeakpoints in all the inlined places so the user can tstill think of the program debugging in terms of the source language semantics.

OK. This sounds like __host__/__device__ function overloads and what you're saying does make sense for that.

In contrast, the stub is effectively part of the implementation of actually launching the device function. It should have a distinct name.

I'm not sure how the requirement of distinct name follows from the fact that the stub is the host-side part of the device-side kernel? To me it looks like an argument for them to have the same name so it's clear that they are both part of the same function as written in the source.

The don't have to be different. CUDA (and HIP) does not allow overloading of kernels, so the stub and the kernel can have identical names as in the example of __host__ and __device__ overloads you've described above, only now it's __host__ stub + __global__ kernel itself, instead of two user-implemented functions. Debugger, of course, will need to know about that to pick the stub or kernel as the breakpoint location, but that appears doable.

The debugger can still be used to set a breakpoint in it, or to step into it. But that should be done in terms of the stub name. If the debugger wants to support source language specific intelligence it can provide a helper library that understands the stub names. This helper library (similar to the thread helper library) can be used by the debugger to present a cleaner language view to the user. In fact OpenMP has also done this and provides a helper library called OMPD that can be used by tools such as a debugger to hide OpenMP trampoline functions etc.

Do I understand it correctly that giving the stub distinct name would effectively get it out of the way when a breakpoint is set on the kernel? I.e. it's essentially a work around the fact that debugger may not have convenient way to specify "set breakpoint on this name in device code only". Perhaps it would make sense to prove this ability as it sounds quite useful. I.e I may want to set breakpoint on all inlined host/device functions, but only on device side. That would be handy.

What happens if the stub and the kernel do have identical names?
My understanding, based on your comments above is that debugger does know about host and device 'spaces' and that it can find pointers to both host and device functions and set appropriate breakpoints for both. In this case it would normally attempt to set breakpoint on both the stub and the kernel as it would in case of __host__/__device__ overloads you've described above. In case of stub/kernel, we would want the breakpoint set only on the kernel itself. Given that debugger does have ability to tell host and device functions/symbols apart, the difficulty is really in being able to tell a real host function from the stub, so we can skip it.

Is that indeed what we want/need? Is there something else?

Does debugger know that device-side function is a kernel? In case of CUDA, the kernels are distinct from regular device-side functions. I don't know whether that's the case for AMDGPU.
If debugger can tell that particular device function is a kernel, that can be used to infer that the matching host-side symbol is a stub and skip setting a breakpoint on it.
If that does not work, debugger presumably has access to the mangled symbols for the potential breakpoint locations. The stub currently has distinct .stub suffix. This can also be used to tell it apart from a regular __host__ function.

I do not see how changing the source-level name for the stub is going to change things in principle. It's just yet another way to disambiguate a real __host__ function from a host stub we generate for the kernel.
Is there anything else about the stubs that requires them to have a name different from the kernel?

I am a little unclear what this patch is doing as it is mentioned that the mangled name has a _stub in it.

Currently the mangled name has .stub suffix which is discarded during unmangling, so unmangled names for the stub and the kernel end up being identical. I'm trying to figure out why is it a problem to be fixed in the compiler.

My understanding is that the intention was to create a distinct unmangled name for the stub, and then mangle it so that the resulting symbol was a legal mangled name. It sounded like this was the preferred approach, and makes sense to me based on my current understanding. Am I understanding this correctly?

This patch proposes changing the source-level name for the stub. Unfortunately the way it attempt to implement it is by doing the renaming during mangling phase itself. This appears to be the wrong place to change source-level name.
Before figuring out what would be the right thing to do here, I want to understand why we're doing it. I appreciate your description of what drives this requirement. I think I have petter idea of it now, but I still have some questions. Please bear with me.

t-tye added a comment.Oct 8 2019, 7:41 PM
In D68578#1700652, @tra wrote:

From a source language point of view, the device function comprises the code that is launched as a grid. We need this fact to be present in the symbols used. Only the device function should have a symbol name matching the mangled name of the device function.

What do you have in mind when you use 'symbol name' here? Is that a symbol as seen by linker? If that's the case, do host and device share this name space on AMD GPUs? In case of CUDA, linker symbols are per-target (i.e. host and each GPU have their own spaces), so they never clash, but the kernel names must have identical mangled name on host and all devices, so the host can refer to the device-side kernel when it needs to launch it.

We want to support a heterogeneous gdb debugger for a single source programming language. We would like to follow the same conventions used by compilers that implement other languages supported by gdb. The debugger can use symbols to find functions. It supports unmangling them and using the unmangled name to indicate the source language function it corresponds to. We would like this to remain true. The stub is not the kernel function, it is a helper function that will launch the kernel. In many ways it is acting like other trampolines. Therefore, it should be named as a internal helper function. The debugger can chose what it wants to do with it, but it does not want to be confused into thinking it actually IS the kernel function. If the user sets a breakpoint in the code of the kernel function then that breakpoint should be hit by every instance of the kernel that is created by the dispatch. It should not be hit by the code that is initiatig the dispatch. If that is what the user wanted they would set a breakpoint at the statement that performs the dispatch launch.

Whether the kernel is present in the CPU or GPU code is s separate concept. If it is present in both, then both would have the same symbol as they are both implementing the kernel. The debugger would set a breakpoint in both as from a language execution model poit of view if either piece of code executes it corresponds to the same source language kernel.

It the device function has both a host and device implementation then both can have the source language function name for the symbol since both actually implement the device function. If the user asks to set a breakpoint in the device function then the debugger would set in both implementations so the user is notified when the source program executes the device function, regardless of which implementation is invoked. This is similar to the debugger setting a breakpoint in a function that is inlined into multiple places: the debugger sets breeakpoints in all the inlined places so the user can tstill think of the program debugging in terms of the source language semantics.

OK. This sounds like __host__/__device__ function overloads and what you're saying does make sense for that.

Right. Well its not really and overload, not a request to have instances of the kernel avaiable for either the CPU or GPU to execute. They are the same function, not different overloads.

In contrast, the stub is effectively part of the implementation of actually launching the device function. It should have a distinct name.

I'm not sure how the requirement of distinct name follows from the fact that the stub is the host-side part of the device-side kernel? To me it looks like an argument for them to have the same name so it's clear that they are both part of the same function as written in the source.

The don't have to be different. CUDA (and HIP) does not allow overloading of kernels, so the stub and the kernel can have identical names as in the example of __host__ and __device__ overloads you've described above, only now it's __host__ stub + __global__ kernel itself, instead of two user-implemented functions. Debugger, of course, will need to know about that to pick the stub or kernel as the breakpoint location, but that appears doable.

As mentioned, the stub is not the host side part of the device side kernel. The stub is a means to launch the kernel. That launching could happen on the device (device enqueue), or on the host. The kernel itself could execute on the device or the host. There is the act of launching the kernel (the function call statement if you will), and the kernel instances that come into existence (the threads created to execute the body of the kernel according to the launch bounds presented at the launch statement).

The user may want to set a breakpoint at the launch statement, or in the body of the kernel. The language execution model treats those separately. The standard debugger expects the symbols to reflect the language constructs. Hence wanting the launch stub (which is compiler generated and not user written) to be distinct from the kernel body. If the compiler decided not to use a launch stub function (perhaps it is launching a kernel that will execute on the CPU and so does not need a helper function) then that is its choice. It is desirable that the debugger does not have to know about the choices made by the compiler. It simply wants to know that a symbol that appears to be a user source language construct is in fact exactly that. It does not have to do any compiler specific filtering.

When the debugger has DWARF available, it can use that to get a more accurate picture, but gdb does have the ability to fall back on just symbols, and it is that functionality we would like to preserve in the same manner as is done for other languages.

The debugger can still be used to set a breakpoint in it, or to step into it. But that should be done in terms of the stub name. If the debugger wants to support source language specific intelligence it can provide a helper library that understands the stub names. This helper library (similar to the thread helper library) can be used by the debugger to present a cleaner language view to the user. In fact OpenMP has also done this and provides a helper library called OMPD that can be used by tools such as a debugger to hide OpenMP trampoline functions etc.

Do I understand it correctly that giving the stub distinct name would effectively get it out of the way when a breakpoint is set on the kernel? I.e. it's essentially a work around the fact that debugger may not have convenient way to specify "set breakpoint on this name in device code only". Perhaps it would make sense to prove this ability as it sounds quite useful. I.e I may want to set breakpoint on all inlined host/device functions, but only on device side. That would be handy.

It is not really a work around. It is making the symbols reflect the reality of the source language program. The debugger can then simply trust that information and use it as gdb does for other languages.

Features such only break on this inlining of a function may be useful, but gdb does not currenty support that. Similarly, it does not support breakpoints based on the architecture. That could be simulated by having a conditional breakpoint that continues if the current thread architecture does not equal the chosen architecture. If such a feature were widely used it could be accelerated by adding architecture conditional breakpoints. I can add that to our list of suggestions.

What happens if the stub and the kernel do have identical names?

The stub is compiler generated so should never have a name that can collide with a user name.

My understanding, based on your comments above is that debugger does know about host and device 'spaces' and that it can find pointers to both host and device functions and set appropriate breakpoints for both. In this case it would normally attempt to set breakpoint on both the stub and the kernel as it would in case of __host__/__device__ overloads you've described above. In case of stub/kernel, we would want the breakpoint set only on the kernel itself. Given that debugger does have ability to tell host and device functions/symbols apart, the difficulty is really in being able to tell a real host function from the stub, so we can skip it.

Is that indeed what we want/need? Is there something else?

As mentioned above. The desire is to have the compiler generate information in a standard way so the debugger can consume it in a standard way. If the user wants to set a breakpoint in the kernel, the compiler information should only lead the debugger to setting a breakpoint in the kernel, not in some other function that is used in the implementation of launching kernels. The user expects a kernel breakpoint to only be hit by the threads that execute the kernel instances created by the launch as that is how the language is defined. The debugger simply keeps track of what code objects are loaded, and what symbols they contain. It does not need to know the distinction between host and device code to implement the basic debugger functionality.

Does debugger know that device-side function is a kernel? In case of CUDA, the kernels are distinct from regular device-side functions. I don't know whether that's the case for AMDGPU.
If debugger can tell that particular device function is a kernel, that can be used to infer that the matching host-side symbol is a stub and skip setting a breakpoint on it.
If that does not work, debugger presumably has access to the mangled symbols for the potential breakpoint locations. The stub currently has distinct .stub suffix. This can also be used to tell it apart from a regular __host__ function.

The debugger does not have to care if the symbol is for a kernel or a function, it will simply plonk a breakpoint in the code that corresponds to the symbol and report when it is hit. If the symbols follow the conventions used by other languages then the debugger does not have to do anything special to support a source language that happens to be executing on multiple devices.

I do not see how changing the source-level name for the stub is going to change things in principle. It's just yet another way to disambiguate a real __host__ function from a host stub we generate for the kernel.
Is there anything else about the stubs that requires them to have a name different from the kernel?

The stub is not a source level function, it is a compiler generated function, and the desire is that it be named so as not to conflict with actual source level constructs as is done for other compiler generated entries.

I am a little unclear what this patch is doing as it is mentioned that the mangled name has a _stub in it.

Currently the mangled name has .stub suffix which is discarded during unmangling, so unmangled names for the stub and the kernel end up being identical. I'm trying to figure out why is it a problem to be fixed in the compiler.

My understanding was that an earlier review rejected adding the suffix to the mangled name as it broke unmangling. It also does not seem the right thing to do as it does not follow the convention for oter compiler generate symbols.

My understanding is that the intention was to create a distinct unmangled name for the stub, and then mangle it so that the resulting symbol was a legal mangled name. It sounded like this was the preferred approach, and makes sense to me based on my current understanding. Am I understanding this correctly?

This patch proposes changing the source-level name for the stub. Unfortunately the way it attempt to implement it is by doing the renaming during mangling phase itself. This appears to be the wrong place to change source-level name.

What do you think is the right place to do it?

Before figuring out what would be the right thing to do here, I want to understand why we're doing it. I appreciate your description of what drives this requirement. I think I have petter idea of it now, but I still have some questions. Please bear with me.

That makes complete sense. The desire is to have the debugger treat heterogeneous single source debugging in the same way as traditional CPU debugging. That the user experience is basically the same. By following the same conventions used for the CPU in the GPU, and implementing similar runtime controls, it allows a common debugger code base to support both with minimal change and ensure a consistent user experience. We would like to avoid adding special treatment to support the GPU in the debugger if following the existing conventions/standards will allow the existing code to simply work.

Hopefully the above responses help describe the motivation for this. If not let me know and thanks for taking the time to review.

In D68578#1700652, @tra wrote:

This patch proposes changing the source-level name for the stub. Unfortunately the way it attempt to implement it is by doing the renaming during mangling phase itself. This appears to be the wrong place to change source-level name.

A specific difficulty here is that we need not only get the mangled kernel stub name, but also get the mangled kernel name. However, there is only one FuncDecl for the kernel. If we change the name of the FuncDecl to the stub name to be different from the kernel name, then we cannot get the mangled name for the kernel. That's why this patch does not change FuncDecl but let the mangler mangle it in two different ways. An alternative approach would be to create two FuncDecls, one for stub, one for kernel, and keep a map from the stub to the kernel. In this way we do not need to change the mangler.

tra added a comment.Nov 6 2019, 2:54 PM

Apologies for the delay with my response.

In D68578#1700652, @tra wrote:

From a source language point of view, the device function comprises the code that is launched as a grid. We need this fact to be present in the symbols used. Only the device function should have a symbol name matching the mangled name of the device function.

What do you have in mind when you use 'symbol name' here? Is that a symbol as seen by linker? If that's the case, do host and device share this name space on AMD GPUs? In case of CUDA, linker symbols are per-target (i.e. host and each GPU have their own spaces), so they never clash, but the kernel names must have identical mangled name on host and all devices, so the host can refer to the device-side kernel when it needs to launch it.

We want to support a heterogeneous gdb debugger for a single source programming language. We would like to follow the same conventions used by compilers that implement other languages supported by gdb. The debugger can use symbols to find functions. It supports unmangling them and using the unmangled name to indicate the source language function it corresponds to. We would like this to remain true. The stub is not the kernel function, it is a helper function that will launch the kernel. In many ways it is acting like other trampolines. Therefore, it should be named as a internal helper function. The debugger can chose what it wants to do with it, but it does not want to be confused into thinking it actually IS the kernel function. If the user sets a breakpoint in the code of the kernel function then that breakpoint should be hit by every instance of the kernel that is created by the dispatch. It should not be hit by the code that is initiatig the dispatch. If that is what the user wanted they would set a breakpoint at the statement that performs the dispatch launch.

Whether the kernel is present in the CPU or GPU code is s separate concept. If it is present in both, then both would have the same symbol as they are both implementing the kernel. The debugger would set a breakpoint in both as from a language execution model poit of view if either piece of code executes it corresponds to the same source language kernel.

Thank you for the details.

It the device function has both a host and device implementation then both can have the source language function name for the symbol since both actually implement the device function. If the user asks to set a breakpoint in the device function then the debugger would set in both implementations so the user is notified when the source program executes the device function, regardless of which implementation is invoked. This is similar to the debugger setting a breakpoint in a function that is inlined into multiple places: the debugger sets breeakpoints in all the inlined places so the user can tstill think of the program debugging in terms of the source language semantics.

OK. This sounds like __host__/__device__ function overloads and what you're saying does make sense for that.

Right. Well its not really and overload, not a request to have instances of the kernel avaiable for either the CPU or GPU to execute. They are the same function, not different overloads.

In case of cuda they may be overloads -- there may be two functions with identical signatures (modulo __host__/__device__ attributes) or multiple functions with the same names but different signatures with different __host__/__device__ attributes. It does not change things in principle. I'm just pointing out that CUDA (and thus HIP) as implemented in clang uses target attributes as another dimension in the space of functions with the same name.

The debugger can still be used to set a breakpoint in it, or to step into it. But that should be done in terms of the stub name. If the debugger wants to support source language specific intelligence it can provide a helper library that understands the stub names. This helper library (similar to the thread helper library) can be used by the debugger to present a cleaner language view to the user. In fact OpenMP has also done this and provides a helper library called OMPD that can be used by tools such as a debugger to hide OpenMP trampoline functions etc.

Do I understand it correctly that giving the stub distinct name would effectively get it out of the way when a breakpoint is set on the kernel? I.e. it's essentially a work around the fact that debugger may not have convenient way to specify "set breakpoint on this name in device code only". Perhaps it would make sense to prove this ability as it sounds quite useful. I.e I may want to set breakpoint on all inlined host/device functions, but only on device side. That would be handy.

It is not really a work around. It is making the symbols reflect the reality of the source language program. The debugger can then simply trust that information and use it as gdb does for other languages.

It still seems to boil down to "the stub should not get in the way of debugger accessing the function itself", but I see your point and agree that it would be useful if the stub could be an entity separate from the function itself.

Now we need to figure out what would be the best way to implement it.

Clang uses the real function in AST to generate the IR for the stub and, because of that, the stub ends up using function's name.
Actually, the situation is a bit worse than that. Clang implicitly relies on __host__ and __device__ entities not being codegen'ed at the same time, so we don't have to care about name conflicts.
Your description above indicates that the assumption is somewhat optimistic and that's what really causes the issue here.

I think @yaxunl's suggestion that we may need different FuncDecl's would be a good way forward.
I suspect we may already have places where clang deals with compiler-generated functions, so we should have existing examples of how it could be done.

Distinguishing between multiple symbols associated with the same source-level declaration is the purpose of the GlobalDecl abstraction.

Distinguishing between multiple symbols associated with the same source-level declaration is the purpose of the GlobalDecl abstraction.

It seems GlobalDecl is just a wrapper for concrete Decl's

https://github.com/llvm/llvm-project/blob/31817731167135870259ef1e7387746345b96a2f/clang/include/clang/AST/GlobalDecl.h#L40

Here we need to get the mangled name of a kernel and the mangled name of the same kernel but with a prefix before mangling.

Can I use GlobalDecl with the same FunctionDecl* but different multi-version index to indicate it is a kernel or a stub, then let the mangler mangle them differently?

Distinguishing between multiple symbols associated with the same source-level declaration is the purpose of the GlobalDecl abstraction.

It seems GlobalDecl is just a wrapper for concrete Decl's

It's a Decl plus a discriminator which is required for certain kinds of declaration. See e.g. GlobalDecl(const CXXConstructorDecl *D, CXXCtorType Type). GlobalDecl asserts if you try to construct it using GlobalDecl(FunctionDecl*) with a constructor/destructor declaration; we could similarly make that forbid construction with a kernel and then require code to use a GlobalDecl constructor that passes down whether it's the kernel or the stub that's being requested.

John.

Distinguishing between multiple symbols associated with the same source-level declaration is the purpose of the GlobalDecl abstraction.

It seems GlobalDecl is just a wrapper for concrete Decl's

It's a Decl plus a discriminator which is required for certain kinds of declaration. See e.g. GlobalDecl(const CXXConstructorDecl *D, CXXCtorType Type). GlobalDecl asserts if you try to construct it using GlobalDecl(FunctionDecl*) with a constructor/destructor declaration; we could similarly make that forbid construction with a kernel and then require code to use a GlobalDecl constructor that passes down whether it's the kernel or the stub that's being requested.

John.

In host compilation, we do not need to differentiate device function or stub function except for the mangler. Currently the mangler does not know about GlobalDecl. If we let the mangler to mangle a function based on whether it is a GlobalDecl or FunctionDecl, we still need to modify the mangler, and the change will be similar to the current approach.

There are a number of places in IRGen that pass around GlobalDecls with the expectation that that's sufficient to uniquely identify a symbol. The fact that IRGen breaks down the GD at the last second before passing it to the mangler, rather than passing it to the mangler and letting the mangler decide what to do with it, doesn't really change anything and is arguably poor code design anyway. Inventing a second declaration, or trying to propagate a flag outside of GD, is just fighting the architecture for no good reason.

yaxunl updated this revision to Diff 229137.Nov 13 2019, 11:19 AM

Attempt to prefix the kernel stub name on the fly.

If we do not want to create two Decl's during parsing, and do not want to change the mangler, it seems the least invasive way to get the prefixed kernel name is to change it on the fly then change it back.

tra added inline comments.Nov 13 2019, 11:53 AM
clang/lib/CodeGen/CodeGenModule.cpp
1128–1137

On one hand I like this patch variant much better than the one that changed the mangling itself.
On the other hand this code appears to reply on implementation details. I.e. we're setting new name on FD which may or may not be the same as ND, but we're always passing ND to getMangledNameImpl().

Perhaps we could implement name-tweaking as another MultiVersionKind which we already plumb into getMangledNameImpl() and which allows changing the name for target attributes & features.

yaxunl marked an inline comment as done.Nov 14 2019, 9:02 AM
yaxunl added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
1128–1137

The mangled name of an instantiated template function does not depends on its own name, but on the template. If we do not want to depend on this implementation detail, it seems I have to clone the template and instantiate from the clone.

MultiVersion does not help us here since it only appends .postfix to mangled name. The obstacle we are facing is how to change the unmangled name.

tra added inline comments.Nov 14 2019, 9:36 AM
clang/lib/CodeGen/CodeGenModule.cpp
1128–1137

The mangled name of an instantiated template function does not depends on its own name, but on the template. If we do not want to depend on this implementation detail, it seems I have to clone the template and instantiate from the clone.

That would be putting more effort into working around the fact that getMangledNameImpl() doesa not provide a good API to change the name the way you need to. *That's* what needs to be addressed, IMO.

MultiVersion does not help us here since it only appends .postfix to mangled name. The obstacle we are facing is how to change the unmangled name.

*Some* existing implementations append to the mangled name, but we can do other manipulations there, too. The string with the mangled name originates in getMangledNameImpl and we could do more than just append to it. We do not have to use the MultiVersion for that, either. E.g. we prepend __regcall3__ to the names of functions with CC_X86RegCall calling convention. We could do something similar for the kernel stub, I wonder if we could just generate a unique name and be done with that?

Hmm. Unique name probably would not do if, let's say, a kernel is defined in one TU, but we need to call it from another TU. So, whichever way we change the name of the stub, it will need to be the same everywhere.
You may want to add a test verifying that launching of declaration-only kernels uses the right name.

Consistency of name mangling means that we do need to include regular C++-mangled information. Which means we need to do the name tweaking deeper down. How about using calling conventions? It's been suggested in the past that a lot of shenanigans around kernel launches could/should be done as a different calling convention. One of the things affected by the calling convention is mangling and we can add prefix there: https://github.com/llvm/llvm-project/blob/master/clang/lib/AST/Mangle.cpp#L164

We could tag host-side kernel with 'kernel call' calling convention on the host side and then plumb prefixing to be done similar to __regcall3__.

If that works that may be a useful improvement overall. For instance, we may no longer need to stash a it's a kernel flag among attributes and it would probably be useful for other things (e.g enforcing address space requirements for kernel pointer arguments).

yaxunl added inline comments.Nov 14 2019, 2:55 PM
clang/lib/CodeGen/CodeGenModule.cpp
1128–1137

will add a test for decl only kernel. At least for the current implementation I see it works. A decl of stub function with expected name is emitted and can be called.

About calling conv. I've tried implementing __global__ as a calling conv before. The issue is that it is part of type system and clang enforces type checking for that. e.g. you cannot assign it to an ordinary function pointer unless that function pointer is also declared with the same calling convention. This will cause lots of type mismatching issues. In CUDA language, __global__ is not part of type system since it is just an attribute.

We could introduce a calling conv for stub, but probably we can only use use it when we mangle the stub function.

tra added inline comments.Nov 14 2019, 3:21 PM
clang/lib/CodeGen/CodeGenModule.cpp
1128–1137

OK. I'm fresh out of ideas.

We should add some sort of assert to make sure that the mangled name does have the prefix we intended to add. Also a TODO to figure out a better way to add a name prefix before mangling.

If anyone else has other suggestions, please chime in.

yaxunl updated this revision to Diff 231929.Dec 3 2019, 9:30 AM

use calling convention to mangle the stub differently.

yaxunl updated this revision to Diff 231966.Dec 3 2019, 12:52 PM

clean up and fix assertions.

I mean, I made a recommendation and you dismissed it.

yaxunl updated this revision to Diff 244988.Feb 17 2020, 9:00 AM

Revised by John's comments. Introduced HIPKernelType for GlobalDecl so that we can use GlobalDecl to
represent stub and kernel during host compilation. Revised mangler so that GlobalDecl carrying
pertinent information can be passed through.

Thanks, I think this approach is really improving the existing code.

clang/include/clang/AST/GlobalDecl.h
40

The attribute here is CUDAGlobalAttr; should this be named in terms of CUDA, or is the CUDA model sufficiently different from HIP that the same implementation concept doesn't apply?

61

This function exists primarily to be used as a common initializer for all the constructors that don't require any of the extra fields. (This file predates LLVM's adoption of C++14, which allows constructors to delegate to other constructors.) That's why it asserts that it's not used with constructors or destructors. So, two questions:

  • Is there a reason this function now needs to tolerate null declarations?
  • There's some subtle implicit behavior here, in that references to kernels default to HIPKernelKind::Kernel. Is that reasonable, or should there be a third assertion that this function isn't used with kernel declarations?
140

explicit, please.

177

The indentation here seems odd.

clang/lib/AST/ItaniumMangle.cpp
397

Does passing down a GlobalDecl everywhere allow us to remove these constructors, i.e. to eliminate the Structor and StructorType fields?

641

This can just be cast, except actually I don't think you need a cast here at all given the code below.

814

There's a default constructor.

1615

The relevant wording from the Itanium spec here is:

For entities in constructors and destructors, the mangling of the complete object constructor or destructor is used as the base function name, i.e. the C1 or D1 version.

But you might consider pulling this out as a helper function, something like:

static GlobalDecl getParentOfLocalEntity(const DeclContext *DC);
1617

This can still be cast.

1907

You can just pass GlobalDecl() here.

4992–4993

The second assertion can just be removed now, since the GD should be carrying the right information.

clang/lib/CodeGen/CodeGenModule.cpp
1025–1026

Let's see if we can make this breakdown no longer necessary, since MangleContext::mangleName should be capable of doing the right thing starting straight from a GD. In fact, maybe we can remove most of the specialized mangling methods (like mangleCXXCtor and mangleCXXDtor) from MangleContext completely?

Unrelatedly: there's an Out declared in the outermost scope, but a bunch of these scopes declare their own Out; could you just fix that while you're editing this function?

1039

Is this the best way of handling this, or should shouldMangleDeclName return true for kernels (or at least stubs?) even in C? Honest question.

3340

Should this be handled in the caller, or would that make things unreasonably difficult?

tra added a comment.Feb 18 2020, 9:03 AM

Nice! Thank you for making these changes.

clang/include/clang/AST/GlobalDecl.h
40

I believe the attribute serves the same purpose in both CUDA and HIP and could be renamed appropriately in a separate patch.

While the changes in this patch are not required for CUDA, CUDA would benefit from them. We could use a generic GPU prefix and migrate CUDA to the same model later. A TODO comment about that would be nice.

clang/lib/CodeGen/CodeGenModule.cpp
1025–1026

Perhaps it would make sense to split this patch into two -- one that changes mangler input to GlobalDecl and the other one dealing with HIP stubs.

rjmccall added inline comments.Feb 18 2020, 9:58 AM
clang/include/clang/AST/GlobalDecl.h
40

I'd just like consistency. If they're serving the same purpose, then as someone with no dog in this fight, I would give precedence to CUDA over HIP in names since it's both the older language and was implemented first in Clang (even if only partially, IIUC). I don't think a generic name works unless we can meaningfully generalize it to all languages with a similar feature, e.g. OpenCL and so on.

clang/lib/CodeGen/CodeGenModule.cpp
1025–1026

Yes, that's a good idea.

tra added inline comments.Feb 18 2020, 10:18 AM
clang/include/clang/AST/GlobalDecl.h
40

Naming, the hardest problem in computer science. :-)
I personally would prefer generalization-with-exclusions over specific name which is inconsistently commingles things that's really specific to that name and things that are more widely used. Alas, right now CUDA is the example of the latter case -- some parts are CUDA-specific and a lot are shared with HIP.

For the new features we've been sort of sticking with using CUDA/HIP for specific parts and GPU for shared functionality, but as things are a lot of shared bits are still 'CUDA' and it's hard to tell them apart. As you point it out, renaming the incumbent names would be a pain, so here we are.

I think using GPUKernelKind with a comment that it reflects HIP & CUDA kernels would be somewhat better choice than CUDAKernelKind which would be somewhat confusing at this point given that CUDA actually does not use it yet. I'm also fine with keeping it HIPKernelKind and postpone the naming decision until CUDA-related parts are actually implemented.

rjmccall added inline comments.Feb 19 2020, 7:24 AM
clang/include/clang/AST/GlobalDecl.h
40

Maybe KernelReferenceKind? It's probably a common concept across all heterogenous-computing models.

tra added inline comments.Feb 19 2020, 9:18 AM
clang/include/clang/AST/GlobalDecl.h
40

SGTM.

yaxunl marked 38 inline comments as done.Mar 5 2020, 10:30 AM
yaxunl added inline comments.
clang/include/clang/AST/GlobalDecl.h
40

changed.

61

By using default constructor of GlobalDecl, null declaration is no longer necessary. I have reverted change to assert.

I think it is a good idea to force GlobalDecl of kernels to be instantiated with the specific ctor. I have added assert to Init to prevent kernels to be instantiated through it.

140

fixed

177

fixed

clang/lib/AST/ItaniumMangle.cpp
397

Removing eliminate the Structor and StructorType will incur significantly more changes. Can it be done later? Thanks.

641

removed cast

814

fixed

1615

extracted to getParentOfLocalEntity

1617

fixed

1907

fixed

4992–4993

removed

clang/lib/CodeGen/CodeGenModule.cpp
1025–1026

Fixed the redundant Out var.

However, removing mangleCXXCtor/Dtor will incur significantly more changes. Can it be done later? Thanks.

1025–1026
1039

This is for extern "C" kernels, which are either not mangled or with simple prefix. I tried returning true for them in shouldMangleDeclName, and they got mangled as Itanium mangling, which seems not right.

3340

fixed

yaxunl updated this revision to Diff 248544.Mar 5 2020, 10:34 AM
yaxunl marked 11 inline comments as done.

Revised by John's and Artem's comments.

yaxunl updated this revision to Diff 248927.Mar 7 2020, 5:35 AM

update patch

tra accepted this revision.Mar 9 2020, 9:22 AM

Few nits. LGTM otherwise.

clang/include/clang/AST/GlobalDecl.h
61

Wording inconsitency -- we're checking for CUDAGlobalAttr but complaining about 'HIP kernels'. Just drop 'HIP' or replace with 'GPU'?

85

Ditto.

129

Same wording nit.

188

Ditto.

clang/lib/CodeGen/CGCUDARuntime.h
70–72

Adding a descriptive comment would be great. Otherwise anyone looking at the function decl without the context of this patch will be puzzled about its meaning and purpose.

Also, perhaps the argument type should be a NamedDecl -- the function is not used on or useful for regular Decl. It will save us few casts in other places, too.

This revision is now accepted and ready to land.Mar 9 2020, 9:22 AM
yaxunl marked 9 inline comments as done.Mar 9 2020, 1:42 PM
yaxunl added inline comments.
clang/include/clang/AST/GlobalDecl.h
61

will use GPU kernel

clang/lib/CodeGen/CGCUDARuntime.h
70–72

done

This revision was automatically updated to reflect the committed changes.
yaxunl marked 2 inline comments as done.
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2020, 2:03 PM