This is an archive of the discontinued LLVM Phabricator instance.

Paths

clang/lib/CodeGen/
- CGCUDANV.cpp
- CGCUDARuntime.h
- CGExpr.cpp
- CodeGenModule.cpp

clang/test/CodeGenCUDA/
- Inputs/cuda.h
- cxx-call-kernel.cpp
- kernel-dbg-info.cu
- kernel-stub-name.cu
- unnamed-types.cu

Differential D86376

[HIP] Emit kernel symbol
ClosedPublic

Authored by yaxunl on Aug 21 2020, 3:03 PM.

Details

Reviewers

tra
rjmccall

Commits

rG5cf2a37f1255: [HIP] Emit kernel symbol

Summary

Currently clang uses a stub function to launch a kernel. This makes interop
with C++ programs inconvenient, since the stub function has a different name
from the kernel, which is required by the ROCm debugger.

This patch emits a variable symbol that has the same name as the kernel
and uses it to register and launch the kernel. This allows a C++ program to
launch a kernel by using the original kernel name.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yaxunl requested review of this revision. Aug 21 2020, 3:03 PM

yaxunl created this revision.

yaxunl edited the summary of this revision. Aug 21 2020, 3:10 PM

How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but queueing kernels fast does not help all that much in the long run: you eventually have to run those kernels on the GPU, so in most cases you just spend a bit more time idling while waiting for the queued kernels to finish. For the savings to be beneficial, you'd need a finely balanced CPU/GPU workload, and that's rather hard to achieve, certainly not to the point where the minor savings here would be meaningful. I would assume the situation on AMD GPUs is not that different.

One side effect of this patch is that there will be no convenient way to set a host-side breakpoint on kernel launch.
Another is that examining the call stack will become somewhat confusing, as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

If the patch does give a measurable performance improvement, can we implement launch config push/pop in a way that the compiler can eliminate by itself when possible, and keep the stub as the host-side kernel entry point? I would prefer to avoid sacrificing debugging usability for performance optimizations that may not matter.

In D86376#2234259, @tra wrote:

How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but queueing kernels fast, does not help all that much in the long run -- you eventually have to run those kernels on the GPU, so in most cases you're just spend a bit more time idling while waiting for the queued kernels to finish. To be beneficial, you'll need a finely balanced CPU/GPU workload and that's rather hard to achieve. Not to the point where the minor savings here would be meaningful. I would assume the situation on AMD GPUs is not that different.

hipPushConfiguration/hipPopConfiguration and the kernel stub can cause 40 ns of overhead, and we have requests to squeeze out any overhead in kernel launch latency.

One side effect of this patch is that there will be no convenient way to set host-side breakpoint on kernel launch.
Another will be that examining call stack will become somewhat confusing as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have as CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating the kernel stub does not affect debuggability negatively. At least this is true for the HIP debugger. In fact, our debugger team intentionally asked us to eliminate any debug information for the kernel stub so that the debugger does not confuse it with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel, which is in the device binary. With the HIP debugger (rocgdb), when the user sets a breakpoint on a kernel, it breaks on the real kernel in the device binary, and the call stack is displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in the device binary.

Another motivation for eliminating the kernel stub is to be able to emit a symbol with the same mangled name as the kernel, as a global variable instead of a function. We need such symbols to be able to launch kernels by their mangled name in a C++ program. If we use the kernel stub as the symbol, we cannot use the original mangled kernel name, since our debugger does not allow that.

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

In D86376#2234458, @yaxunl wrote:

`hipPushConfiguration/hipPopConfiguration' and kernel stub can cause 40 ns overhead, whereas we have requests to squeeze any overhead in kernel launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well intentioned ones) are worth implementing.
This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

One side effect of this patch is that there will be no convenient way to set host-side breakpoint on kernel launch.
Another will be that examining call stack will become somewhat confusing as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have as CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating kernel stub does not affect debugability negatively. At least this is true for HIP debugger. Actually our debugger team intentionally requests to eliminate any debug information for the kernel stub so that it will not confuse the debugger with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel which is in device binary. For HIP debugger (rocmgdb), when the user set break point on a kernel, it will break on the real kernel in device binary, and the call stack are displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in device binary.

You appear to assume debuggability with a HIP-aware debugger. That part I'm not particularly concerned about, as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with ordinary gdb. Imagine someone having to debug a TF app they've got from somewhere. The end user may not even have HIP tools installed. It would be useful to be able to debug up to the point where control is passed to the GPU. The patch will likely have a minor, but still negative, impact on that.

I guess one should still be able to set a breakpoint using the file:line number. If you could verify that it still works with gdb, that would be a reasonable workaround. I think we still need to have some way to set a breakpoint on the kernel launch site (I think it should still work) and on the kernel entry.

So, we have a trade-off of minor performance gain vs a minor debuggability regression. I don't have strong opinions which is the best way to go. By default, with no demonstrated benefit, I'd err on the side of not changing things.

Another motivation for eliminating kernel stub is to be able to emit a symbol with the same mangled name as a kernel as a global variable instead of a function. Since we need such symbols to be able to launch kernels with mangled name in a C++ program. If we use kernel stub as the symbol, we cannot use the original mangled kernel name since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

In D86376#2234547, @tra wrote:

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

In D86376#2234458, @yaxunl wrote:

`hipPushConfiguration/hipPopConfiguration' and kernel stub can cause 40 ns overhead, whereas we have requests to squeeze any overhead in kernel launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well intentioned ones) are worth implementing.
This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launch latency is about 500 us, so the improvement is around 10%.

One side effect of this patch is that there will be no convenient way to set host-side breakpoint on kernel launch.
Another will be that examining call stack will become somewhat confusing as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have as CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating kernel stub does not affect debugability negatively. At least this is true for HIP debugger. Actually our debugger team intentionally requests to eliminate any debug information for the kernel stub so that it will not confuse the debugger with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel which is in device binary. For HIP debugger (rocmgdb), when the user set break point on a kernel, it will break on the real kernel in device binary, and the call stack are displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in device binary.

You appear to assume debuggability with HIP-aware debugger. That part I'm not particularly concerned about as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with the ordinary gdb. Imagine someone having to debug a TF app they've got somewhere. The end user may not even have HIP tools installed. It would be useful to be able to debug until the point where control is passed to the GPU. The patch will likely have a minor, but still negative impact on that.

I guess one should still be able to set a breakpoint using the file:line number. If you could verify that it still works with gdb, that would be a reasonable workaround. I think we still need to have some way to set a breakpoint on the kernel launch site (I think it should still work) and on the kernel entry.

To run HIP applications, users need to install ROCm, which includes rocgdb. A debugger without device code debugging capability is of little use with HIP applications, so I would expect users to always use rocgdb to debug HIP programs. Also, since clang has already removed all debug information for the kernel stub, gdb cannot break on the kernel stub anyway.

Another motivation for eliminating kernel stub is to be able to emit a symbol with the same mangled name as a kernel as a global variable instead of a function. Since we need such symbols to be able to launch kernels with mangled name in a C++ program. If we use kernel stub as the symbol, we cannot use the original mangled kernel name since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

I would say the motivation for this change is twofold: 1. improve latency 2. interoperability with C++ programs.

In D86376#2234719, @yaxunl wrote:

This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

60 *microseconds* to store/load something from memory? That does not sound right. 0.5 milliseconds per kernel launch is also suspiciously high.
For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop O(cacheline) worth of launch config data, the implementation may be doing something wrong. We're talking about the time of O(100) syscalls, and that's way too much work for something that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

To run HIP applications, users need to install ROCm, which includes rocgdb.

I would disagree with that assertion. I do very much want to build a Tensorflow-based app and run it in a container with nothing else but the app and I do want to use existing infrastructure to capture relevant info if the app crashes. Such capture will not be using any HIP-specific tools.
Or I could give it to a user who absolutely does not care what's inside the executable, but who may want to run it under gdb if something goes wrong.

A debugger without device code debugging capability has little use with HIP applications therefore I would expect users to always use rocgdb to debug HIP program.

I agree that it's indeed the case if someone wants/needs to debug GPU code, however, in many cases it's sufficient to be able to debug host-side things only. And it is useful to see the point where we launch kernels and be able to tell which kernel it was.

Also, since clang already removed all debug information for kernel stub, gdb cannot break on kernel stub any way.

gdb is aware of the ELF symbols, and those are often exposed in shared libraries. While you will not have type info, etc., you can still set a breakpoint and get a sensible stack trace in many cases. We usually build with some amount of debug info, and it has proved rather helpful for pinpointing GPU failures via the host-side stack trace, as it includes the symbol name of the host-side stub, which allows identifying the device-side kernel. If all we see in the stack trace is hipLaunchKernel, it would be considerably less helpful, especially when there's no detailed debug info which would allow us to dig out the kernel name from its arguments. All we'd know is that we've launched *some* kernel.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In D86376#2234824, @tra wrote:

In D86376#2234719, @yaxunl wrote:

This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

60 *micro seconds* to store/load something from memory? It does not sound right. 0.5 millisecond per kernel launch is also suspiciously high.
For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop a O(cacheline) worth of launch config data, the implementation may be doing something wrong. We're talking about O(100) syscalls and that's way too much work for something that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

My previous measurements did not include a warm-up, which led to one-time overhead from device initialization and loading of the device binary. With warm-up, the calls to __hipPushCallConfigure/__hipPopCallConfigure take about 19 us. Based on the trace from rocprofiler, the time spent inside these functions is negligible; most of the time is spent making the calls. These functions live in a shared library, which may be why the calls take so long. Making them always_inline might get rid of the overhead, but that would require exposing internal data structures.

The kernel launch latency is measured with a simple loop in which a simple kernel is launched and then hipStreamSynchronize is called. The trace is collected by rocprofiler, and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, it is about 46 us, an improvement of about 40%. The reduction of 31 us exceeds 19 us because the patch also eliminates the overhead of the kernel stub.

I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In a HIP program, a global symbol is generated in the host binary to identify each kernel. This symbol is associated with the device kernel by a call to hipRegisterFunction in the init functions. Each time the kernel needs to be launched, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel; let's call it the kernel symbol. Currently it is the kernel stub function, but it could be any global symbol: as long as it is registered with hipRegisterFunction, hipLaunchKernel can use it to find the right kernel and launch it.

In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol. Since the kernel symbol is defined in object files generated from HIP. For C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve to the correct kernel symbol, then the correct kernel can be launched.

Here comes the nuance with using the kernel stub function as the kernel symbol. As you may remember, there was a previous patch for HIP that changed the kernel stub name. rocgdb requires the device stub to have a different name from the real kernel, since otherwise it cannot break on the real kernel only. As a result, the kernel stub now has the prefix __device_stub_ before mangling.

For example, a kernel foo will have a kernel stub named __device_stub_foo.

For a C/C++ program to launch kernel foo, it needs to declare an external symbol __device_stub_foo and then launch it. Of course this is an annoyance for C/C++ users, especially since it involves mangled names.

However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.

So the solution is to get rid of the kernel stub function. Instead of using the kernel stub function as the kernel symbol, we will emit a global variable as the kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.
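To make the registration idea above concrete, here is a minimal host-only sketch. It does not use the HIP runtime: mockRegisterFunction and mockLookupKernel are hypothetical stand-ins for what hipRegisterFunction and the lookup inside hipLaunchKernel are described as doing, and device_stub_foo / foo_kernel_symbol are illustrative names (the real stub carries the __device_stub_ prefix). The only point is that the table is keyed by an address, so a function and a variable serve equally well as the kernel symbol.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the runtime's registration table:
// host-side symbol address -> device-side kernel name.
static std::unordered_map<const void*, std::string>& kernelTable() {
  static std::unordered_map<const void*, std::string> table;
  return table;
}

// Mock of the registration step performed in the init functions.
void mockRegisterFunction(const void* hostSymbol, const std::string& deviceName) {
  assert(hostSymbol != nullptr);
  kernelTable()[hostSymbol] = deviceName;
}

// Mock of the lookup a launch performs; returns "" for an unregistered symbol.
std::string mockLookupKernel(const void* hostSymbol) {
  auto it = kernelTable().find(hostSymbol);
  return it == kernelTable().end() ? std::string() : it->second;
}

// Option 1: the kernel symbol is a stub function (illustrative name).
void device_stub_foo(int*) {}

// Option 2: the kernel symbol is a global variable (illustrative name).
// Only its address matters; its contents are never read in this sketch.
char foo_kernel_symbol;
```

Registering either `reinterpret_cast<const void*>(&device_stub_foo)` or `&foo_kernel_symbol` under the name "foo" makes a later lookup by that same address return "foo"; an address that was never registered yields an empty string.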

In D86376#2236501, @yaxunl wrote:

My previous measurements did not warming up, which caused some one time overhead due to device initialization and loading of device binary. With warm up, the call of __hipPushCallConfigure/__hipPopCallConfigure takes about 19 us. Based on the trace from rocprofile, the time spent inside these functions can be ignored. Most of the time is spent making the calls. These functions stay in a shared library, which may be the reason why they take such long time. Making them always_inline may get rid of the overhead, however, that would require exposing internal data structures.

It's still suspiciously high. AFAICT, config push/pop is just a std::vector push/pop. It should not take *that* long. A few function calls should not lead to microseconds of overhead once the linker has resolved the symbols, even if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

The kernel launching latency are measured by a simple loop in which a simple kernel is launched then hipStreamSynchronize is called. trace is collected by rocprofiler and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The decrement of 31 us is more than 19 us since it also eliminates the overhead of kernel stub.

This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In HIP program, a global symbol is generated in host binary to identify each kernel. This symbol is associated with the device kernel by a call of hipRegisterFunction in init functions. Each time the kernel needs to be called, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel. Let's call it the kernel symbol. Currently it is the kernel stub function, however, it could be any global symbol, as long as it is registered with hipRegisterFunction, then hipLaunchKernel can use it to find the right kernel and launch it.

So far so good, it matches the way CUDA does that.

In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

Since the kernel symbol is defined in object files generated from HIP.
For C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve to the correct kernel symbol, then the correct kernel can be launched.

The first sentence looks incomplete. It seems to imply that hipLaunchKernel uses the device-side kernel symbol and it's the linker which ties host-side reference with device-side symbol. If that's the case, then I don't understand what purpose is served by hipRegisterFunction. AFAICT, it's not used in this scenario at all.

My mental model of kernel launch mechanics looks like this:

1. For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with the name 'foo' and a device-side real kernel 'foo'.
2. The host-side linker has no access to device-side symbols, but we do need to associate the host-side and device-side 'foo' instances.
3. The address of the host-side foo is registered with the runtime to map it to the device symbol with the name 'foo'.
4. When a kernel is launched, the call site sets up the launch config and calls the stub, passing it the kernel arguments.
5. The stub calls the kernel launch function and passes the host-side foo address to it.
6. The launch function finds the device-side symbol name via the registration info and does a device-side address lookup to obtain its device address.
7. The device-side function runs.

In this scenario, the host-side stub for foo is a regular function, which gdb can stop on and examine kernel arguments.
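The launch flow in that mental model can be sketched in plain host C++ as follows. Everything here is a mock under stated assumptions: pushCallConfig, popCallConfig, mockLaunch, foo_stub and launchFoo are hypothetical names, no GPU or runtime is involved, and a "launch" just appends a record to a vector. It is meant only to show why the stub is an ordinary function that a host debugger can stop in and whose argument it can inspect.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct LaunchConfig { unsigned grid; unsigned block; };
struct LaunchRecord { std::string deviceName; LaunchConfig config; int arg; };

static std::vector<LaunchConfig> configStack;        // filled at the call site
static std::map<const void*, std::string> registry;  // host address -> device name
static std::vector<LaunchRecord> launches;           // stands in for "ran on the GPU"

void pushCallConfig(LaunchConfig c) { configStack.push_back(c); }

LaunchConfig popCallConfig() {
  assert(!configStack.empty());
  LaunchConfig c = configStack.back();
  configStack.pop_back();
  return c;
}

// Mock launch function: resolves the host address to a device-side name
// and records the launch instead of running anything on a device.
void mockLaunch(const void* hostAddr, LaunchConfig c, int arg) {
  launches.push_back({registry.at(hostAddr), c, arg});
}

// Host-side stub for kernel foo: a regular function with the kernel's
// parameters, so a breakpoint here shows `arg` in the stack trace.
void foo_stub(int arg) {
  LaunchConfig c = popCallConfig();
  mockLaunch(reinterpret_cast<const void*>(&foo_stub), c, arg);
}

// Registration: map the stub's address to the device symbol name.
void registerKernels() {
  registry[reinterpret_cast<const void*>(&foo_stub)] = "foo";
}

// Roughly what a foo<<<grid, block>>>(arg) call site does in this model.
void launchFoo(unsigned grid, unsigned block, int arg) {
  pushCallConfig({grid, block});
  foo_stub(arg);
}
```

After registerKernels(), a call such as launchFoo(4, 256, 7) leaves one record naming "foo" with that configuration and argument, and the config stack is empty again.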

How is the process different for HIP? I know that we've changed the stub name to avoid debugger confusion about which of the entities corresponds to 'foo'.

Here comes the nuance with kernel stub function as the kernel symbol. If you still remember, there was a previous patch for HIP to change the kernel stub name. rocgdb requires the device stub to have a different name than the real kernel, since otherwise it will not be able to break on the real kernel only. As a result, the kernel stub now has a prefix __device_stub_ before mangling.

For example, a kernel foo will have a kernel stub with name __device_stub_foo.

For a C/C++ program to call kernel foo, it needs to declare an external symbol __device_stub_foo then launch it. Of course this is an annoyance for C/C++ users, especially this involves mangled names.

It's all done by compiler under the hood. I'm not sure how the stub name affects C/C++ users.

However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.
Now the solution is to get rid of the kernel stub function. Instead of use kernel stub function as kernel symbol, we will emit a global variable as kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.

I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debugability, which is worse, IMO.
Could you give me an example where the stub name causes problems?

In D86376#2236704, @tra wrote:

It's still suspiciously high. AFAICT, config/push/pull is just an std::vector push/pop. It should not take *that* long. Few function calls should not lead to microseconds of overhead, once linker has resolved the symbol, if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

You are right. The 19 us is mostly overhead from rocprofiler. If I do not use rocprofiler and instead use a simple loop to measure the execution time of __hipPushCallConfigure/__hipPopCallConfigure, I get 180 ns.
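For reference, a simple timing loop of the kind described might look like the sketch below. It is an assumption-laden illustration, not the measurement that was actually run: pushConfig and popConfig are hypothetical stand-ins backed by a std::vector (as suggested earlier in the thread), and because they live in the same translation unit the compiler may inline them, unlike calls into a shared library. Absolute numbers from it should therefore not be compared with the figures quoted here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical launch configuration, roughly a cache line of data.
struct CallConfig {
  unsigned grid[3];
  unsigned block[3];
  std::size_t sharedMem;
  void* stream;
};

static std::vector<CallConfig> g_configStack;

void pushConfig(const CallConfig& c) { g_configStack.push_back(c); }

CallConfig popConfig() {
  assert(!g_configStack.empty());
  CallConfig c = g_configStack.back();
  g_configStack.pop_back();
  return c;
}

// Returns the average wall-clock time of one push/pop pair in nanoseconds.
double averagePushPopNanos(std::size_t iterations) {
  if (iterations == 0) return 0.0;
  using Clock = std::chrono::steady_clock;
  CallConfig cfg{{1, 1, 1}, {64, 1, 1}, 0, nullptr};

  // Warm-up so one-time costs (first allocation, cold caches) are excluded.
  for (std::size_t i = 0; i < 1000; ++i) {
    pushConfig(cfg);
    popConfig();
  }

  unsigned sink = 0;
  const auto start = Clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    cfg.grid[0] = static_cast<unsigned>(i);
    pushConfig(cfg);
    sink += popConfig().grid[0];
  }
  const auto stop = Clock::now();

  // Keep the accumulated value observable so the loop is not optimized away.
  volatile unsigned keep = sink;
  (void)keep;

  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Calling averagePushPopNanos(1000000) gives a per-pair average; the warm-up loop mirrors the earlier observation that measurements without warm-up were dominated by one-time costs.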

The kernel launching latency are measured by a simple loop in which a simple kernel is launched then hipStreamSynchronize is called. trace is collected by rocprofiler and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The decrement of 31 us is more than 19 us since it also eliminates the overhead of kernel stub.

This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

The 19 us overhead was due to rocprofiler. Without rocprofiler, I can only measure the average duration of a kernel launch together with hipStreamSynchronize. When the kernel is empty, this serves as an estimate of kernel launch latency. Measured this way, the latency is about 14.0 us, and the improvement due to this patch is not significant.

In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

I mean the host-side symbol. A host program can only use host-side symbol to launch a kernel.

I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debugability, which is worse, IMO.
Could you give me an example where the stub name causes problems?

For example, in a HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is

void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, users currently have to use

void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);
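The calls above leave `args` unspecified. As I understand the convention, it is an array with one entry per kernel parameter, each entry pointing at that argument's value. The sketch below illustrates that packing with a host-only mock: Dim3 and mockLaunchKernel are hypothetical stand-ins that mirror the parameter order used above, and the mock simply calls the function on the host once instead of dispatching to a GPU (in the real runtime the first parameter is only a lookup key).

```cpp
#include <cassert>
#include <cstddef>

struct Dim3 { unsigned x, y, z; };  // hypothetical stand-in for dim3

using KernelBody = void (*)(int*);

// Mock with the parameter order (kernel symbol, grid, block, args,
// shared memory bytes, stream). Returns 0 on success, 1 on bad input.
int mockLaunchKernel(const void* kernelSymbol, Dim3 grid, Dim3 block,
                     void** args, std::size_t sharedMemBytes, void* stream) {
  (void)grid; (void)block; (void)sharedMemBytes; (void)stream;
  if (kernelSymbol == nullptr || args == nullptr) return 1;
  // args[0] points at the first argument's value, which is itself an int*.
  int* firstArg = *static_cast<int**>(args[0]);
  // Host-only shortcut: treat the symbol as a callable host function.
  reinterpret_cast<KernelBody>(const_cast<void*>(kernelSymbol))(firstArg);
  return 0;
}

// Illustrative "kernel" body that runs on the host in this mock.
void foo(int* out) { *out = 42; }
```

A caller would pack the single pointer argument as `int* p = &value; void* args[] = {&p};` and pass `args` along with the symbol; note that the array holds the address of `p`, not `p` itself.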

yaxunl retitled this revision from [HIP] Improve kernel launching latency to [HIP] Simplify kernel launching. Aug 26 2020, 9:29 AM

In D86376#2239391, @yaxunl wrote:
For example, in HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is
void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);
Due to the prefixed kernel stub name, currently the users have to use
void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);

Ah. That *is* painful. Perhaps we can have our cake and eat it too here and do something like this:

Do generate a variable with the kernel name and use it for hipLaunchKernel(), but also keep the stub and call it for <<<>>> launches; only, instead of registering the stub itself as the GPU-side kernel identifier, register the variable.

This way, __device_stub_<kernel> will show up in the stack trace (no debuggability regression), but direct calls to hipLaunchKernel can use the unprefixed kernel name.

WDYT?

In D86376#2239618, @tra wrote:
In D86376#2239391, @yaxunl wrote:
For example, in HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is
void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);
Due to the prefixed kernel stub name, currently the users have to use
void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);
Ah. That *is* painful. Perhaps we can have our cake and eat it too here and do something like this:

Do generate a variable with the kernel name and use it for hipLaunchKernel(), but also keep the stub and call it for <<<>>> launches; only, instead of registering the stub itself as the GPU-side kernel identifier, register the variable.

This way, __device_stub_<kernel> will show up in the stack trace (no debuggability regression), but direct calls to hipLaunchKernel can use the unprefixed kernel name.

WDYT?

Yes that should work. Will do.

Revised per Artem's comments.

Actually there is one issue with this approach.

HIP has APIs for launching kernels that accept the kernel as a function pointer argument. Currently, taking the address of a kernel yields the stub function. These kernel launching APIs will not work if we use the kernel symbol to register the kernel. A solution is to return the kernel symbol instead of the stub function when taking the address of the kernel in host compilation, i.e. if a kernel is assigned to a function pointer in host code, the pointer gets the kernel symbol instead of the stub function. This will make the kernel launching APIs work.

To keep the triple chevron working, the kernel symbol will be initialized with the address of the stub function. For a triple chevron call, the address of the stub function is loaded from the kernel symbol and invoked.
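As a rough illustration of this scheme, the sketch below models it in plain host C++ with no HIP dependency. All names in it (device_stub_foo, foo_handle, address_of_kernel, chevron_launch) are invented for the example; in the real patch the variable carries the kernel's own mangled name and the stub keeps its __device_stub__ prefix.

```cpp
#include <cassert>

using KernelFn = void (*)(int *);

static int stub_calls = 0;

// Hypothetical stand-in for the host-side stub function.
void device_stub_foo(int *p) {
  ++stub_calls;
  if (p)
    *p = 42;
}

// Hypothetical stand-in for the kernel symbol: a constant pointer
// variable initialized with the address of the stub.
const KernelFn foo_handle = &device_stub_foo;

// In this model, "taking the address of the kernel" on the host yields
// the address of the variable, not the address of the stub.
const void *address_of_kernel() { return &foo_handle; }

// Models a triple chevron call: load the stub address from the kernel
// symbol, then invoke it.
void chevron_launch(const void *handle, int *arg) {
  assert(handle != nullptr);
  KernelFn stub = *static_cast<const KernelFn *>(handle);
  stub(arg);
}
```

The same pointer value can be handed both to a launch API (which would look it up in the registration table) and to chevron_launch, which is the property the proposal is after.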

In D86376#2551298, @yaxunl wrote:

Actually there is one issue with this approach.

HIP has APIs for launching kernels that accept the kernel as a function pointer argument. Currently, taking the address of a kernel yields the stub function. These kernel launching APIs will not work if we use the kernel symbol to register the kernel. A solution is to return the kernel symbol instead of the stub function when taking the address of the kernel in host compilation, i.e. if a kernel is assigned to a function pointer in host code, the pointer gets the kernel symbol instead of the stub function. This will make the kernel launching APIs work.

To keep the triple chevron working, the kernel symbol will be initialized with the address of the stub function. For a triple chevron call, the address of the stub function is loaded from the kernel symbol and invoked.

This could work.
Do we really need an indirection? If we know the stub address when we initialize the symbol with it, we should be able to use that address for <<<>>>.

In D86376#2552066, @tra wrote:


This could work.
Do we really need an indirection? If we know the stub address when we initialize the symbol with it, we should be able to use that address for <<<>>>.

For a triple chevron launch with a kernel name, it is not needed. We only need the indirection for a triple chevron launch through a function pointer, in which case we do not know the stub function at compile time. Launching through a function pointer is allowed by CUDA/HIP.
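The distinction can be sketched in plain host C++ as follows. This is only a model under the assumption that a kernel pointer holds the address of a handle variable; the names stub_add_one, stub_double, add_one, twice, launch_named and launch_through_pointer are hypothetical.

```cpp
#include <cassert>

using Stub = void (*)(int *);

// Hypothetical stubs for two different kernels.
void stub_add_one(int *p) { *p += 1; }
void stub_double(int *p) { *p *= 2; }

// Hypothetical handle variables, each initialized with its stub.
const Stub add_one = &stub_add_one;
const Stub twice = &stub_double;

// Launch by kernel name: the stub is known at compile time, so it can
// be called directly without touching the handle.
void launch_named(int *p) { stub_add_one(p); }

// Launch through a pointer: only the handle address is known, so the
// stub has to be loaded from the handle at run time.
void launch_through_pointer(const Stub *kernel_ptr, int *p) {
  assert(kernel_ptr != nullptr);
  (*kernel_ptr)(p);
}
```

Because kernel_ptr can be reassigned at run time, the load is the only way to find the right stub in the second case.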

In D86376#2552419, @yaxunl wrote:

For a triple chevron launch with a kernel name, it is not needed. We only need the indirection for a triple chevron launch through a function pointer, in which case we do not know the stub function at compile time. Launching through a function pointer is allowed by CUDA/HIP.

Got it. We'll need to map the address of the symbol into the address of the stub.

Adding an indirection brings another question -- what's supposed to happen if we're passed a pointer that's *not* a pointer to the symbol. I.e. it does not point to the pointer to the stub.

Can we backtrack a bit and review our constraints/assumptions? I vaguely recall AMD introduced __device_stub because the debugger needed to distinguish the host-side stub from the device-side kernel.
If we add data with the same name, wouldn't it cause the same confusion about what kernel is? If we are allowed to use 'kernel' on the host, is there a reason not to rename __device_stub__kernel back to kernel and just use the stub address everywhere?

Another question -- assuming that the stub can't be renamed, can we give the stub an alias with the name kernel? This way no matter how we take the address, it will always point to the stub.

In D86376#2552524, @tra wrote:


Got it. We'll need to map the address of the symbol into the address of the stub.

Adding an indirection brings another question -- what's supposed to happen if we're passed a pointer that's *not* a pointer to the symbol. I.e. it does not point to the pointer to the stub.

The same thing could happen before this change, i.e., a function pointer does not contain the address of a stub function. In either case it will be UB. This change does not make the situation worse.

Can we backtrack a bit and review our constraints/assumptions? I vaguely recall AMD introduced __device_stub because the debugger needed to distinguish the host-side stub from the device-side kernel.
If we add data with the same name, wouldn't it cause the same confusion about what kernel is? If we are allowed to use 'kernel' on the host, is there a reason not to rename __device_stub__kernel back to kernel and just use the stub address everywhere?

We have confirmed with our debugger team that emitting this symbol is OK for rocgdb since it is a variable symbol, not a function symbol.

Another question -- assuming that the stub can't be renamed, can we give the stub an alias with the name kernel? This way no matter how we take the address, it will always point to the stub.

We have tried this and it did not work. The alias ends up as a function symbol, which is not allowed by rocgdb.

Handle launching a kernel through the API and launching a kernel through a function pointer.

ping

So, to summarize how the patch changes the under-the-hood kernel launch machinery:

device-side is unchanged. Kernel function is generated with the real kernel name
host-side stub is still generated with the __device_stub prefix.
host-side generates a 'handle' variable with the kernel function name, which is a pointer to the stub.
host-side registers the handle variable -> device-side kernel name association with the HIP runtime.
the address of the handle variable is used everywhere where we need a kernel pointer on the host side. I.e. passing kernel pointers around, referring to kernels across TUs, etc.
<<<>>> becomes an indirect call to a __device_stub function using the pointer retrieved from the handle.
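The registration step in this summary can be modeled with a small host-only C++ sketch. ToyRuntime and its member functions are invented stand-ins for the HIP runtime's registration table, and ckernel_handle and device_stub_ckernel are hypothetical names for the handle variable and the stub; none of this is the actual runtime API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

using Stub = void (*)();

// Hypothetical host-side stub.
void device_stub_ckernel() {}

// Hypothetical handle variable: a pointer to the stub.
const Stub ckernel_handle = &device_stub_ckernel;

// Toy stand-in for the runtime's table that associates a handle
// address with a device-side kernel name.
class ToyRuntime {
public:
  void registerFunction(const void *handle, std::string deviceName) {
    assert(handle != nullptr);
    names_[handle] = std::move(deviceName);
  }

  // Returns the device-side kernel name registered for a handle, or an
  // empty string if the pointer was never registered.
  std::string lookup(const void *handle) const {
    auto it = names_.find(handle);
    return it == names_.end() ? std::string() : it->second;
  }

private:
  std::unordered_map<const void *, std::string> names_;
};
```

In this model, any code that holds the handle address can both find the device-side kernel name through lookup and reach the stub by loading from the handle.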

This revision is now accepted and ready to land. Mar 1 2021, 9:51 AM

Closed by commit rG5cf2a37f1255: [HIP] Emit kernel symbol (authored by yaxunl). Mar 1 2021, 1:32 PM

This revision was automatically updated to reflect the committed changes.

yaxunl added a commit: rG5cf2a37f1255: [HIP] Emit kernel symbol.

Herald added a project: Restricted Project. Mar 1 2021, 1:32 PM

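Revision Contents

Path	Size
clang/lib/CodeGen/CGCUDANV.cpp	53 lines
clang/lib/CodeGen/CGCUDARuntime.h	8 lines
clang/lib/CodeGen/CGExpr.cpp	22 lines
clang/lib/CodeGen/CodeGenModule.cpp	16 lines
clang/test/CodeGenCUDA/Inputs/cuda.h	12 lines
clang/test/CodeGenCUDA/cxx-call-kernel.cpp	19 lines
clang/test/CodeGenCUDA/kernel-dbg-info.cu	5 lines
clang/test/CodeGenCUDA/kernel-stub-name.cu	92 lines
clang/test/CodeGenCUDA/unnamed-types.cu	4 lines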

Diff 327268

clang/lib/CodeGen/CGCUDANV.cpp

Show All 36 Lines	private:
llvm::IntegerType *IntTy, *SizeTy;		llvm::IntegerType *IntTy, *SizeTy;
llvm::Type *VoidTy;		llvm::Type *VoidTy;
llvm::PointerType *CharPtrTy, *VoidPtrTy, *VoidPtrPtrTy;		llvm::PointerType *CharPtrTy, *VoidPtrTy, *VoidPtrPtrTy;

/// Convenience reference to LLVM Context		/// Convenience reference to LLVM Context
llvm::LLVMContext &Context;		llvm::LLVMContext &Context;
/// Convenience reference to the current module		/// Convenience reference to the current module
llvm::Module &TheModule;		llvm::Module &TheModule;
/// Keeps track of kernel launch stubs emitted in this module		/// Keeps track of kernel launch stubs and handles emitted in this module
struct KernelInfo {		struct KernelInfo {
llvm::Function *Kernel;		llvm::Function *Kernel; // stub function to help launch kernel
const Decl *D;		const Decl *D;
};		};
llvm::SmallVector<KernelInfo, 16> EmittedKernels;		llvm::SmallVector<KernelInfo, 16> EmittedKernels;
		// Map a device stub function to a symbol for identifying kernel in host code.
		// For CUDA, the symbol for identifying the kernel is the same as the device
		// stub function. For HIP, they are different.
		llvm::DenseMap<llvm::Function *, llvm::GlobalValue *> KernelHandles;
		// Map a kernel handle to the kernel stub.
		llvm::DenseMap<llvm::GlobalValue *, llvm::Function *> KernelStubs;
struct VarInfo {		struct VarInfo {
llvm::GlobalVariable *Var;		llvm::GlobalVariable *Var;
const VarDecl *D;		const VarDecl *D;
DeviceVarFlags Flags;		DeviceVarFlags Flags;
};		};
llvm::SmallVector<VarInfo, 16> DeviceVars;		llvm::SmallVector<VarInfo, 16> DeviceVars;
/// Keeps track of variable containing handle of GPU binary. Populated by		/// Keeps track of variable containing handle of GPU binary. Populated by
/// ModuleCtorFunction() and used to create corresponding cleanup calls in		/// ModuleCtorFunction() and used to create corresponding cleanup calls in
▲ Show 20 Lines • Show All 90 Lines • ▼ Show 20 Lines	private:
/// Creates module destructor function		/// Creates module destructor function
llvm::Function *makeModuleDtorFunction();		llvm::Function *makeModuleDtorFunction();
/// Transform managed variables for device compilation.		/// Transform managed variables for device compilation.
void transformManagedVars();		void transformManagedVars();

public:		public:
CGNVCUDARuntime(CodeGenModule &CGM);		CGNVCUDARuntime(CodeGenModule &CGM);

		llvm::GlobalValue *getKernelHandle(llvm::Function *F, GlobalDecl GD) override;
		llvm::Function *getKernelStub(llvm::GlobalValue *Handle) override {
		auto Loc = KernelStubs.find(Handle);
		assert(Loc != KernelStubs.end());
		return Loc->second;
		}
void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;		void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;
void handleVarRegistration(const VarDecl *VD,		void handleVarRegistration(const VarDecl *VD,
llvm::GlobalVariable &Var) override;		llvm::GlobalVariable &Var) override;
void		void
internalizeDeviceSideVar(const VarDecl *D,		internalizeDeviceSideVar(const VarDecl *D,
llvm::GlobalValue::LinkageTypes &Linkage) override;		llvm::GlobalValue::LinkageTypes &Linkage) override;

llvm::Function *finalizeModule() override;		llvm::Function *finalizeModule() override;
▲ Show 20 Lines • Show All 102 Lines • ▼ Show 20 Lines	if (CGM.getContext().shouldExternalizeStaticVar(ND) &&
DeviceSideName = std::string(Out.str());		DeviceSideName = std::string(Out.str());
}		}
return DeviceSideName;		return DeviceSideName;
}		}

void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,		void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
FunctionArgList &Args) {		FunctionArgList &Args) {
EmittedKernels.push_back({CGF.CurFn, CGF.CurFuncDecl});		EmittedKernels.push_back({CGF.CurFn, CGF.CurFuncDecl});
		if (auto *GV = dyn_cast<llvm::GlobalVariable>(KernelHandles[CGF.CurFn])) {
		GV->setLinkage(CGF.CurFn->getLinkage());
		GV->setInitializer(CGF.CurFn);
		}
if (CudaFeatureEnabled(CGM.getTarget().getSDKVersion(),		if (CudaFeatureEnabled(CGM.getTarget().getSDKVersion(),
CudaFeature::CUDA_USES_NEW_LAUNCH) ||		CudaFeature::CUDA_USES_NEW_LAUNCH) ||
(CGF.getLangOpts().HIP && CGF.getLangOpts().HIPUseNewLaunchAPI))		(CGF.getLangOpts().HIP && CGF.getLangOpts().HIPUseNewLaunchAPI))
emitDeviceStubBodyNew(CGF, Args);		emitDeviceStubBodyNew(CGF, Args);
else		else
emitDeviceStubBodyLegacy(CGF, Args);		emitDeviceStubBodyLegacy(CGF, Args);
}		}

▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines	llvm::FunctionCallee cudaPopConfigFn = CGM.CreateRuntimeFunction(
/*isVarArg=*/false),		/*isVarArg=*/false),
addUnderscoredPrefixToName("PopCallConfiguration"));		addUnderscoredPrefixToName("PopCallConfiguration"));

CGF.EmitRuntimeCallOrInvoke(cudaPopConfigFn,		CGF.EmitRuntimeCallOrInvoke(cudaPopConfigFn,
{GridDim.getPointer(), BlockDim.getPointer(),		{GridDim.getPointer(), BlockDim.getPointer(),
ShmemSize.getPointer(), Stream.getPointer()});		ShmemSize.getPointer(), Stream.getPointer()});

// Emit the call to cudaLaunch		// Emit the call to cudaLaunch
llvm::Value *Kernel = CGF.Builder.CreatePointerCast(CGF.CurFn, VoidPtrTy);		llvm::Value *Kernel =
		CGF.Builder.CreatePointerCast(KernelHandles[CGF.CurFn], VoidPtrTy);
CallArgList LaunchKernelArgs;		CallArgList LaunchKernelArgs;
LaunchKernelArgs.add(RValue::get(Kernel),		LaunchKernelArgs.add(RValue::get(Kernel),
cudaLaunchKernelFD->getParamDecl(0)->getType());		cudaLaunchKernelFD->getParamDecl(0)->getType());
LaunchKernelArgs.add(RValue::getAggregate(GridDim), Dim3Ty);		LaunchKernelArgs.add(RValue::getAggregate(GridDim), Dim3Ty);
LaunchKernelArgs.add(RValue::getAggregate(BlockDim), Dim3Ty);		LaunchKernelArgs.add(RValue::getAggregate(BlockDim), Dim3Ty);
LaunchKernelArgs.add(RValue::get(KernelArgs.getPointer()),		LaunchKernelArgs.add(RValue::get(KernelArgs.getPointer()),
cudaLaunchKernelFD->getParamDecl(3)->getType());		cudaLaunchKernelFD->getParamDecl(3)->getType());
LaunchKernelArgs.add(RValue::get(CGF.Builder.CreateLoad(ShmemSize)),		LaunchKernelArgs.add(RValue::get(CGF.Builder.CreateLoad(ShmemSize)),
Show All 38 Lines	for (const VarDecl *A : Args) {
llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");		llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");
CGF.Builder.CreateCondBr(CBZero, NextBlock, EndBlock);		CGF.Builder.CreateCondBr(CBZero, NextBlock, EndBlock);
CGF.EmitBlock(NextBlock);		CGF.EmitBlock(NextBlock);
Offset += TInfo.Width;		Offset += TInfo.Width;
}		}

// Emit the call to cudaLaunch		// Emit the call to cudaLaunch
llvm::FunctionCallee cudaLaunchFn = getLaunchFn();		llvm::FunctionCallee cudaLaunchFn = getLaunchFn();
llvm::Value *Arg = CGF.Builder.CreatePointerCast(CGF.CurFn, CharPtrTy);		llvm::Value *Arg =
		CGF.Builder.CreatePointerCast(KernelHandles[CGF.CurFn], CharPtrTy);
CGF.EmitRuntimeCallOrInvoke(cudaLaunchFn, Arg);		CGF.EmitRuntimeCallOrInvoke(cudaLaunchFn, Arg);
CGF.EmitBranch(EndBlock);		CGF.EmitBranch(EndBlock);

CGF.EmitBlock(EndBlock);		CGF.EmitBlock(EndBlock);
}		}

// Replace the original variable Var with the address loaded from variable		// Replace the original variable Var with the address loaded from variable
// ManagedVar populated by HIP runtime.		// ManagedVar populated by HIP runtime.
▲ Show 20 Lines • Show All 77 Lines • ▼ Show 20 Lines	llvm::Function *CGNVCUDARuntime::makeRegisterGlobalsFn() {
// each emitted kernel.		// each emitted kernel.
llvm::Argument &GpuBinaryHandlePtr = *RegisterKernelsFunc->arg_begin();		llvm::Argument &GpuBinaryHandlePtr = *RegisterKernelsFunc->arg_begin();
for (auto &&I : EmittedKernels) {		for (auto &&I : EmittedKernels) {
llvm::Constant *KernelName =		llvm::Constant *KernelName =
makeConstantString(getDeviceSideName(cast<NamedDecl>(I.D)));		makeConstantString(getDeviceSideName(cast<NamedDecl>(I.D)));
llvm::Constant *NullPtr = llvm::ConstantPointerNull::get(VoidPtrTy);		llvm::Constant *NullPtr = llvm::ConstantPointerNull::get(VoidPtrTy);
llvm::Value *Args[] = {		llvm::Value *Args[] = {
&GpuBinaryHandlePtr,		&GpuBinaryHandlePtr,
Builder.CreateBitCast(I.Kernel, VoidPtrTy),		Builder.CreateBitCast(KernelHandles[I.Kernel], VoidPtrTy),
KernelName,		KernelName,
KernelName,		KernelName,
llvm::ConstantInt::get(IntTy, -1),		llvm::ConstantInt::get(IntTy, -1),
NullPtr,		NullPtr,
NullPtr,		NullPtr,
NullPtr,		NullPtr,
NullPtr,		NullPtr,
llvm::ConstantPointerNull::get(IntTy->getPointerTo())};		llvm::ConstantPointerNull::get(IntTy->getPointerTo())};
▲ Show 20 Lines • Show All 554 Lines • ▼ Show 20 Lines
// Returns module constructor to be added.		// Returns module constructor to be added.
llvm::Function *CGNVCUDARuntime::finalizeModule() {		llvm::Function *CGNVCUDARuntime::finalizeModule() {
if (CGM.getLangOpts().CUDAIsDevice) {		if (CGM.getLangOpts().CUDAIsDevice) {
transformManagedVars();		transformManagedVars();
return nullptr;		return nullptr;
}		}
return makeModuleCtorFunction();		return makeModuleCtorFunction();
}		}

		llvm::GlobalValue *CGNVCUDARuntime::getKernelHandle(llvm::Function *F,
		GlobalDecl GD) {
		auto Loc = KernelHandles.find(F);
		if (Loc != KernelHandles.end())
		return Loc->second;

		if (!CGM.getLangOpts().HIP) {
		KernelHandles[F] = F;
		KernelStubs[F] = F;
		return F;
		}

		auto *Var = new llvm::GlobalVariable(
		TheModule, F->getType(), /*isConstant=*/true, F->getLinkage(),
		/*Initializer=*/nullptr,
		CGM.getMangledName(
		GD.getWithKernelReferenceKind(KernelReferenceKind::Kernel)));
		Var->setAlignment(CGM.getPointerAlign().getAsAlign());
		Var->setDSOLocal(F->isDSOLocal());
		Var->setVisibility(F->getVisibility());
		KernelHandles[F] = Var;
		KernelStubs[Var] = F;
		return Var;
		}

clang/lib/CodeGen/CGCUDARuntime.h

Show All 9 Lines
// subclasses of this implement code generation for specific CUDA		// subclasses of this implement code generation for specific CUDA
// runtime libraries.		// runtime libraries.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H		#ifndef LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H
#define LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H		#define LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H

		#include "clang/AST/GlobalDecl.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/IR/GlobalValue.h"		#include "llvm/IR/GlobalValue.h"

namespace llvm {		namespace llvm {
class Function;		class Function;
class GlobalVariable;		class GlobalVariable;
}		}

▲ Show 20 Lines • Show All 63 Lines • ▼ Show 20 Lines	public:
/// Finalize generated LLVM module. Returns a module constructor function		/// Finalize generated LLVM module. Returns a module constructor function
/// to be added or a null pointer.		/// to be added or a null pointer.
virtual llvm::Function *finalizeModule() = 0;		virtual llvm::Function *finalizeModule() = 0;

/// Returns function or variable name on device side even if the current		/// Returns function or variable name on device side even if the current
/// compilation is for host.		/// compilation is for host.
virtual std::string getDeviceSideName(const NamedDecl *ND) = 0;		virtual std::string getDeviceSideName(const NamedDecl *ND) = 0;

		/// Get kernel handle by stub function.
		virtual llvm::GlobalValue *getKernelHandle(llvm::Function *Stub,
		GlobalDecl GD) = 0;

		/// Get kernel stub by kernel handle.
		virtual llvm::Function *getKernelStub(llvm::GlobalValue *Handle) = 0;

/// Adjust linkage of shadow variables in host compilation.		/// Adjust linkage of shadow variables in host compilation.
virtual void		virtual void
internalizeDeviceSideVar(const VarDecl *D,		internalizeDeviceSideVar(const VarDecl *D,
llvm::GlobalValue::LinkageTypes &Linkage) = 0;		llvm::GlobalValue::LinkageTypes &Linkage) = 0;
};		};

/// Creates an instance of a CUDA runtime class.		/// Creates an instance of a CUDA runtime class.
CGCUDARuntime *CreateNVCUDARuntime(CodeGenModule &CGM);		CGCUDARuntime *CreateNVCUDARuntime(CodeGenModule &CGM);

}		}
}		}

#endif		#endif

clang/lib/CodeGen/CGExpr.cpp

//===--- CGExpr.cpp - Emit LLVM Code from Expressions ---------------------===//		//===--- CGExpr.cpp - Emit LLVM Code from Expressions ---------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This contains code to emit Expr nodes as LLVM code.		// This contains code to emit Expr nodes as LLVM code.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include "CGCUDARuntime.h"
#include "CGCXXABI.h"		#include "CGCXXABI.h"
#include "CGCall.h"		#include "CGCall.h"
#include "CGCleanup.h"		#include "CGCleanup.h"
#include "CGDebugInfo.h"		#include "CGDebugInfo.h"
#include "CGObjCRuntime.h"		#include "CGObjCRuntime.h"
#include "CGOpenMPRuntime.h"		#include "CGOpenMPRuntime.h"
#include "CGRecordLayout.h"		#include "CGRecordLayout.h"
#include "CodeGenFunction.h"		#include "CodeGenFunction.h"
▲ Show 20 Lines • Show All 4,845 Lines • ▼ Show 20 Lines	if (auto builtinID = FD->getBuiltinID()) {
// we are in the builtin implementation itself, don't call the actual		// we are in the builtin implementation itself, don't call the actual
// builtin. If we are in the builtin implementation, avoid trivial infinite		// builtin. If we are in the builtin implementation, avoid trivial infinite
// recursion.		// recursion.
if (!FD->isInlineBuiltinDeclaration() ||		if (!FD->isInlineBuiltinDeclaration() ||
CGF.CurFn->getName() == FD->getName())		CGF.CurFn->getName() == FD->getName())
return CGCallee::forBuiltin(builtinID, FD);		return CGCallee::forBuiltin(builtinID, FD);
}		}

llvm::Constant *calleePtr = EmitFunctionDeclPointer(CGF.CGM, GD);		llvm::Constant *CalleePtr = EmitFunctionDeclPointer(CGF.CGM, GD);
return CGCallee::forDirect(calleePtr, GD);		if (CGF.CGM.getLangOpts().CUDA && !CGF.CGM.getLangOpts().CUDAIsDevice &&
		FD->hasAttr<CUDAGlobalAttr>())
		CalleePtr = CGF.CGM.getCUDARuntime().getKernelStub(
		cast<llvm::GlobalValue>(CalleePtr->stripPointerCasts()));
		return CGCallee::forDirect(CalleePtr, GD);
}		}

CGCallee CodeGenFunction::EmitCallee(const Expr *E) {		CGCallee CodeGenFunction::EmitCallee(const Expr *E) {
E = E->IgnoreParens();		E = E->IgnoreParens();

// Look through function-to-pointer decay.		// Look through function-to-pointer decay.
if (auto ICE = dyn_cast<ImplicitCastExpr>(E)) {		if (auto ICE = dyn_cast<ImplicitCastExpr>(E)) {
if (ICE->getCastKind() == CK_FunctionToPointerDecay ||		if (ICE->getCastKind() == CK_FunctionToPointerDecay ||
▲ Show 20 Lines • Show All 377 Lines • ▼ Show 20 Lines	if (isa<FunctionNoProtoType>(FnType) || Chain) {
int AS = Callee.getFunctionPointer()->getType()->getPointerAddressSpace();		int AS = Callee.getFunctionPointer()->getType()->getPointerAddressSpace();
CalleeTy = CalleeTy->getPointerTo(AS);		CalleeTy = CalleeTy->getPointerTo(AS);

llvm::Value *CalleePtr = Callee.getFunctionPointer();		llvm::Value *CalleePtr = Callee.getFunctionPointer();
CalleePtr = Builder.CreateBitCast(CalleePtr, CalleeTy, "callee.knr.cast");		CalleePtr = Builder.CreateBitCast(CalleePtr, CalleeTy, "callee.knr.cast");
Callee.setFunctionPointer(CalleePtr);		Callee.setFunctionPointer(CalleePtr);
}		}

		// HIP function pointer contains kernel handle when it is used in triple
		// chevron. The kernel stub needs to be loaded from kernel handle and used
		// as callee.
		if (CGM.getLangOpts().HIP && !CGM.getLangOpts().CUDAIsDevice &&
		isa<CUDAKernelCallExpr>(E) &&
		(!TargetDecl || !isa<FunctionDecl>(TargetDecl))) {
		llvm::Value *Handle = Callee.getFunctionPointer();
		Handle->dump();
		auto *Cast =
		Builder.CreateBitCast(Handle, Handle->getType()->getPointerTo());
		auto *Stub = Builder.CreateLoad(Address(Cast, CGM.getPointerAlign()));
		Callee.setFunctionPointer(Stub);
		}
llvm::CallBase *CallOrInvoke = nullptr;		llvm::CallBase *CallOrInvoke = nullptr;
RValue Call = EmitCall(FnInfo, Callee, ReturnValue, Args, &CallOrInvoke,		RValue Call = EmitCall(FnInfo, Callee, ReturnValue, Args, &CallOrInvoke,
E->getExprLoc());		E->getExprLoc());

// Generate function declaration DISuprogram in order to be used		// Generate function declaration DISuprogram in order to be used
// in debug info about call sites.		// in debug info about call sites.
if (CGDebugInfo *DI = getDebugInfo()) {		if (CGDebugInfo *DI = getDebugInfo()) {
if (auto *CalleeDecl = dyn_cast_or_null<FunctionDecl>(TargetDecl))		if (auto *CalleeDecl = dyn_cast_or_null<FunctionDecl>(TargetDecl))
▲ Show 20 Lines • Show All 143 Lines • Show Last 20 Lines

clang/lib/CodeGen/CodeGenModule.cpp

Show First 20 Lines • Show All 3,565 Lines • ▼ Show 20 Lines	llvm::Constant *CodeGenModule::GetAddrOfFunction(GlobalDecl GD,
if (const auto *DD = dyn_cast<CXXDestructorDecl>(GD.getDecl())) {		if (const auto *DD = dyn_cast<CXXDestructorDecl>(GD.getDecl())) {
if (getTarget().getCXXABI().isMicrosoft() &&		if (getTarget().getCXXABI().isMicrosoft() &&
GD.getDtorType() == Dtor_Complete &&		GD.getDtorType() == Dtor_Complete &&
DD->getParent()->getNumVBases() == 0)		DD->getParent()->getNumVBases() == 0)
GD = GlobalDecl(DD, Dtor_Base);		GD = GlobalDecl(DD, Dtor_Base);
}		}

StringRef MangledName = getMangledName(GD);		StringRef MangledName = getMangledName(GD);
return GetOrCreateLLVMFunction(MangledName, Ty, GD, ForVTable, DontDefer,		auto *F = GetOrCreateLLVMFunction(MangledName, Ty, GD, ForVTable, DontDefer,
/*IsThunk=*/false, llvm::AttributeList(),		/*IsThunk=*/false, llvm::AttributeList(),
IsForDefinition);		IsForDefinition);
		// Returns kernel handle for HIP kernel stub function.
		if (LangOpts.CUDA && !LangOpts.CUDAIsDevice &&
		cast<FunctionDecl>(GD.getDecl())->hasAttr<CUDAGlobalAttr>()) {
		auto *Handle = getCUDARuntime().getKernelHandle(
		cast<llvm::Function>(F->stripPointerCasts()), GD);
		if (IsForDefinition)
		return F;
		return llvm::ConstantExpr::getBitCast(Handle, Ty->getPointerTo());
		}
		return F;
}		}

static const FunctionDecl *		static const FunctionDecl *
GetRuntimeFunctionDecl(ASTContext &C, StringRef Name) {		GetRuntimeFunctionDecl(ASTContext &C, StringRef Name) {
TranslationUnitDecl *TUDecl = C.getTranslationUnitDecl();		TranslationUnitDecl *TUDecl = C.getTranslationUnitDecl();
DeclContext *DC = TranslationUnitDecl::castToDeclContext(TUDecl);		DeclContext *DC = TranslationUnitDecl::castToDeclContext(TUDecl);

IdentifierInfo &CII = C.Idents.get(Name);		IdentifierInfo &CII = C.Idents.get(Name);
▲ Show 20 Lines • Show All 2,697 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/Inputs/cuda.h

	/* Minimal declarations for CUDA support. Testing purposes only. */			/* Minimal declarations for CUDA support. Testing purposes only. */

	#include <stddef.h>			#include <stddef.h>

				#if __HIP__ || __CUDA__
	#define __constant__ __attribute__((constant))			#define __constant__ __attribute__((constant))
	#define __device__ __attribute__((device))			#define __device__ __attribute__((device))
	#define __global__ __attribute__((global))			#define __global__ __attribute__((global))
	#define __host__ __attribute__((host))			#define __host__ __attribute__((host))
	#define __shared__ __attribute__((shared))			#define __shared__ __attribute__((shared))
	#if __HIP__			#if __HIP__
	#define __managed__ __attribute__((managed))			#define __managed__ __attribute__((managed))
	#endif			#endif
	#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))			#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))
				#else
				#define __constant__
				#define __device__
				#define __global__
				#define __host__
				#define __shared__
				#define __managed__
				#define __launch_bounds__(...)
				#endif

	struct dim3 {			struct dim3 {
	unsigned x, y, z;			unsigned x, y, z;
	__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}			__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
	};			};

	#ifdef __HIP__			#if __HIP__ || HIP_PLATFORM
	typedef struct hipStream *hipStream_t;			typedef struct hipStream *hipStream_t;
	typedef enum hipError {} hipError_t;			typedef enum hipError {} hipError_t;
	int hipConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,			int hipConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,
	hipStream_t stream = 0);			hipStream_t stream = 0);
	extern "C" hipError_t __hipPushCallConfiguration(dim3 gridSize, dim3 blockSize,			extern "C" hipError_t __hipPushCallConfiguration(dim3 gridSize, dim3 blockSize,
	size_t sharedSize = 0,			size_t sharedSize = 0,
	hipStream_t stream = 0);			hipStream_t stream = 0);
	extern "C" hipError_t hipLaunchKernel(const void *func, dim3 gridDim,			extern "C" hipError_t hipLaunchKernel(const void *func, dim3 gridDim,
	Show All 18 Lines

clang/test/CodeGenCUDA/cxx-call-kernel.cpp

This file was added.

				// RUN: %clang_cc1 -x hip -emit-llvm-bc %s -o %t.hip.bc
				// RUN: %clang_cc1 -mlink-bitcode-file %t.hip.bc -DHIP_PLATFORM -emit-llvm \
				// RUN: %s -o - | FileCheck %s

				#include "Inputs/cuda.h"

				// CHECK: @_Z2g1i = constant void (i32)* @_Z17__device_stub__g1i, align 8
				#if __HIP__
				__global__ void g1(int x) {}
				#else
				extern void g1(int x);

				// CHECK: call i32 @hipLaunchKernel{{.*}}@_Z2g1i
				void test() {
				hipLaunchKernel((void*)g1, 1, 1, nullptr, 0, 0);
				}

				// CHECK: __hipRegisterFunction{{.*}}@_Z2g1i
				#endif

clang/test/CodeGenCUDA/kernel-dbg-info.cu

	Show All 24 Lines
	// RUN: -fcuda-is-device | FileCheck -check-prefix=DEV %s			// RUN: -fcuda-is-device | FileCheck -check-prefix=DEV %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	extern "C" __global__ void ckernel(int *a) {			extern "C" __global__ void ckernel(int *a) {
	*a = 1;			*a = 1;
	}			}

				// Kernel symbol for launching kernel.
				// CHECK: @[[SYM:ckernel]] = constant void (i32*)* @__device_stub__ckernel, align 8

	// Device side kernel names			// Device side kernel names
	// CHECK: @[[CKERN:[0-9]*]] = {{.*}} c"ckernel\00"			// CHECK: @[[CKERN:[0-9]*]] = {{.*}} c"ckernel\00"

	// DEV: define {{.*}}@ckernel{{.*}}!dbg			// DEV: define {{.*}}@ckernel{{.*}}!dbg
	// DEV: store {{.*}}!dbg			// DEV: store {{.*}}!dbg
	// DEV: ret {{.*}}!dbg			// DEV: ret {{.*}}!dbg

	// Make sure there is no !dbg between function attributes and '{'			// Make sure there is no !dbg between function attributes and '{'
	// CHECK: define{{.*}} void @[[CSTUB:__device_stub__ckernel]]{{.*}} #{{[0-9]+}} {			// CHECK: define{{.*}} void @[[CSTUB:__device_stub__ckernel]]{{.*}} #{{[0-9]+}} {
	// CHECK-NOT: call {{.*}}@hipLaunchByPtr{{.*}}!dbg			// CHECK-NOT: call {{.*}}@hipLaunchByPtr{{.*}}!dbg
	// CHECK: call {{.*}}@hipLaunchByPtr{{.*}}@[[CSTUB]]			// CHECK: call {{.*}}@hipLaunchByPtr{{.*}}@[[SYM]]
	// CHECK-NOT: ret {{.*}}!dbg			// CHECK-NOT: ret {{.*}}!dbg

	// CHECK-LABEL: define {{.*}}@_Z8hostfuncPi{{.*}}!dbg			// CHECK-LABEL: define {{.*}}@_Z8hostfuncPi{{.*}}!dbg
	// O0: call void @[[CSTUB]]{{.*}}!dbg			// O0: call void @[[CSTUB]]{{.*}}!dbg
	void hostfunc(int *a) {			void hostfunc(int *a) {
	ckernel<<<1, 1>>>(a);			ckernel<<<1, 1>>>(a);
	}			}

clang/test/CodeGenCUDA/kernel-stub-name.cu

	// RUN: echo "GPU binary would be here" > %t			// RUN: echo "GPU binary would be here" > %t

	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fcuda-include-gpubinary %t -o - -x hip\			// RUN: -fcuda-include-gpubinary %t -o - -x hip\
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=CHECK			// RUN: | FileCheck %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

				// Kernel handles

				// CHECK: @[[HCKERN:ckernel]] = constant void ()* @__device_stub__ckernel, align 8
				// CHECK: @[[HNSKERN:_ZN2ns8nskernelEv]] = constant void ()* @_ZN2ns23__device_stub__nskernelEv, align 8
				// CHECK: @[[HTKERN:_Z10kernelfuncIiEvv]] = linkonce_odr constant void ()* @_Z25__device_stub__kernelfuncIiEvv, align 8
				// CHECK: @[[HDKERN:_Z11kernel_declv]] = external constant void ()*, align 8

	extern "C" __global__ void ckernel() {}			extern "C" __global__ void ckernel() {}

	namespace ns {			namespace ns {
	__global__ void nskernel() {}			__global__ void nskernel() {}
	} // namespace ns			} // namespace ns

	template<class T>			template<class T>
	__global__ void kernelfunc() {}			__global__ void kernelfunc() {}

	__global__ void kernel_decl();			__global__ void kernel_decl();

				void (*kernel_ptr)();
				void *void_ptr;

				void launch(void *kern);

	// Device side kernel names			// Device side kernel names

	// CHECK: @[[CKERN:[0-9]*]] = {{.*}} c"ckernel\00"			// CHECK: @[[CKERN:[0-9]*]] = {{.*}} c"ckernel\00"
	// CHECK: @[[NSKERN:[0-9]*]] = {{.*}} c"_ZN2ns8nskernelEv\00"			// CHECK: @[[NSKERN:[0-9]*]] = {{.*}} c"_ZN2ns8nskernelEv\00"
	// CHECK: @[[TKERN:[0-9]*]] = {{.*}} c"_Z10kernelfuncIiEvv\00"			// CHECK: @[[TKERN:[0-9]*]] = {{.*}} c"_Z10kernelfuncIiEvv\00"

	// Non-template kernel stub functions			// Non-template kernel stub functions

	// CHECK: define{{.*}}@[[CSTUB:__device_stub__ckernel]]			// CHECK: define{{.*}}@[[CSTUB:__device_stub__ckernel]]
	// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[CSTUB]]			// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[HCKERN]]
	// CHECK: define{{.*}}@[[NSSTUB:_ZN2ns23__device_stub__nskernelEv]]			// CHECK: define{{.*}}@[[NSSTUB:_ZN2ns23__device_stub__nskernelEv]]
	// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[NSSTUB]]			// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[HNSKERN]]


	// CHECK-LABEL: define{{.*}}@_Z8hostfuncv()			// Check kernel stub is used for triple chevron

				// CHECK-LABEL: define{{.*}}@_Z4fun1v()
	// CHECK: call void @[[CSTUB]]()			// CHECK: call void @[[CSTUB]]()
	// CHECK: call void @[[NSSTUB]]()			// CHECK: call void @[[NSSTUB]]()
	// CHECK: call void @[[TSTUB:_Z25__device_stub__kernelfuncIiEvv]]()			// CHECK: call void @[[TSTUB:_Z25__device_stub__kernelfuncIiEvv]]()
	// CHECK: call void @[[DSTUB:_Z26__device_stub__kernel_declv]]()			// CHECK: call void @[[DSTUB:_Z26__device_stub__kernel_declv]]()
	void hostfunc(void) {
				void fun1(void) {
	ckernel<<<1, 1>>>();			ckernel<<<1, 1>>>();
	ns::nskernel<<<1, 1>>>();			ns::nskernel<<<1, 1>>>();
	kernelfunc<int><<<1, 1>>>();			kernelfunc<int><<<1, 1>>>();
	kernel_decl<<<1, 1>>>();			kernel_decl<<<1, 1>>>();
	}			}

	// Template kernel stub functions			// Template kernel stub functions

	// CHECK: define{{.*}}@[[TSTUB]]			// CHECK: define{{.*}}@[[TSTUB]]
	// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[TSTUB]]			// CHECK: call{{.*}}@hipLaunchByPtr{{.*}}@[[HTKERN]]

				// Check declaration of stub function for external kernel.

	// CHECK: declare{{.*}}@[[DSTUB]]			// CHECK: declare{{.*}}@[[DSTUB]]

				// Check kernel handle is used for passing the kernel as a function pointer

				// CHECK-LABEL: define{{.*}}@_Z4fun2v()
				// CHECK: call void @_Z6launchPv({{.*}}[[HCKERN]]
				// CHECK: call void @_Z6launchPv({{.*}}[[HNSKERN]]
				// CHECK: call void @_Z6launchPv({{.*}}[[HTKERN]]
				// CHECK: call void @_Z6launchPv({{.*}}[[HDKERN]]
				void fun2() {
				launch((void *)ckernel);
				launch((void *)ns::nskernel);
				launch((void *)kernelfunc<int>);
				launch((void *)kernel_decl);
				}

				// Check kernel handle is used for assigning a kernel to a function pointer

				// CHECK-LABEL: define{{.*}}@_Z4fun3v()
				// CHECK: store void ()* bitcast (void ()** @[[HCKERN]] to void ()*), void ()** @kernel_ptr, align 8
				// CHECK: store void ()* bitcast (void ()** @[[HCKERN]] to void ()*), void ()** @kernel_ptr, align 8
				// CHECK: store i8* bitcast (void ()** @[[HCKERN]] to i8*), i8** @void_ptr, align 8
				// CHECK: store i8* bitcast (void ()** @[[HCKERN]] to i8*), i8** @void_ptr, align 8
				void fun3() {
				kernel_ptr = ckernel;
				kernel_ptr = &ckernel;
				void_ptr = (void *)ckernel;
				void_ptr = (void *)&ckernel;
				}

				// Check kernel stub is loaded from kernel handle when function pointer is
				// used with triple chevron

				// CHECK-LABEL: define{{.*}}@_Z4fun4v()
				// CHECK: store void ()* bitcast (void ()** @[[HCKERN]] to void ()*), void ()** @kernel_ptr
				// CHECK: call i32 @_Z16hipConfigureCall4dim3S_mP9hipStream
				// CHECK: %[[HANDLE:.*]] = load void ()*, void ()** @kernel_ptr, align 8
				// CHECK: %[[CAST:.*]] = bitcast void ()* %[[HANDLE]] to void ()**
				// CHECK: %[[STUB:.*]] = load void ()*, void ()** %[[CAST]], align 8
				// CHECK: call void %[[STUB]]()
				void fun4() {
				kernel_ptr = ckernel;
				kernel_ptr<<<1,1>>>();
				}

				// Check kernel handle is passed to a function

				// CHECK-LABEL: define{{.*}}@_Z4fun5v()
				// CHECK: store void ()* bitcast (void ()** @[[HCKERN]] to void ()*), void ()** @kernel_ptr
				// CHECK: %[[HANDLE:.*]] = load void ()*, void ()** @kernel_ptr, align 8
				// CHECK: %[[CAST:.*]] = bitcast void ()* %[[HANDLE]] to i8*
				// CHECK: call void @_Z6launchPv(i8* %[[CAST]])
				void fun5() {
				kernel_ptr = ckernel;
				launch((void *)kernel_ptr);
				}

	// CHECK-LABEL: define{{.*}}@__hip_register_globals			// CHECK-LABEL: define{{.*}}@__hip_register_globals
	// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[CSTUB]]{{.*}}@[[CKERN]]			// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[HCKERN]]{{.*}}@[[CKERN]]
	// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[NSSTUB]]{{.*}}@[[NSKERN]]			// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[HNSKERN]]{{.*}}@[[NSKERN]]
	// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[TSTUB]]{{.*}}@[[TKERN]]			// CHECK: call{{.*}}@__hipRegisterFunction{{.*}}@[[HTKERN]]{{.*}}@[[TKERN]]
				// CHECK-NOT: call{{.*}}@__hipRegisterFunction{{.*}}@[[HDKERN]]{{.*}}@[[DKERN]]

clang/test/CodeGenCUDA/unnamed-types.cu

[](float *p) {		[](float *p) {
k0<<<1,1>>>(p, [] __device__ (float x) { return x + 3.f; });		k0<<<1,1>>>(p, [] __device__ (float x) { return x + 3.f; });
}(p);		}(p);
k1<<<1,1>>>(p,		k1<<<1,1>>>(p,
[] __device__ (float x) { return x + 4.f; },		[] __device__ (float x) { return x + 4.f; },
[] __device__ (float x, float y) { return x * y; },		[] __device__ (float x, float y) { return x * y; },
[] __device__ (float x) { return x + 5.f; });		[] __device__ (float x) { return x + 5.f; });
}		}
// HOST: @__hip_register_globals		// HOST: @__hip_register_globals
// HOST: __hipRegisterFunction{{.*}}@_Z17__device_stub__k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_{{.*}}@0		// HOST: __hipRegisterFunction{{.*}}@_Z2k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_{{.*}}@0
// HOST: __hipRegisterFunction{{.*}}@_Z17__device_stub__k1IZ2f1PfEUlfE_Z2f1S0_EUlffE_Z2f1S0_EUlfE0_EvS0_T_T0_T1_{{.*}}@1		// HOST: __hipRegisterFunction{{.*}}@_Z2k1IZ2f1PfEUlfE_Z2f1S0_EUlffE_Z2f1S0_EUlfE0_EvS0_T_T0_T1_{{.*}}@1
// MSVC: __hipRegisterFunction{{.*}}@"??$k0@V<lambda_1>@?0???R1?0??f1@@YAXPEAM@Z@QEBA@0@Z@@@YAXPEAMV<lambda_1>@?0???R0?0??f1@@YAX0@Z@QEBA@0@Z@@Z{{.*}}@0		// MSVC: __hipRegisterFunction{{.*}}@"??$k0@V<lambda_1>@?0???R1?0??f1@@YAXPEAM@Z@QEBA@0@Z@@@YAXPEAMV<lambda_1>@?0???R0?0??f1@@YAX0@Z@QEBA@0@Z@@Z{{.*}}@0
// MSVC: __hipRegisterFunction{{.*}}@"??$k1@V<lambda_2>@?0??f1@@YAXPEAM@Z@V<lambda_3>@?0??2@YAX0@Z@V<lambda_4>@?0??2@YAX0@Z@@@YAXPEAMV<lambda_2>@?0??f1@@YAX0@Z@V<lambda_3>@?0??1@YAX0@Z@V<lambda_4>@?0??1@YAX0@Z@@Z{{.*}}@1		// MSVC: __hipRegisterFunction{{.*}}@"??$k1@V<lambda_2>@?0??f1@@YAXPEAM@Z@V<lambda_3>@?0??2@YAX0@Z@V<lambda_4>@?0??2@YAX0@Z@@@YAXPEAMV<lambda_2>@?0??f1@@YAX0@Z@V<lambda_3>@?0??1@YAX0@Z@V<lambda_4>@?0??1@YAX0@Z@@Z{{.*}}@1