This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Emit kernel symbol
ClosedPublic

Authored by yaxunl on Aug 21 2020, 3:03 PM.

Details

Summary

Currently clang uses a stub function to launch kernels. This is inconvenient
for interoperating with C++ programs, since the stub function has a different
name than the kernel, which is required by the ROCm debugger.

This patch emits a variable symbol which has the same name as the kernel
and uses it to register and launch the kernel. This allows a C++ program to
launch a kernel by using the original kernel name.
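As a rough sketch of the host-side symbols for a kernel __global__ void foo(int*) (names are illustrative; this is not the exact code clang generates):

void __device_stub_foo(int*);                             // stub: sets up and performs the actual launch
void* foo = reinterpret_cast<void*>(__device_stub_foo);   // kernel symbol: carries the kernel's own (mangled) name,
                                                           // is registered with the runtime, and points at the stub

A C++ translation unit can then declare foo and pass it directly to the launch API.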

Diff Detail

Event Timeline

yaxunl requested review of this revision. Aug 21 2020, 3:03 PM
yaxunl created this revision.
yaxunl edited the summary of this revision. Aug 21 2020, 3:10 PM
tra added a comment. Aug 24 2020, 11:08 AM

How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but queueing kernels fast does not help all that much in the long run -- you eventually have to run those kernels on the GPU, so in most cases you just spend a bit more time idling while waiting for the queued kernels to finish. To be beneficial, you'll need a finely balanced CPU/GPU workload, and that's rather hard to achieve. Not to the point where the minor savings here would be meaningful. I would assume the situation on AMD GPUs is not that different.

One side effect of this patch is that there will be no convenient way to set a host-side breakpoint on kernel launch.
Another will be that examining the call stack will become somewhat confusing, as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

If the patch does give measurable performance improvement, can we implement launch config push/pop in a way that compiler can eliminate by itself when it's possible and keep the stub as the host-side kernel entry point? I would prefer to avoid sacrificing debugging usability for performance optimizations that may not matter.

In D86376#2234259, @tra wrote:

How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but queueing kernels fast does not help all that much in the long run -- you eventually have to run those kernels on the GPU, so in most cases you just spend a bit more time idling while waiting for the queued kernels to finish. To be beneficial, you'll need a finely balanced CPU/GPU workload, and that's rather hard to achieve. Not to the point where the minor savings here would be meaningful. I would assume the situation on AMD GPUs is not that different.

`hipPushConfiguration/hipPopConfiguration` and the kernel stub can cause 40 ns of overhead, whereas we have requests to squeeze out any overhead in kernel launching latency.

One side effect of this patch is that there will be no convenient way to set a host-side breakpoint on kernel launch.
Another will be that examining the call stack will become somewhat confusing, as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating the kernel stub does not affect debuggability negatively. At least this is true for the HIP debugger. Actually, our debugger team intentionally requests to eliminate any debug information for the kernel stub so that it will not confuse the debugger with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel, which is in the device binary. For the HIP debugger (rocgdb), when the user sets a breakpoint on a kernel, it will break on the real kernel in the device binary, and the call stack is displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in the device binary.

Another motivation for eliminating the kernel stub is to be able to emit a symbol with the same mangled name as the kernel, as a global variable instead of a function, since we need such symbols to be able to launch kernels by their mangled names from a C++ program. If we use the kernel stub as the symbol, we cannot use the original mangled kernel name, since our debugger does not allow that.

tra added a comment. Aug 24 2020, 1:40 PM

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

`hipPushConfiguration/hipPopConfiguration` and the kernel stub can cause 40 ns of overhead, whereas we have requests to squeeze out any overhead in kernel launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well intentioned ones) are worth implementing.
This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

One side effect of this patch is that there will be no convenient way to set a host-side breakpoint on kernel launch.
Another will be that examining the call stack will become somewhat confusing, as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating the kernel stub does not affect debuggability negatively. At least this is true for the HIP debugger. Actually, our debugger team intentionally requests to eliminate any debug information for the kernel stub so that it will not confuse the debugger with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel, which is in the device binary. For the HIP debugger (rocgdb), when the user sets a breakpoint on a kernel, it will break on the real kernel in the device binary, and the call stack is displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in the device binary.

You appear to assume debuggability with HIP-aware debugger. That part I'm not particularly concerned about as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with the ordinary gdb. Imagine someone having to debug a TF app they've got somewhere. The end user may not even have HIP tools installed. It would be useful to be able to debug until the point where control is passed to the GPU. The patch will likely have a minor, but still negative impact on that.

I guess one should still be able to set a breakpoint using the file:line number. If you could verify that it still works with gdb, that would be a reasonable workaround. I think we still need to have some way to set a breakpoint on the kernel launch site (I think it should still work) and on the kernel entry.

So, we have a trade-off of minor performance gain vs a minor debuggability regression. I don't have strong opinions which is the best way to go. By default, with no demonstrated benefit, I'd err on the side of not changing things.

Another motivation for eliminating the kernel stub is to be able to emit a symbol with the same mangled name as the kernel, as a global variable instead of a function, since we need such symbols to be able to launch kernels by their mangled names from a C++ program. If we use the kernel stub as the symbol, we cannot use the original mangled kernel name, since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

In D86376#2234547, @tra wrote:

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

`hipPushConfiguration/hipPopConfiguration` and the kernel stub can cause 40 ns of overhead, whereas we have requests to squeeze out any overhead in kernel launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well intentioned ones) are worth implementing.
This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

One side effect of this patch is that there will be no convenient way to set a host-side breakpoint on kernel launch.
Another will be that examining the call stack will become somewhat confusing, as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split config setup/kernel launch in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

Eliminating the kernel stub does not affect debuggability negatively. At least this is true for the HIP debugger. Actually, our debugger team intentionally requests to eliminate any debug information for the kernel stub so that it will not confuse the debugger with the real kernel. This is because the kernel stub is an artificial function for launching the kernel, not the real kernel, which is in the device binary. For the HIP debugger (rocgdb), when the user sets a breakpoint on a kernel, it will break on the real kernel in the device binary, and the call stack is displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in the device binary.

You appear to assume debuggability with HIP-aware debugger. That part I'm not particularly concerned about as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with the ordinary gdb. Imagine someone having to debug a TF app they've got somewhere. The end user may not even have HIP tools installed. It would be useful to be able to debug until the point where control is passed to the GPU. The patch will likely have a minor, but still negative impact on that.

I guess one should still be able to set a breakpoint using the file:line number. If you could verify that it still works with gdb, that would be a reasonable workaround. I think we still need to have some way to set a breakpoint on the kernel launch site (I think it should still work) and on the kernel entry.

To run HIP applications, users need to install ROCm, which includes rocgdb. A debugger without device code debugging capability has little use with HIP applications, therefore I would expect users to always use rocgdb to debug HIP programs. Also, since clang already removed all debug information for the kernel stub, gdb cannot break on the kernel stub anyway.

Another motivation for eliminating the kernel stub is to be able to emit a symbol with the same mangled name as the kernel, as a global variable instead of a function, since we need such symbols to be able to launch kernels by their mangled names from a C++ program. If we use the kernel stub as the symbol, we cannot use the original mangled kernel name, since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

I would like to say the motivation of this change is twofold: 1. improve latency; 2. interoperability with C++ programs.

tra added a comment. Aug 24 2020, 3:49 PM

This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

60 *microseconds* to store/load something from memory? It does not sound right. 0.5 millisecond per kernel launch is also suspiciously high.
For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop a O(cacheline) worth of launch config data, the implementation may be doing something wrong. We're talking about O(100) syscalls and that's way too much work for something that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

To run HIP applications, users need to install ROCm, which includes rocgdb.

I would disagree with that assertion. I do very much want to build a Tensorflow-based app and run it in a container with nothing else but the app and I do want to use existing infrastructure to capture relevant info if the app crashes. Such capture will not be using any HIP-specific tools.
Or I could give it to a user who absolutely does not care what's inside the executable, but who may want to run it under gdb if something goes wrong.

A debugger without device code debugging capability has little use with HIP applications, therefore I would expect users to always use rocgdb to debug HIP programs.

I agree that it's indeed the case if someone wants/needs to debug GPU code, however, in many cases it's sufficient to be able to debug host-side things only. And it is useful to see the point where we launch kernels and be able to tell which kernel it was.

Also, since clang already removed all debug information for the kernel stub, gdb cannot break on the kernel stub anyway.

gdb is aware of the ELF symbols, and those are often exposed in shared libraries. While you will not have type info, etc., you can still set a breakpoint and get a sensible stack trace in many cases. We usually build with some amount of debug info, and it did prove rather helpful to pinpoint GPU failures via the host-side stack trace, as it included the symbol name of the host-side stub, which allows identifying the device-side kernel. If all we see in the stack trace is hipLaunchKernel, it would be considerably less helpful, especially when there's no detailed debug info which would allow us to dig out the kernel name from its arguments. All we'd know is that we've launched *some* kernel.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.

I would like to say the motivation of this change is twofold: 1. improve latency; 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In D86376#2234824, @tra wrote:

This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to __hipPushConfigure/__hipPopConfigure is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

60 *microseconds* to store/load something from memory? It does not sound right. 0.5 millisecond per kernel launch is also suspiciously high.
For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop a O(cacheline) worth of launch config data, the implementation may be doing something wrong. We're talking about O(100) syscalls and that's way too much work for something that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

My previous measurements did not include warm-up, which caused some one-time overhead due to device initialization and loading of the device binary. With warm-up, the call of __hipPushCallConfigure/__hipPopCallConfigure takes about 19 us. Based on the trace from rocprofiler, the time spent inside these functions can be ignored. Most of the time is spent making the calls. These functions stay in a shared library, which may be the reason why they take such a long time. Making them always_inline may get rid of the overhead; however, that would require exposing internal data structures.

The kernel launching latency is measured by a simple loop in which a simple kernel is launched and then hipStreamSynchronize is called. The trace is collected by rocprofiler, and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The reduction of 31 us is more than 19 us since it also eliminates the overhead of the kernel stub.

I would like to say the motivation of this change is twofold: 1. improve latency; 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In a HIP program, a global symbol is generated in the host binary to identify each kernel. This symbol is associated with the device kernel by a call to hipRegisterFunction in the init functions. Each time the kernel needs to be called, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel. Let's call it the kernel symbol. Currently it is the kernel stub function; however, it could be any global symbol: as long as it is registered with hipRegisterFunction, hipLaunchKernel can use it to find the right kernel and launch it.
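A rough sketch of that association (hypotheticalRegisterFunction is a made-up stand-in for the hipRegisterFunction call described above, whose actual signature is omitted here; everything else is illustrative):

#include <hip/hip_runtime.h>

struct KernelSymbol {};      // any unique host-side object can serve as the kernel symbol
KernelSymbol fooSymbol;      // today this role is played by the kernel stub function

// Hypothetical stand-in for the registration performed by generated init code.
void hypotheticalRegisterFunction(void* hostSymbol, const char* deviceKernelName);

void registerKernels() {
  hypotheticalRegisterFunction(&fooSymbol, "_Z3fooPi");  // tie the host symbol to the device kernel
}

void launchFoo(int* devPtr, hipStream_t stream) {
  void* args[] = {&devPtr};
  // The same host symbol is later passed to the launch API, which looks up the registered kernel.
  hipLaunchKernel(&fooSymbol, dim3(1), dim3(256), args, 0, stream);
}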

In a C/C++ program, a kernel is launched by a call to hipLaunchKernel with the kernel symbol. Since the kernel symbol is defined in object files generated from HIP. For a C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve it to the correct kernel symbol, and the correct kernel can be launched.

Here comes the nuance with kernel stub function as the kernel symbol. If you still remember, there was a previous patch for HIP to change the kernel stub name. rocgdb requires the device stub to have a different name than the real kernel, since otherwise it will not be able to break on the real kernel only. As a result, the kernel stub now has a prefix __device_stub_ before mangling.

For example, a kernel foo will have a kernel stub with name __device_stub_foo.

For a C/C++ program to call kernel foo, it needs to declare an external symbol __device_stub_foo and then launch it. Of course this is an annoyance for C/C++ users, especially since this involves mangled names.

However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.

Now the solution is to get rid of the kernel stub function. Instead of using the kernel stub function as the kernel symbol, we will emit a global variable as the kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.

tra added a comment. Aug 25 2020, 10:35 AM

My previous measurements did not include warm-up, which caused some one-time overhead due to device initialization and loading of the device binary. With warm-up, the call of __hipPushCallConfigure/__hipPopCallConfigure takes about 19 us. Based on the trace from rocprofiler, the time spent inside these functions can be ignored. Most of the time is spent making the calls. These functions stay in a shared library, which may be the reason why they take such a long time. Making them always_inline may get rid of the overhead; however, that would require exposing internal data structures.

It's still suspiciously high. AFAICT, config push/pop is just an std::vector push/pop. It should not take *that* long. A few function calls should not lead to microseconds of overhead once the linker has resolved the symbol, even if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

The kernel launching latency is measured by a simple loop in which a simple kernel is launched and then hipStreamSynchronize is called. The trace is collected by rocprofiler, and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The reduction of 31 us is more than 19 us since it also eliminates the overhead of the kernel stub.

This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

I would like to say the motivation of this change is twofold: 1. improve latency; 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In a HIP program, a global symbol is generated in the host binary to identify each kernel. This symbol is associated with the device kernel by a call to hipRegisterFunction in the init functions. Each time the kernel needs to be called, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel. Let's call it the kernel symbol. Currently it is the kernel stub function; however, it could be any global symbol: as long as it is registered with hipRegisterFunction, hipLaunchKernel can use it to find the right kernel and launch it.

So far so good, it matches the way CUDA does that.

In a C/C++ program, a kernel is launched by a call to hipLaunchKernel with the kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

Since the kernel symbol is defined in object files generated from HIP.
For a C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve it to the correct kernel symbol, and the correct kernel can be launched.

The first sentence looks incomplete. It seems to imply that hipLaunchKernel uses the device-side kernel symbol and it's the linker which ties host-side reference with device-side symbol. If that's the case, then I don't understand what purpose is served by hipRegisterFunction. AFAICT, it's not used in this scenario at all.

My mental model of kernel launch mechanics looks like this:

  • For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with the name 'foo' and device-side real kernel 'foo'.
  • host side linker has no access to device-side symbols, but we do need to associate host and device side 'foo' instances.
  • address of host-side foo is registered with runtime to map it to device symbol with the name 'foo'
  • when a kernel is launched, call site sets up launch config and calls the stub, passing it the kernel arguments.
  • the stub calls the kernel launch function, and passes host-side foo address to the kernel launch function
  • launch function finds device-side symbol name via the registration info and does device-side address lookup to obtain its device address
  • run device-side function.

In this scenario, the host-side stub for foo is a regular function, which gdb can stop on and examine kernel arguments.
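A rough sketch of what such a stub amounts to (hypothetical and simplified, using the HIP launch API for consistency with this discussion; the real generated stub pops the pushed launch configuration instead of hard-coding one):

#include <hip/hip_runtime.h>

// Hypothetical hand-written stand-in for the compiler-generated host-side stub of
// __global__ void foo(int* p). Its own address is what gets registered and passed around.
void foo(int* p) {
  void* args[] = {&p};
  // A real stub retrieves grid/block/shmem/stream from the previously pushed call
  // configuration; fixed values are used here only to keep the sketch self-contained.
  hipLaunchKernel(reinterpret_cast<void*>(&foo), dim3(1), dim3(1), args, 0, nullptr);
}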

How is the process different for HIP? I know that we've changed the stub name to avoid debugger confusion about which of the entities corresponds to 'foo'.

Here comes the nuance with kernel stub function as the kernel symbol. If you still remember, there was a previous patch for HIP to change the kernel stub name. rocgdb requires the device stub to have a different name than the real kernel, since otherwise it will not be able to break on the real kernel only. As a result, the kernel stub now has a prefix __device_stub_ before mangling.

For example, a kernel foo will have a kernel stub with name __device_stub_foo.

For a C/C++ program to call kernel foo, it needs to declare an external symbol __device_stub_foo and then launch it. Of course this is an annoyance for C/C++ users, especially since this involves mangled names.

It's all done by the compiler under the hood. I'm not sure how the stub name affects C/C++ users.

However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.
Now the solution is to get rid of the kernel stub function. Instead of using the kernel stub function as the kernel symbol, we will emit a global variable as the kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.

I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debuggability, which is worse, IMO.
Could you give me an example where the stub name causes problems?

In D86376#2236704, @tra wrote:

It's still suspiciously high. AFAICT, config push/pop is just an std::vector push/pop. It should not take *that* long. A few function calls should not lead to microseconds of overhead once the linker has resolved the symbol, even if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

You are right. The 19 us are mostly due to overhead from rocprofiler. If I do not use rocprofiler and use a simple loop to measure the execution time of __hipPushCallConfigure/__hipPopCallConfigure, I get 180 ns.

The kernel launching latency is measured by a simple loop in which a simple kernel is launched and then hipStreamSynchronize is called. The trace is collected by rocprofiler, and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The reduction of 31 us is more than 19 us since it also eliminates the overhead of the kernel stub.

This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

The 19 us overhead was due to rocprofiler. Without rocprofiler, I can only measure the average duration of a kernel launch together with hipStreamSynchronize. When the kernel is empty, it serves as an estimate of the kernel launching latency. With such a measurement, the latency is about 14.0 us. The improvement due to this patch is not significant.
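A minimal sketch of this kind of host-side measurement (assumed setup; separating out the device-side start time, as described earlier, would additionally require a rocprofiler trace):

#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

int main() {
  emptyKernel<<<dim3(1), dim3(1)>>>();      // warm-up: device init and module load
  hipStreamSynchronize(nullptr);

  constexpr int kIters = 10000;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    emptyKernel<<<dim3(1), dim3(1)>>>();
    hipStreamSynchronize(nullptr);          // wait for the empty kernel to complete
  }
  auto t1 = std::chrono::steady_clock::now();

  double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters;
  std::printf("avg launch + sync: %.1f us\n", us);
  return 0;
}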

In a C/C++ program, a kernel is launched by a call to hipLaunchKernel with the kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

I mean the host-side symbol. A host program can only use a host-side symbol to launch a kernel.

I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debuggability, which is worse, IMO.
Could you give me an example where the stub name causes problems?

For example, in a HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is

void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, currently the users have to use

void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);
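
For context, a fuller hypothetical example of the desired C++-side usage, assuming a linked HIP TU defines __global__ void foo(int*) and the kernel symbol carries the original name as this patch intends (buffer size and launch geometry are made up):

#include <hip/hip_runtime.h>

// Declared as an ordinary function; the linker binds _Z3fooPi to the kernel symbol
// emitted by the HIP TU, so no __device_stub_ prefix is needed.
void foo(int*);

int main() {
  int* devBuf = nullptr;
  hipMalloc(reinterpret_cast<void**>(&devBuf), 256 * sizeof(int));

  void* args[] = {&devBuf};   // kernel arguments are passed by address
  hipLaunchKernel(reinterpret_cast<void*>(foo), dim3(1), dim3(256), args,
                  /*sharedMemBytes=*/0, /*stream=*/nullptr);

  hipStreamSynchronize(nullptr);
  hipFree(devBuf);
  return 0;
}
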
yaxunl retitled this revision from [HIP] Improve kernel launching latency to [HIP] Simplify kernel launching. Aug 26 2020, 9:29 AM
tra added a comment. Aug 26 2020, 10:34 AM

For example, in a HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is

void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, currently the users have to use

void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);

Ah. That *is* painful. Perhaps we can have the cake and eat it here and do something like this:

Do generate a variable with the kernel name and use it for hipLaunchKernel(), but also keep the stub and call it for <<<>>> launches, only instead of registering the stub itself as the GPU-side kernel identifier, register the variable.

This way, __device_stub_<kernel> will show up in the stack trace (no debuggability regression), but direct calls to hipLaunchKernel can use the unprefixed kernel name.

WDYT?

yaxunl added a comment. Feb 8 2021, 9:07 AM
In D86376#2239618, @tra wrote:

For example, in a HIP program, there is a kernel void foo(int*). If a C++ program wants to launch it, the desirable way is

void foo(int*);
hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, currently the users have to use

void __device_stub_foo(int*);
hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);

Ah. That *is* painful. Perhaps we can have the cake and eat it here and do something like this:

Do generate a variable with the kernel name and use it for hipLaunchKernel(), but also keep the stub and call it for <<<>>> launches, only instead of registering the stub itself as the GPU-side kernel identifier, register the variable.

This way, __device_stub_<kernel> will show up in the stack trace (no debuggability regression), but direct calls to hipLaunchKernel can use the unprefixed kernel name.

WDYT?

Yes that should work. Will do.

yaxunl updated this revision to Diff 322213. Feb 8 2021, 2:10 PM
yaxunl retitled this revision from [HIP] Simplify kernel launching to [HIP] Emit kernel symbol.
yaxunl edited the summary of this revision.

Revised per Artem's comments.

yaxunl added a comment. Feb 9 2021, 7:14 AM

Actually there is one issue with this approach.

HIP has APIs to launch kernels which accept the kernel as a function pointer argument. Currently, when taking the address of a kernel, we get the stub function. These kernel launching APIs will not work if we use the kernel symbol to register the kernel. A solution is to return the kernel symbol instead of the stub function when taking the address of the kernel in host compilation, i.e. if a function pointer is assigned to a kernel in host code, it gets the kernel symbol instead of the stub function. This will make the kernel launching APIs work.

To keep the triple chevron working, the kernel symbol will be initialized with the address of the stub function. For a triple chevron call, the address of the stub function is loaded from the kernel symbol and invoked.
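A rough sketch of the proposed indirection (illustrative pseudo-C++, not the actual code clang would generate):

void __device_stub_foo(int* p);                           // per-kernel stub, still emitted
void* foo = reinterpret_cast<void*>(__device_stub_foo);   // kernel symbol, initialized with the stub's address

// Taking the kernel's address in host code now yields &foo, the registered kernel symbol,
// so the launch-by-pointer APIs keep working. A triple chevron call through a function
// pointer fp (which holds &foo) is lowered to roughly:
void launchThroughPointer(void* fp, int* p) {
  auto stub = *reinterpret_cast<void (**)(int*)>(fp);     // load the stub's address from the kernel symbol
  // (the launch configuration has already been pushed at this point in the generated code)
  stub(p);                                                // invoke the stub, which performs the launch
}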

tra added a comment. Feb 9 2021, 11:28 AM

Actually there is one issue with this approach.

HIP has APIs to launch kernels which accept the kernel as a function pointer argument. Currently, when taking the address of a kernel, we get the stub function. These kernel launching APIs will not work if we use the kernel symbol to register the kernel. A solution is to return the kernel symbol instead of the stub function when taking the address of the kernel in host compilation, i.e. if a function pointer is assigned to a kernel in host code, it gets the kernel symbol instead of the stub function. This will make the kernel launching APIs work.

To keep the triple chevron working, the kernel symbol will be initialized with the address of the stub function. For a triple chevron call, the address of the stub function is loaded from the kernel symbol and invoked.

This could work.
Do we really need an indirection? If we know the stub address when we initialize the symbol with it, we should be able to use that address for <<<>>>.

yaxunl added a comment. Feb 9 2021, 2:19 PM
In D86376#2552066, @tra wrote:

Actually there is one issue with this approach.

HIP has APIs to launch kernels which accept the kernel as a function pointer argument. Currently, when taking the address of a kernel, we get the stub function. These kernel launching APIs will not work if we use the kernel symbol to register the kernel. A solution is to return the kernel symbol instead of the stub function when taking the address of the kernel in host compilation, i.e. if a function pointer is assigned to a kernel in host code, it gets the kernel symbol instead of the stub function. This will make the kernel launching APIs work.

To keep the triple chevron working, the kernel symbol will be initialized with the address of the stub function. For a triple chevron call, the address of the stub function is loaded from the kernel symbol and invoked.

This could work.
Do we really need an indirection? If we know the stub address when we initialize the symbol with it, we should be able to use that address for <<<>>>.

For a triple chevron call with a kernel name, the indirection is not needed. We only need the indirection for a triple chevron call with a function pointer, in which case we do not know its stub function at compile time. This is allowed by CUDA/HIP.

tra added a comment. Feb 9 2021, 2:54 PM

For a triple chevron call with a kernel name, the indirection is not needed. We only need the indirection for a triple chevron call with a function pointer, in which case we do not know its stub function at compile time. This is allowed by CUDA/HIP.

Got it. We'll need to map the address of the symbol into the address of the stub.

Adding an indirection brings another question -- what's supposed to happen if we're passed a pointer that's *not* a pointer to the symbol. I.e. it does not point to the pointer to the stub.

Can we backtrack a bit and review our constraints/assumptions. I vaguely recall AMD introduced __device_stub because the debugger needed to distinguish the host-side stub from the device-side kernel.
If we add the data with the same name, would it not cause the same confusion about what the kernel is? If we are allowed to use 'kernel' on the host, is there a reason not to rename __device_stub_kernel back to kernel and just use the stub address everywhere?

Another question -- assuming that the stub can't be renamed, can we give the stub an alias with the name kernel? This way no matter how we take the address, it will always point to the stub.

yaxunl added a comment. Feb 9 2021, 3:23 PM
In D86376#2552524, @tra wrote:

For triple chevron with kernel name, it is not needed. We only need indirection for a triple chevron with a function pointer, in which case we do not know its stub function at compile time. This is allowed by CUDA/HIP.

Got it. We'll need to map the address of the symbol into the address of the stub.

Adding an indirection brings another question -- what's supposed to happen if we're passed a pointer that's *not* a pointer to the symbol. I.e. it does not point to the pointer to the stub.

The same thing could happen before this change, i.e., a function pointer does not contain the address of a stub function. In either case it will be UB. This change does not make the situation worse.

Can we backtrack a bit and review our constraints/assumptions. I vaguely recall AMD introduced __device_stub because the debugger needed to distinguish the host-side stub from the device-side kernel.
If we add the data with the same name, would it not cause the same confusion about what the kernel is? If we are allowed to use 'kernel' on the host, is there a reason not to rename __device_stub_kernel back to kernel and just use the stub address everywhere?

We have confirmed with our debugger team that emitting this symbol is OK for rocgdb since it is a variable symbol, not a function symbol.

Another question -- assuming that the stub can't be renamed, can we give the stub an alias with the name kernel? This way no matter how we take the address, it will always point to the stub.

We have tried this and it did not work. The alias ends up as a symbol for a function, which is not allowed by rocgdb.

yaxunl updated this revision to Diff 322894. Feb 10 2021, 7:13 PM

Handle launching a kernel by API and launching a kernel through a function pointer.

tra accepted this revision. Mar 1 2021, 9:51 AM

So, to summarize how the patch changes the under-the-hood kernel launch machinery:

  • device-side is unchanged. Kernel function is generated with the real kernel name
  • host-side stub is still generated with the __device_stub prefix.
  • host-side generates a 'handle' variable with the kernel function name, which is a pointer to the stub.
  • host-side registers the handle variable -> device-side kernel name association with the HIP runtime.
  • the address of the handle variable is used everywhere where we need a kernel pointer on the host side. I.e. passing kernel pointers around, referring to kernels across TUs, etc.
  • <<<>>> becomes an indirect call to a __device_stub function using the pointer retrieved from the handle.
This revision is now accepted and ready to land. Mar 1 2021, 9:51 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Mar 1 2021, 1:32 PM