This is an archive of the discontinued LLVM Phabricator instance.

[OpenCL][AMDGPU]: Do not allow a call to kernel
Needs ReviewPublic

Authored by cdevadas on Feb 25 2022, 6:47 AM.

Details

Summary

In OpenCL, a kernel is allowed to call other kernels as if
they are regular functions. To support it, clang emits
amdgpu_kernel calling convention for both caller and callee.
A backend pass in our downstream compiler alters such calls
by introducing regular function bodies which are clones of
the callee kernels. This implementation currently limits us
in certain ways. For instance, the restriction to not use
byref attribute for callee kernels.

To avoid such limitations, this patch brings in those
cloned functions early on and prevents clang from generating
amdgpu_kernel call sites. A new function body will be added
for each kernel in the compilation unit expecting that the
unused clones will get removed at link time.

Diff Detail

Event Timeline

cdevadas created this revision.Feb 25 2022, 6:47 AM
cdevadas requested review of this revision.Feb 25 2022, 6:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 25 2022, 6:47 AM

One of my concerns is that all kernels are duplicated which may cause code object size doubled.

Do we need to make the clone always_inline and let the kernel call its clone to avoid duplicate function bodies? Or LLVM has some pass to do that?

Another concern is that the duplicate non-kernel functions have actual kernel ABI. Not sure if that can cause any issues.

One of my concerns is that all kernels are duplicated which may cause code object size doubled.

Not really, the kernel should just be a stub that calls the real implementation function. In the real world this will always be inlined

Do we need to make the clone always_inline and let the kernel call its clone to avoid duplicate function bodies? Or LLVM has some pass to do that?

It's not a special case, there's no real need to put always_inline. Nobody uses this feature in the real world anyway, and single use functions will be inlined

Another concern is that the duplicate non-kernel functions have actual kernel ABI. Not sure if that can cause any issues.

My main question is how we have the symbol for the kernel and function coexist

arsenm added inline comments.Feb 25 2022, 7:19 AM
clang/lib/CodeGen/TargetInfo.cpp
9238

I don't think we can really start with the function IR. The TargetABIInfo could be different from the kernel and function form (and will due to using byval/byref etc.)

9240

I don't think adding a prefix and suffix is a good strategy for something which in principle should be ABI visible. A period + suffix I think would be a better convention

9478–9479

This is basically just moving what the current hack does into clang. Can we emit calls to the function version up front?

Is there something which stops you from taking the address of a kernel and then calling it? If not, are there actually any uses of kernels in the module that shouldn't be rewritten as uses of the clone?

I feel like this would be a lot easier to just fix in your LLVM pass so that you rewrite any uses of a kernel to use a clone instead before you rewrite the kernel.

Is there something which stops you from taking the address of a kernel and then calling it? If not, are there actually any uses of kernels in the module that shouldn't be rewritten as uses of the clone?

The actual amdgpu_kernel is uncallable and has a totally different ABI, and is invoked by external driver code. From the user's device code perspective, only the callable function version is meaningful.

I feel like this would be a lot easier to just fix in your LLVM pass so that you rewrite any uses of a kernel to use a clone instead before you rewrite the kernel.

Then we can't ban calls to kernels (and would be pushing some kind of symbol naming conflict decision into the backend) and in principle would have to handle this actual call.

We also don't really want these to have similar/compatible signatures where you can just swap out the call target. For example I want to more drastically change the IR used for aggregates between the two cases.

Is there something which stops you from taking the address of a kernel and then calling it? If not, are there actually any uses of kernels in the module that shouldn't be rewritten as uses of the clone?

The actual amdgpu_kernel is uncallable and has a totally different ABI, and is invoked by external driver code. From the user's device code perspective, only the callable function version is meaningful.

I think you're misunderstanding what I'm asking. I believe that in OpenCL, you can do &someKernelFunction in source code and then call that. The rewrite in this patch does not handle non-call uses of the kernel function and so will continue to miscompile them.

I feel like this would be a lot easier to just fix in your LLVM pass so that you rewrite any uses of a kernel to use a clone instead before you rewrite the kernel.

Then we can't ban calls to kernels (and would be pushing some kind of symbol naming conflict decision into the backend) and in principle would have to handle this actual call.

Okay, this is not an accurate description of what you're trying to do, and this is important to be precise about. You are not "banning calls to kernels", which would be a novel language restriction and make you non-conformant to OpenCL. You still have a language requirement to allow code to directly use kernel functions. That is why this patch is modifying IR generation instead of emitting new errors in Sema.

What's happening here is that your target (very reasonably) requires kernels to have a special kernel entrypoint in order to be called from outside. That entrypoint uses a very different ABI from ordinary functions, one which simplifies being dynamically called by the runtime, and so it is important that ordinary uses of the function don't accidentally resolve against that special entrypoint. You therefore need two different functions for the kernel, one to satisfy standard uses and one to act as the kernel entrypoint.

Your current architecture is to generate code normally, which will produce what's roughly the standard entrypoint, and then have a backend pass break that down and produce a kernel entrypoint. I can understand why you find that frustratingly limited, and I agree that it doesn't seem to handle standard uses correctly. Something needs to change here.

Now, Clang supports many different kernel languages, all of which face very similar language/implementation problems. It is therefore always informative to go check to see how other language implementors have tried to solve the problem you're facing. So if you go and look at how CUDA is implemented in Clang, you will see that they have introduced a "kernel reference kind" to GlobalDecl, which allows them to distinguish between the kernel entrypoint and the standard entrypoint in IRGen. You could very easily build on this in your OpenCL implementation so that Clang emits the standard entrypoint and then either your pass or IRGen itself fills in the kernel entrypoint by marshaling arguments and then calling (presumably in a way that forces inlining) the standard entrypoint. That would also give you total control of how arguments are marshaled in the kernel entrypoint.

Alternatively, I think cloning the standard entrypoint so that uses of it are rewritten to a clone is reasonable enough. I don't really see why doing the cloning in IRGen is necessary when you already have a module pass that does similar kinds of rewriting. Doing the clone in IRGen also does not seem to move you closer to your goal of actually marshaling arguments differently. Most importantly, though, I believe you do need to rewrite all the uses and not just the direct calls.