This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Add amdgpu_kernel calling conv attribute to clang
ClosedPublic

Authored by JonChesterfield on May 19 2022, 6:28 AM.

Details

Summary

Allows emitting define amdgpu_kernel void @func() IR from C or C++.

This replaces the current workflow which is to write a stub in opencl that
calls an external C function implemented in C++ combined through llvm-link.

Calling the resulting function still requires a manual implementation of the
ABI from the host side. The primary application is for more rapid debugging
of the amdgpu backend by permuting a C or C++ test file instead of manually
updating an IR file.

Implementation closely follows D54425. Non-amd reviewers from there.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 6:28 AM
JonChesterfield requested review of this revision.May 19 2022, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 6:28 AM

need a codegen test to make sure amdgpu_kernel ABI is used in C/C++ for functions with this attribute. https://github.com/llvm/llvm-project/blob/main/clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu#L64 may be used as an example.

need a codegen test to make sure amdgpu_kernel ABI is used in C/C++ for functions with this attribute

We've already got tests that check the amdgpu_kernel calling conv is lowered correctly. This change only adds it to clang, so checking the IR seems sufficient. I can copy/paste/mod that test case if you like, but it doesn't give any extra coverage that I can see

need a codegen test to make sure amdgpu_kernel ABI is used in C/C++ for functions with this attribute

We've already got tests that check the amdgpu_kernel calling conv is lowered correctly. This change only adds it to clang, so checking the IR seems sufficient. I can copy/paste/mod that test case if you like, but it doesn't give any extra coverage that I can see

we only need an IR test, e.g. check struct type kernel arg is using byref.

JonChesterfield added a comment.EditedMay 19 2022, 1:43 PM

OK, so that's a different thing. CUDA/HIP has a bunch of rules about implicitly tagging things with addrspace(1) at the call boundary. I don't think any of that magic should exist for C or C++, the developer gets to spell out the address space stuff they want explicitly, and if they want to link it against CUDA/HIP/OpenMP code they get to look up the rules. In particular, persuading clang to do the extra argument mangling stuff will get in the way of using this to create test cases.

I'm writing the corresponding test case for that now. I only really wanted void func(void) to be usable from this, with the assumption that the developer digs arguments out of the kernarg pointer if they want some.

Found something weird. The cuda test case above compiles with O0 and also with O2.

__attribute__((amdgpu_kernel)) void kernel1(int *x) {
  x[0]++;
}

On O0, the argument is in no particular address space, as specified.

; Function Attrs: mustprogress noinline nounwind optnone
define dso_local amdgpu_kernel void @_Z7kernel1Pi(i32* noundef %x) #0 {
entry:
  %x.addr = alloca i32*, align 8, addrspace(5)
  %x.addr.ascast = addrspacecast i32* addrspace(5)* %x.addr to i32**
  store i32* %x, i32** %x.addr.ascast, align 8
  %0 = load i32*, i32** %x.addr.ascast, align 8
  %arrayidx = getelementptr inbounds i32, i32* %0, i64 0
  %1 = load i32, i32* %arrayidx, align 4
  %inc = add nsw i32 %1, 1
  store i32 %inc, i32* %arrayidx, align 4
  ret void
}

After opt O2, we cast the address to addrspace(1), then dereference that.

define dso_local amdgpu_kernel void @_Z7kernel1Pi(i32* nocapture noundef %x) local_unnamed_addr #0 {
entry:
  %x.global = addrspacecast i32* %x to i32 addrspace(1)*
  %0 = load i32, i32 addrspace(1)* %x.global, align 4, !amdgpu.noclobber !2
  %inc = add nsw i32 %0, 1
  store i32 %inc, i32 addrspace(1)* %x.global, align 4
  ret void
}

So that's a bug, right? An optimisation has decided an unadorned pointer argument can be assumed to be in addrspace(1)

  • Add O0 arg passing codegen test

Added a codegen test for arg passing. It establishes that most arguments are left alone, but structs passed by value are handled as an addrspace(4) byref. Letting opt -O2 run annotated some argument pointers as being in addrspace(1) which I think is wrong.

I have no judgement to pass on the current codegen for this - it might be correct, or we might have bugs in opt - but those bugs are exactly those that are easier to flush out if we land this patch first. All I want to achieve with this patch is a means of tagging a void func(void) with the calling convention.

If we wanted to go further - and turn C++ into a language that is robustly usable on the GPU - I think we have a bunch of tasks related to handling addrspace annotations to burn through.

yaxunl accepted this revision.May 19 2022, 2:38 PM

LGTM. Thanks.

This revision is now accepted and ready to land.May 19 2022, 2:38 PM

Thanks for accepting! I'm interested to learn more about how the calling conv works, e.g. if parts of it are implemented in clang and parts of it patched on the fly by opt, but that's downstream of easy access to writing C tests that use it.

  • Rebase on main
  • Fix git merge misfires
  • Fix git merge misfires

Unintentionally created this patch against an older version of main and it interacted badly with D124998 on the rebase. Rerunning tests now, and will leave this open for further comments for a little while. Thanks all

This revision was landed with ongoing or failed builds.May 20 2022, 12:51 AM
This revision was automatically updated to reflect the committed changes.
JonChesterfield added a comment.EditedMay 20 2022, 1:43 AM

A clangd buildbot (https://lab.llvm.org/buildbot/#/builders/131/builds/27770) failed on this with

[ RUN      ] SerializationTest.NoCrashOnBadArraySize
==384111==ERROR: ThreadSanitizer failed to allocate 0x10000 (65536) bytes of stack depot (error code: 12)
ERROR: Failed to mmap

I believe this is coincidence

edit: confirmed, earlier builds failed the same way, e.g. https://lab.llvm.org/buildbot/#/builders/131/builds/27768

The primary application is for more rapid debugging of the amdgpu backend by permuting a C or C++ test file instead of manually updating an IR file.

Given that this is adding a calling convention, which has significant impacts on our type system: is this use case important enough to steal a bit for this CC? This sounds *super* special case to me, but maybe it's a common need?

clang/include/clang/Basic/Attr.td
1862

No new undocumented attributes.

The primary application is for more rapid debugging of the amdgpu backend by permuting a C or C++ test file instead of manually updating an IR file.

Given that this is adding a calling convention, which has significant impacts on our type system: is this use case important enough to steal a bit for this CC? This sounds *super* special case to me, but maybe it's a common need?

Btw, I'd appreciate if you gave code reviewers more than one day to review a change to the type system before landing -- I'm in WG14 meetings all week, so I don't have much time to do a thorough review of something like this.

If it was adding a calling convention, sure - caution warranted. There's no llvm change here though, an existing CC is exposed to C++. No change to the type system either.

I'll propose a patch with some documentation for it if you wish, but it'll just say "For ad hoc debugging of the amdgpu backend". Undocumented seems to state that more clearly.

If it was adding a calling convention, sure - caution warranted. There's no llvm change here though, an existing CC is exposed to C++. No change to the type system either.

This is adding a user-facing calling convention to Clang and it changes the type system as a result. For example, lambda function pointer conversion operators sometimes are generated for each calling convention so that you can form a function pointer of the correct type (this might not be impacted by your change here); there's a specific number of bits for representing the enumeration of calling conventions and this uses one of those bits, etc.

I'll propose a patch with some documentation for it if you wish, but it'll just say "For ad hoc debugging of the amdgpu backend". Undocumented seems to state that more clearly.

I continue to question whether we want to support such a calling convention. This does not seem to be generally useful enough to warrant inclusion in Clang. The fact that you'd like to leave it undocumented as that's more clear for users is a pretty good indication that this calling convention doesn't meet the bar for an extension.

If it was adding a calling convention, sure - caution warranted. There's no llvm change here though, an existing CC is exposed to C++. No change to the type system either.

This is adding a user-facing calling convention to Clang and it changes the type system as a result. For example, lambda function pointer conversion operators sometimes are generated for each calling convention so that you can form a function pointer of the correct type (this might not be impacted by your change here); there's a specific number of bits for representing the enumeration of calling conventions and this uses one of those bits, etc.

It slightly changes the type system of C++ code in that the calling convention was previously only available in opencl / openmp etc. I was under the impression that the compiler data representation cost of calling conventions was in LLVM and thus pre-paid for the calling convention this gives access to. There's the enum CallingConv which has gained a field, I didn't realise that was input into something of limited bitwidth.

I'll propose a patch with some documentation for it if you wish, but it'll just say "For ad hoc debugging of the amdgpu backend". Undocumented seems to state that more clearly.

I continue to question whether we want to support such a calling convention. This does not seem to be generally useful enough to warrant inclusion in Clang. The fact that you'd like to leave it undocumented as that's more clear for users is a pretty good indication that this calling convention doesn't meet the bar for an extension.

Strictly speaking this lets people write a GPU kernel that can execute on AMDGPU in freestanding C++. I happen to want to do that for testing LLVM in the immediate instance but there's arguably wider applicability. However, it looks like how arguments are represented in this calling convention has some strangeness (see discussion with Sam above), particularly with regard to address spaces.

I can revert this patch if necessary, but it'll force me to continue trying to test our compiler through the lens of opencl, and rules out programming the hardware without the various specific language front ends. I think that would be a sad loss.

If it was adding a calling convention, sure - caution warranted. There's no llvm change here though, an existing CC is exposed to C++. No change to the type system either.

This is adding a user-facing calling convention to Clang and it changes the type system as a result. For example, lambda function pointer conversion operators sometimes are generated for each calling convention so that you can form a function pointer of the correct type (this might not be impacted by your change here); there's a specific number of bits for representing the enumeration of calling conventions and this uses one of those bits, etc.

It slightly changes the type system of C++ code in that the calling convention was previously only available in opencl / openmp etc. I was under the impression that the compiler data representation cost of calling conventions was in LLVM and thus pre-paid for the calling convention this gives access to. There's the enum CallingConv which has gained a field, I didn't realise that was input into something of limited bitwidth.

Calling conventions are weird in that they have a fair amount of frontend AND backend work involved with them (though maybe this one is more backend than frontend as it doesn't seem to be doing much different in codegen). As for the bit-width thing, it mostly comes into play here: https://github.com/llvm/llvm-project/blob/main/clang/include/clang/AST/Type.h#L3678 -- we try to pack as much information into as few bits as possible because Type overhead causes big problems (for example, it limits template instantiation depth due to memory overhead). We have *some* wiggle room left in that bit-field, and we have some ideas on how to improve the situation, but... nobody's done the refactoring work yet and each new calling convention we add brings us that much closer to the answer being "sorry, can't do it.".

I'll propose a patch with some documentation for it if you wish, but it'll just say "For ad hoc debugging of the amdgpu backend". Undocumented seems to state that more clearly.

I continue to question whether we want to support such a calling convention. This does not seem to be generally useful enough to warrant inclusion in Clang. The fact that you'd like to leave it undocumented as that's more clear for users is a pretty good indication that this calling convention doesn't meet the bar for an extension.

Strictly speaking this lets people write a GPU kernel that can execute on AMDGPU in freestanding C++.

That sounds generally useful, which is great!

I happen to want to do that for testing LLVM in the immediate instance but there's arguably wider applicability. However, it looks like how arguments are represented in this calling convention has some strangeness (see discussion with Sam above), particularly with regard to address spaces.

That sounds less great. :-( This suggests there may be another calling convention in the future which corrects those deficiencies.

I can revert this patch if necessary, but it'll force me to continue trying to test our compiler through the lens of opencl, and rules out programming the hardware without the various specific language front ends. I think that would be a sad loss.

My thinking is: I don't want you to revert and have no solution, because you have a problem to solve and this (presumably) solves it for you. But at the same time, I'd like us to be able to explore options to make sure that a calling convention is the best approach and that we're confident that we won't need additional calling conventions to fix corner cases in the future. For example, can the calling convention be inferred at codegen time in Clang or by an LLVM pass so that the FE doesn't need to expose an attribute through the type system?

Have you explored alternatives that don't require a user-facing attribute?

In HIP, kernels are represented by attribute global and not by calling convention in clang. This may be an alternative.

Another alternative might be merging amdgpu_kernel and opencl_kernel calling convention since for the same target they are the same. They could be represented by the same kernel calling convention in AST.