Page MenuHomePhabricator

[HIP] Fix offset of kernel argument for AMDGPU target
Needs ReviewPublic

Authored by yaxunl on Thu, Nov 29, 11:20 AM.

Details

Summary

Clang emits call of hipSetupArgument(arg, size, offset) in host IR to set up arguments
for a HIP kernel. The offset should meet the expection of the device backend.

Currently clang uses AST alignment to calculate the offset. This works for nvptx
backend and in most cases works for amdpu backend. However, this does not work
when the kernel argument is a packed struct.

In the device IR for amdgpu backend, a struct type kernel argument is passed directly,
instead of by a pointer with byval attribute. The backend calculates the offset of
the argument by ABI alignment of the arg in IR. For packed struct, this is always 1.
However, its AST alignment is different. This discrepency causes incorrect offset
value used in the emitted call of hipSetupArgument.

This patch fixes the issue by using ABI alignment of kernel arg in IR to calculate its offset
for amdgpu target.

It does not affect other targets.

Diff Detail

Event Timeline

yaxunl created this revision.Thu, Nov 29, 11:20 AM

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

arsenm added inline comments.Thu, Nov 29, 11:53 AM
lib/CodeGen/CGCUDANV.cpp
205

Checking the specific target seems wrong. Shouldn't you just need to check if the value is byval or not?

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

lib/CodeGen/CGCUDANV.cpp
205

This is in the host code gen for the kernel stub function, where the struct argument is always by val for x86_64. We cannot use that to differentiate between nvptx and amdgcn.

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

Also, the alignment of the kernel argument is only used to put the argument in some buffer accessible to the device. It needs not match the C ABI alignment. If the user take address of this argument, they will get address of an alloca, which matches the C ABI alignment.

For example, in code below, %s.coerce is OK to align to 1. If user takes its address, they get %s, which is aligned to 8.

%struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>
%struct.U = type { i16 }

; Function Attrs: convergent noinline nounwind optnone
define amdgpu_kernel void @_Z6kernelc1SPi(i8 %a, %struct.S %s.coerce, i32* %b) #0 {
entry:
  %s = alloca %struct.S, align 8, addrspace(5)
  %s1 = addrspacecast %struct.S addrspace(5)* %s to %struct.S*
  %a.addr = alloca i8, align 1, addrspace(5)
  %0 = addrspacecast i8 addrspace(5)* %a.addr to i8*
  %b.addr = alloca i32*, align 8, addrspace(5)
  %1 = addrspacecast i32* addrspace(5)* %b.addr to i32**
  store %struct.S %s.coerce, %struct.S* %s1, align 8
  store i8 %a, i8* %0, align 1
  store i32* %b, i32** %1, align 8
  ret void
}

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

Also, the alignment of the kernel argument is only used to put the argument in some buffer accessible to the device. It needs not match the C ABI alignment. If the user take address of this argument, they will get address of an alloca, which matches the C ABI alignment.

For example, in code below, %s.coerce is OK to align to 1. If user takes its address, they get %s, which is aligned to 8.

%struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>
%struct.U = type { i16 }

; Function Attrs: convergent noinline nounwind optnone
define amdgpu_kernel void @_Z6kernelc1SPi(i8 %a, %struct.S %s.coerce, i32* %b) #0 {
entry:
  %s = alloca %struct.S, align 8, addrspace(5)
  %s1 = addrspacecast %struct.S addrspace(5)* %s to %struct.S*
  %a.addr = alloca i8, align 1, addrspace(5)
  %0 = addrspacecast i8 addrspace(5)* %a.addr to i8*
  %b.addr = alloca i32*, align 8, addrspace(5)
  %1 = addrspacecast i32* addrspace(5)* %b.addr to i32**
  store %struct.S %s.coerce, %struct.S* %s1, align 8
  store i8 %a, i8* %0, align 1
  store i32* %b, i32** %1, align 8
  ret void
}

Well the bad thing is that users need to specify different offset for nvptx and amdgpu for the same kernel, if they ever calls {cuda|hip}SetupArgument explicitly.

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

byval really doesn't make any sense for kernels. The address space wouldn't even match correctly, since byval is for stack passed arguments

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

byval really doesn't make any sense for kernels. The address space wouldn't even match correctly, since byval is for stack passed arguments

How about clang emits kernel argument alignment metadata for C ABI and backend follows that?

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

arsenm added inline comments.Thu, Nov 29, 1:56 PM
lib/CodeGen/CGCUDANV.cpp
205

No, you need to see what the lowering behavior is for the argument type in the device's calling convention

I don't understand why there's a discrepancy in this case. Why does anything think the alignment of a packed struct is anything other than 1? Why is the C alignment claiming something higher?

I don't understand why there's a discrepancy in this case. Why does anything think the alignment of a packed struct is anything other than 1? Why is the C alignment claiming something higher?

Just because something is a packed struct in LLVM IR doesn't mean it was a packed struct in C.

arsenm added a comment.Tue, Dec 4, 9:43 AM

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

What I have said, repeatedly, is that that is not a good ABI rule because the alignment of the IR type is not guaranteed to be meaningful or even stable. Maybe there are good reasons why you don't care about having a stable ABI, but I need you to at least acknowledge that that's what you're doing, because otherwise it is hard for me to approve a patch that I know has this problem.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

What I have said, repeatedly, is that that is not a good ABI rule because the alignment of the IR type is not guaranteed to be meaningful or even stable. Maybe there are good reasons why you don't care about having a stable ABI, but I need you to at least acknowledge that that's what you're doing, because otherwise it is hard for me to approve a patch that I know has this problem.

I don't understand how the alignment of the IR type isn't meaningful or stable. The DataLayout gives us the concept of an ABI alignment, but you seem to be saying this is meaningless (in which case why do we have it?)

I think we have different understandings of what an ABI means in this case. The "call" to a kernel isn't made in a vacuum unlike a normal function. For OpenCL at least, the backend produces metadata in the object file telling the explicit offsets of the individual arguments, which we today compute from the IR argument list. The actual caller isn't using the types to decide where to place arguments, the binary is explicitly telling it where. For some reason CUDA seems to be doing something different here which might be part of my confusion?

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

We would need to make up some other way to track the individual arguments, which I haven't come up with a great solution for (other than keeping them there without uses, which seems pretty awkward). As an optimization we currently do lower the kernel argument uses to look like this, but we compute the offsets from the IR arguments.

An OpenCL kernel may call another OpenCL kernel. I am wondering how do you pass arguments to the kernel callee.

A simpler solution would be not letting hipSetupArgument specify the offset. Since the kernel metadata contains the alignment info of kernel argument, HIP runtime is able to calculate the offset based on kernel metadata by itself. hipSetupArgument only needs to specify the index of the argument and its size.

I don't understand how the alignment of the IR type isn't meaningful or stable. The DataLayout gives us the concept of an ABI alignment, but you seem to be saying this is meaningless (in which case why do we have it?)

LLVM has stable rules for computing the alignment of a particular LLVM struct type. Clang IRGen does not have stable rules for turning a Clang struct type into an LLVM struct type. Clang always reserves the right to translate a particular source-language struct type into, say, { [8 x i8] } instead of { i32, i32 }. This is unlikely for sufficiently simple types just because it's nice for debugging and for the compactness of the generated code if the generated struct type is reasonably similar to the original type, but there are complex cases where the translation is not totally obvious, and we don't want to make promises. Because the translation to LLVM IR type is not stable, the choice of alignment for the LLVM IR type cannot be guaranteed either.

I think we have different understandings of what an ABI means in this case. The "call" to a kernel isn't made in a vacuum unlike a normal function. For OpenCL at least, the backend produces metadata in the object file telling the explicit offsets of the individual arguments, which we today compute from the IR argument list. The actual caller isn't using the types to decide where to place arguments, the binary is explicitly telling it where. For some reason CUDA seems to be doing something different here which might be part of my confusion?

If you don't guarantee stability across recompilations for the size and layout of this buffer, and neither callers nor callees make assumptions about the alignment of pointers within it, then you might indeed not have any ABI requirements. But in that case you might as well just use a packed structure so that you can minimize the number of bytes you have to transfer.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

We would need to make up some other way to track the individual arguments, which I haven't come up with a great solution for (other than keeping them there without uses, which seems pretty awkward). As an optimization we currently do lower the kernel argument uses to look like this, but we compute the offsets from the IR arguments.

Okay. It seems to me like there should be exactly one place responsible for computing the layout of this buffer. Like, if you really don't care about alignment, you could pack it; or if you do care about alignment but don't need stability, you could reorder the fields to minimize alignment padding; or really anything along those lines.

arsenm added a comment.Wed, Dec 5, 2:36 PM

I think if we can just declare something simple to follow that doesn't depend on the IR type alignment, we could pack any basic type and align any aggregates to 4

yaxunl added a comment.Wed, Dec 5, 7:25 PM

I think if we can just declare something simple to follow that doesn't depend on the IR type alignment, we could pack any basic type and align any aggregates to 4

From the user point of view, I don't think it is a good idea to let the user calculate the offset and pass it to hipSetupArgument, especially if we want to have some specific alignment rules for amdgpu backend. We should just drop the offset parameter from hipSetupArgument and let the runtime figure out the offset based on device code metadata. This will always work no matter how backend changes the way to layout the kernel arguments.