This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/CodeGen/
-
CodeGen/
3
CGCUDANV.cpp
-
test/CodeGenCUDA/
-
CodeGenCUDA/
-
kernel-args-alignment.cu

Differential D55067

[HIP] Fix offset of kernel argument for AMDGPU target
AbandonedPublic

Authored by yaxunl on Nov 29 2018, 11:20 AM.

Download Raw Diff

Details

Reviewers

tra
arsenm
rjmccall

Commits

rG42260d2938d8: hip-clang: Fix kernel arg alignment for amdgcn

Summary

Clang emits call of hipSetupArgument(arg, size, offset) in host IR to set up arguments
for a HIP kernel. The offset should meet the expection of the device backend.

Currently clang uses AST alignment to calculate the offset. This works for nvptx
backend and in most cases works for amdpu backend. However, this does not work
when the kernel argument is a packed struct.

In the device IR for amdgpu backend, a struct type kernel argument is passed directly,
instead of by a pointer with byval attribute. The backend calculates the offset of
the argument by ABI alignment of the arg in IR. For packed struct, this is always 1.
However, its AST alignment is different. This discrepency causes incorrect offset
value used in the emitted call of hipSetupArgument.

This patch fixes the issue by using ABI alignment of kernel arg in IR to calculate its offset
for amdgpu target.

It does not affect other targets.

Diff Detail

Event Timeline

yaxunl created this revision.Nov 29 2018, 11:20 AM

Herald added subscribers: t-tye, tpr, dstuttard and 2 others. · View Herald TranscriptNov 29 2018, 11:20 AM

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

arsenm added inline comments.Nov 29 2018, 11:53 AM

lib/CodeGen/CGCUDANV.cpp
205	Checking the specific target seems wrong. Shouldn't you just need to check if the value is byval or not?

In D55067#1313213, @rjmccall wrote:

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

lib/CodeGen/CGCUDANV.cpp
205	This is in the host code gen for the kernel stub function, where the struct argument is always by val for x86_64. We cannot use that to differentiate between nvptx and amdgcn.

In D55067#1313213, @rjmccall wrote:

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

Also, the alignment of the kernel argument is only used to put the argument in some buffer accessible to the device. It needs not match the C ABI alignment. If the user take address of this argument, they will get address of an alloca, which matches the C ABI alignment.

For example, in code below, %s.coerce is OK to align to 1. If user takes its address, they get %s, which is aligned to 8.

%struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>
%struct.U = type { i16 }

; Function Attrs: convergent noinline nounwind optnone
define amdgpu_kernel void @_Z6kernelc1SPi(i8 %a, %struct.S %s.coerce, i32* %b) #0 {
entry:
  %s = alloca %struct.S, align 8, addrspace(5)
  %s1 = addrspacecast %struct.S addrspace(5)* %s to %struct.S*
  %a.addr = alloca i8, align 1, addrspace(5)
  %0 = addrspacecast i8 addrspace(5)* %a.addr to i8*
  %b.addr = alloca i32*, align 8, addrspace(5)
  %1 = addrspacecast i32* addrspace(5)* %b.addr to i32**
  store %struct.S %s.coerce, %struct.S* %s1, align 8
  store i8 %a, i8* %0, align 1
  store i32* %b, i32** %1, align 8
  ret void
}

In D55067#1313290, @yaxunl wrote:
In D55067#1313213, @rjmccall wrote:

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

Also, the alignment of the kernel argument is only used to put the argument in some buffer accessible to the device. It needs not match the C ABI alignment. If the user take address of this argument, they will get address of an alloca, which matches the C ABI alignment.

For example, in code below, %s.coerce is OK to align to 1. If user takes its address, they get %s, which is aligned to 8.
%struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>
%struct.U = type { i16 }

; Function Attrs: convergent noinline nounwind optnone
define amdgpu_kernel void @_Z6kernelc1SPi(i8 %a, %struct.S %s.coerce, i32* %b) #0 {
entry:
  %s = alloca %struct.S, align 8, addrspace(5)
  %s1 = addrspacecast %struct.S addrspace(5)* %s to %struct.S*
  %a.addr = alloca i8, align 1, addrspace(5)
  %0 = addrspacecast i8 addrspace(5)* %a.addr to i8*
  %b.addr = alloca i32*, align 8, addrspace(5)
  %1 = addrspacecast i32* addrspace(5)* %b.addr to i32**
  store %struct.S %s.coerce, %struct.S* %s1, align 8
  store i8 %a, i8* %0, align 1
  store i32* %b, i32** %1, align 8
  ret void
}

Well the bad thing is that users need to specify different offset for nvptx and amdgpu for the same kernel, if they ever calls {cuda|hip}SetupArgument explicitly.

In D55067#1313264, @yaxunl wrote:

In D55067#1313213, @rjmccall wrote:

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

byval really doesn't make any sense for kernels. The address space wouldn't even match correctly, since byval is for stack passed arguments

In D55067#1313364, @arsenm wrote:

In D55067#1313264, @yaxunl wrote:

In D55067#1313213, @rjmccall wrote:

This seems backwards. Clang knows what the actual ABI alignment of the C type is, and it doesn't have to match the alignment of the IR type that IRGen produces. It's the actual C ABI alignment that's supposed to affect the calling convention, so there needs to be some way to specify the C ABI alignment on the parameter in IR. That may mean using byval, which can be given an explicit alignment.

AMDGPU backend does not support passing struct type kernel argument by pointer with byval attribute.

@arsenm Do you think it is feasible to use pointer with byval attribute to pass the struct type kernel argument? Thanks.

byval really doesn't make any sense for kernels. The address space wouldn't even match correctly, since byval is for stack passed arguments

How about clang emits kernel argument alignment metadata for C ABI and backend follows that?

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

arsenm added inline comments.Nov 29 2018, 1:56 PM

lib/CodeGen/CGCUDANV.cpp
205	No, you need to see what the lowering behavior is for the argument type in the device's calling convention

I don't understand why there's a discrepancy in this case. Why does anything think the alignment of a packed struct is anything other than 1? Why is the C alignment claiming something higher?

In D55067#1313480, @arsenm wrote:

I don't understand why there's a discrepancy in this case. Why does anything think the alignment of a packed struct is anything other than 1? Why is the C alignment claiming something higher?

Just because something is a packed struct in LLVM IR doesn't mean it was a packed struct in C.

In D55067#1313419, @rjmccall wrote:

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

In D55067#1318810, @arsenm wrote:

In D55067#1313419, @rjmccall wrote:

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

What I have said, repeatedly, is that that is not a good ABI rule because the alignment of the IR type is not guaranteed to be meaningful or even stable. Maybe there are good reasons why you don't care about having a stable ABI, but I need you to at least acknowledge that that's what you're doing, because otherwise it is hard for me to approve a patch that I know has this problem.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

In D55067#1318959, @rjmccall wrote:

In D55067#1318810, @arsenm wrote:

In D55067#1313419, @rjmccall wrote:

I understand that it's copied into a properly-aligned local variable, but if it affects how the function is called, that's also part of the ABI, and it should be taken from the C alignment, not any vagaries of how struct types are translated into IR.

The other reason to stick with the C alignment value is that it's actually stable across compiler reasons, whereas IRGen does not promise to use the same IR type for a struct across compiler versions.

Now, you're a GPU compiler, so maybe you don't have any ABI requirements, or maybe they're weaker for kernel functions, in which case this is basically irrelevant. But if you do care about compatibility here, you should be aiming to communicate the alignment of this parameter correctly.

This is part of why we largely try to avoid using arbitrary first-class aggregate as parameters in IR.

byval isn't suitable for the purposes of the entry point here, so I don't see what the alternative is with the current constraints. We've defined the ABI input buffer as just the ABI allocation size/alignment of the IR type in order. Clang needs to get this to match what it wants.

What I have said, repeatedly, is that that is not a good ABI rule because the alignment of the IR type is not guaranteed to be meaningful or even stable. Maybe there are good reasons why you don't care about having a stable ABI, but I need you to at least acknowledge that that's what you're doing, because otherwise it is hard for me to approve a patch that I know has this problem.

I don't understand how the alignment of the IR type isn't meaningful or stable. The DataLayout gives us the concept of an ABI alignment, but you seem to be saying this is meaningless (in which case why do we have it?)

I think we have different understandings of what an ABI means in this case. The "call" to a kernel isn't made in a vacuum unlike a normal function. For OpenCL at least, the backend produces metadata in the object file telling the explicit offsets of the individual arguments, which we today compute from the IR argument list. The actual caller isn't using the types to decide where to place arguments, the binary is explicitly telling it where. For some reason CUDA seems to be doing something different here which might be part of my confusion?

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

We would need to make up some other way to track the individual arguments, which I haven't come up with a great solution for (other than keeping them there without uses, which seems pretty awkward). As an optimization we currently do lower the kernel argument uses to look like this, but we compute the offsets from the IR arguments.

In D55067#1318810, @arsenm wrote:

An OpenCL kernel may call another OpenCL kernel. I am wondering how do you pass arguments to the kernel callee.

A simpler solution would be not letting hipSetupArgument specify the offset. Since the kernel metadata contains the alignment info of kernel argument, HIP runtime is able to calculate the offset based on kernel metadata by itself. hipSetupArgument only needs to specify the index of the argument and its size.

In D55067#1319047, @arsenm wrote:

I don't understand how the alignment of the IR type isn't meaningful or stable. The DataLayout gives us the concept of an ABI alignment, but you seem to be saying this is meaningless (in which case why do we have it?)

LLVM has stable rules for computing the alignment of a particular LLVM struct type. Clang IRGen does not have stable rules for turning a Clang struct type into an LLVM struct type. Clang always reserves the right to translate a particular source-language struct type into, say, { [8 x i8] } instead of { i32, i32 }. This is unlikely for sufficiently simple types just because it's nice for debugging and for the compactness of the generated code if the generated struct type is reasonably similar to the original type, but there are complex cases where the translation is not totally obvious, and we don't want to make promises. Because the translation to LLVM IR type is not stable, the choice of alignment for the LLVM IR type cannot be guaranteed either.

I think we have different understandings of what an ABI means in this case. The "call" to a kernel isn't made in a vacuum unlike a normal function. For OpenCL at least, the backend produces metadata in the object file telling the explicit offsets of the individual arguments, which we today compute from the IR argument list. The actual caller isn't using the types to decide where to place arguments, the binary is explicitly telling it where. For some reason CUDA seems to be doing something different here which might be part of my confusion?

If you don't guarantee stability across recompilations for the size and layout of this buffer, and neither callers nor callees make assumptions about the alignment of pointers within it, then you might indeed not have any ABI requirements. But in that case you might as well just use a packed structure so that you can minimize the number of bytes you have to transfer.

The alternative I would like is to stop using the IR argument list at all for kernels. All arguments would be accessed from offsets from an intrinsic call

That seems like a very reasonable alternative, or you could have the function take a single pointer argument and do all accesses relative to that. I'll leave that up to you.

We would need to make up some other way to track the individual arguments, which I haven't come up with a great solution for (other than keeping them there without uses, which seems pretty awkward). As an optimization we currently do lower the kernel argument uses to look like this, but we compute the offsets from the IR arguments.

Okay. It seems to me like there should be exactly one place responsible for computing the layout of this buffer. Like, if you really don't care about alignment, you could pack it; or if you do care about alignment but don't need stability, you could reorder the fields to minimize alignment padding; or really anything along those lines.

I think if we can just declare something simple to follow that doesn't depend on the IR type alignment, we could pack any basic type and align any aggregates to 4

In D55067#1320691, @arsenm wrote:

I think if we can just declare something simple to follow that doesn't depend on the IR type alignment, we could pack any basic type and align any aggregates to 4

From the user point of view, I don't think it is a good idea to let the user calculate the offset and pass it to hipSetupArgument, especially if we want to have some specific alignment rules for amdgpu backend. We should just drop the offset parameter from hipSetupArgument and let the runtime figure out the offset based on device code metadata. This will always work no matter how backend changes the way to layout the kernel arguments.

Is this still relevant? We want to move towards consistently using byref for kernel arguments anyway

This revision now requires changes to proceed.Nov 9 2021, 9:30 AM

this is no longer needed

Revision Contents

Path

Size

lib/

CodeGen/

CGCUDANV.cpp

16 lines

test/

CodeGenCUDA/

kernel-args-alignment.cu

32 lines

Diff 175908

lib/CodeGen/CGCUDANV.cpp

	Show First 20 Lines • Show All 193 Lines • ▼ Show 20 Lines

	void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,			void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,
	FunctionArgList &Args) {			FunctionArgList &Args) {
	// Emit a call to cudaSetupArgument for each arg in Args.			// Emit a call to cudaSetupArgument for each arg in Args.
	llvm::Constant *cudaSetupArgFn = getSetupArgumentFn();			llvm::Constant *cudaSetupArgFn = getSetupArgumentFn();
	llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");			llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
	CharUnits Offset = CharUnits::Zero();			CharUnits Offset = CharUnits::Zero();
	for (const VarDecl *A : Args) {			for (const VarDecl *A : Args) {
				auto *Arg = CGF.GetAddrOfLocalVar(A).getPointer();
	CharUnits TyWidth, TyAlign;			CharUnits TyWidth, TyAlign;
				auto *Aux = CGM.getContext().getAuxTargetInfo();
				if (Aux && Aux->getTriple().getArch() == llvm::Triple::amdgcn) {
				arsenmUnsubmitted Not Done Reply Inline Actions Checking the specific target seems wrong. Shouldn't you just need to check if the value is byval or not? arsenm: Checking the specific target seems wrong. Shouldn't you just need to check if the value is…
				yaxunlAuthorUnsubmitted Not Done Reply Inline Actions This is in the host code gen for the kernel stub function, where the struct argument is always by val for x86_64. We cannot use that to differentiate between nvptx and amdgcn. yaxunl: This is in the host code gen for the kernel stub function, where the struct argument is always…
				arsenmUnsubmitted Not Done Reply Inline Actions No, you need to see what the lowering behavior is for the argument type in the device's calling convention arsenm: No, you need to see what the lowering behavior is for the argument type in the device's calling…
				auto *ArgTy = Arg->getType()->getPointerElementType();
				auto &DL = CGM.getDataLayout();
				TyWidth = CharUnits::fromQuantity(DL.getTypeStoreSize(ArgTy));
				TyAlign = CharUnits::fromQuantity(DL.getABITypeAlignment(ArgTy));
				} else {
	std::tie(TyWidth, TyAlign) =			std::tie(TyWidth, TyAlign) =
	CGM.getContext().getTypeInfoInChars(A->getType());			CGM.getContext().getTypeInfoInChars(A->getType());
				}
	Offset = Offset.alignTo(TyAlign);			Offset = Offset.alignTo(TyAlign);
	llvm::Value *Args[] = {			llvm::Value *Args[] = {
	CGF.Builder.CreatePointerCast(CGF.GetAddrOfLocalVar(A).getPointer(),			CGF.Builder.CreatePointerCast(Arg, VoidPtrTy),
	VoidPtrTy),
	llvm::ConstantInt::get(SizeTy, TyWidth.getQuantity()),			llvm::ConstantInt::get(SizeTy, TyWidth.getQuantity()),
	llvm::ConstantInt::get(SizeTy, Offset.getQuantity()),			llvm::ConstantInt::get(SizeTy, Offset.getQuantity()),
	};			};
	llvm::CallSite CS = CGF.EmitRuntimeCallOrInvoke(cudaSetupArgFn, Args);			llvm::CallSite CS = CGF.EmitRuntimeCallOrInvoke(cudaSetupArgFn, Args);
	llvm::Constant *Zero = llvm::ConstantInt::get(IntTy, 0);			llvm::Constant *Zero = llvm::ConstantInt::get(IntTy, 0);
	llvm::Value *CSZero = CGF.Builder.CreateICmpEQ(CS.getInstruction(), Zero);			llvm::Value *CSZero = CGF.Builder.CreateICmpEQ(CS.getInstruction(), Zero);
	llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");			llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");
	CGF.Builder.CreateCondBr(CSZero, NextBlock, EndBlock);			CGF.Builder.CreateCondBr(CSZero, NextBlock, EndBlock);
	▲ Show 20 Lines • Show All 416 Lines • Show Last 20 Lines

test/CodeGenCUDA/kernel-args-alignment.cu

	// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -emit-llvm -o - %s \| \			// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -emit-llvm -o - %s \| \
	// RUN: FileCheck -check-prefix HOST -check-prefix CHECK %s			// RUN: FileCheck -check-prefixes=HOST,HOST-NV,CHECK %s

	// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple nvptx64-nvidia-cuda \			// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple nvptx64-nvidia-cuda \
	// RUN: -emit-llvm -o - %s \| FileCheck -check-prefix DEVICE -check-prefix CHECK %s			// RUN: -emit-llvm -o - %s \| FileCheck -check-prefixes=DEVICE,DEVICE-NV,CHECK %s

				// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -x hip \
				// RUN: -aux-triple amdgcn-amd-amdhsa -emit-llvm -o - %s \| \
				// RUN: FileCheck -check-prefixes=HOST,HOST-AMD,CHECK %s

				// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple amdgcn-amd-amdhsa \
				// RUN: -x hip -emit-llvm -o - %s \| FileCheck -check-prefixes=DEVICE,DEVICE-AMD,CHECK %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	struct U {			struct U {
	short x;			short x;
	} __attribute__((packed));			} __attribute__((packed));

	struct S {			struct S {
	int *ptr;			int *ptr;
	char a;			char a;
	U u;			U u;
	};			};

	// Clang should generate a packed LLVM struct for S (denoted by the <>s),			// Clang should generate a packed LLVM struct for S (denoted by the <>s),
	// otherwise this test isn't interesting.			// otherwise this test isn't interesting.
	// CHECK: %struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>			// CHECK: %struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>

	static_assert(alignof(S) == 8, "Unexpected alignment.");			static_assert(alignof(S) == 8, "Unexpected alignment.");

	// HOST-LABEL: @_Z6kernelc1SPi			// HOST-LABEL: @_Z6kernelc1SPi
	// Marshalled kernel args should be:			// For NVPTX backend, marshalled kernel args should be:
	// 1. offset 0, width 1			// 1. offset 0, width 1
	// 2. offset 8 (because alignof(S) == 8), width 16			// 2. offset 8 (because alignof(S) == 8), width 16
	// 3. offset 24, width 8			// 3. offset 24, width 8
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)			// HOST-NV: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)			// HOST-NV: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)			// HOST-NV: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)
				// AMDGPU backend assumes struct type kernel arguments are passed directly,
				// not byval. It lays out kernel arguments by size and alignment in IR.
				// Packed struct type in IR always has ABI alignment of 1.
				// For AMDGPU backend, marshalled kernel args should be:
				// 1. offset 0, width 1
				// 2. offset 1 (because ABI alignment of S is 1), width 16
				// 3. offset 24, width 8
				// HOST-AMD: call i32 @hipSetupArgument({{[^,]*}}, i64 1, i64 0)
				// HOST-AMD: call i32 @hipSetupArgument({{[^,]*}}, i64 16, i64 1)
				// HOST-AMD: call i32 @hipSetupArgument({{[^,]*}}, i64 8, i64 24)

	// DEVICE-LABEL: @_Z6kernelc1SPi			// DEVICE-LABEL: @_Z6kernelc1SPi
	// DEVICE-SAME: i8{{[^,]}}, %struct.S byval align 8{{[^,]}}, i32			// DEVICE-NV-SAME: i8{{[^,]}}, %struct.S byval align 8{{[^,]}}, i32
				// DEVICE-AMD-SAME: i8{{[^,]}}, %struct.S{{[^,]}}, i32
	__global__ void kernel(char a, S s, int *b) {}			__global__ void kernel(char a, S s, int *b) {}