This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
lib/CodeGen/
-
CodeGen/
-
TargetInfo.cpp
-
test/CodeGenCUDA/
-
CodeGenCUDA/
6/11
amdgpu-kernel-arg-pointer-type.cu
-
kernel-args.cu

Differential D89980

[hip] Remove the coercion on aggregate kernel arguments.
ClosedPublic

Authored by hliao on Oct 22 2020, 12:54 PM.

Download Raw Diff

Details

Reviewers

arsenm
tra
rjmccall
yaxunl
msearles

Commits

rG8920ef06a138: [hip] Remove the coercion on aggregate kernel arguments.

Summary

If an aggregate argument is indirectly accessed within kernels, direct passing results in unpromotable alloca, which degrade performance significantly. InferAddrSpace pass is enhanced in D91121 to take the assumption that generic pointers loaded from the constant memory could be regarded global ones. The need for the coercion on aggregate arguments is mitigated.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hliao created this revision.Oct 22 2020, 12:54 PM

Herald added a project: Restricted Project. · View Herald TranscriptOct 22 2020, 12:54 PM

Herald added subscribers: cfe-commits, kerbowa, nhaehnle, jvesely. · View Herald Transcript

hliao requested review of this revision.Oct 22 2020, 12:54 PM

Herald added a subscriber: wdng. · View Herald TranscriptOct 22 2020, 12:54 PM

Are there any tests to illustrate what this change does to IR or generated code?

Harbormaster completed remote builds in B76082: Diff 300084.Oct 22 2020, 1:47 PM

Test case is enhanced to check that no kernel argument type is coerced.

In D89980#2348339, @tra wrote:

Are there any tests to illustrate what this change does to IR or generated code?

the existing test kernel-args.cu is enhanced by adding a pointer in that aggregate kernel argument. Previously, as the pointer is used in that aggregate kernel argument, it will be coerced into a global pointer. Now, it won't be changed anymore.

Revise the comment and point the safety issue by coercing the kernel argument
from a generic pointer to a global one.

arsenm requested changes to this revision.Oct 27 2020, 7:50 AM

arsenm added inline comments.

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
30	This test should not be deleted. I want to see the diff in the IR produced here. I think this still be worse, since AA does not answer all of the problems of using generic pointers

This revision now requires changes to proceed.Oct 27 2020, 7:50 AM

hliao added inline comments.Oct 27 2020, 7:54 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
30	Could you elaborate on problems using generic pointers?

Harbormaster completed remote builds in B76548: Diff 300985.Oct 27 2020, 8:13 AM

Harbormaster completed remote builds in B76551: Diff 300989.Oct 27 2020, 8:37 AM

Add amdgpu-kernel-arg-pointer-type.cu back and revise its checks.

arsenm added inline comments.Oct 27 2020, 9:06 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	This is still a regression. Fixing up AA does not solve the problem this promotions this is intended to solve. Generic accesses are worse independently of the aliasing properties

hliao added inline comments.Oct 27 2020, 9:29 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	Do you mean FLAT load/store has worse addressing mode than GLOBAL ones?

arsenm added inline comments.Oct 27 2020, 9:30 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	Yes. The flat offsets have a smaller range, and do not have the saddr mode. Flat accesses also won't avoid the extra lgmkcnt wait

hliao added inline comments.Oct 27 2020, 9:33 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	I plan to add support to select GLOBAL ones once we could confirm that pointer could only point to GLOBAL/CONSTANT address spaces. Do you think that's a reasonable solution?

arsenm added inline comments.Oct 27 2020, 9:42 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	I would much rather have the IR express the address space rather than fixing it up later. IR passes are aware of the addressing mode differences. Relying on AA for basic selection would also be worse for compile time

Harbormaster completed remote builds in B76574: Diff 301012.Oct 27 2020, 9:54 AM

hliao added inline comments.Oct 27 2020, 11:17 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	LLVM IR is agnostic to the underlying addressing mode. In code selection, we won't use AA but just needs to check IR.

Besides the unpromotable alloca issue due to indirect accesses, such coercion to GLOBAL pointer directly is not safe as, in HIP/CUDA, both CONSTANT and GLOBAL pointers would be passed as the kernel arguments. Without introducing a new address space combing GLOBAL/CONSTANT, such coercion would be unsafe.

In D89980#2357208, @hliao wrote:

Besides the unpromotable alloca issue due to indirect accesses, such coercion to GLOBAL pointer directly is not safe as, in HIP/CUDA, both CONSTANT and GLOBAL pointers would be passed as the kernel arguments. Without introducing a new address space combing GLOBAL/CONSTANT, such coercion would be unsafe.

I'm not sure what you are saying here. Constant address space doesn't really exist and we could eliminate it. It is always valid to use global in place of constant.

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
22	It's not really. The codegen IR passes check the addressing mode based on the indexed address space. Using a flat pointer is strictly worse than an addrspace(1) pointer

Even GLOBAL may have a better addressing mode, the unpromotable alloca resolved in this change has an even significant performance issue. We could favor GLOBAL LOAD/STORE for kernel function as I proposed in other threads but, considering that an aggregate argument may be accessed indirectly, we need to pass it indirectly.

I think this is a dead end approach. I don't see the connection to the original problem you are trying to solve. Can you send me an IR testcase that this is supposed to help?

This revision now requires changes to proceed.Nov 2 2020, 8:31 AM

In D89980#2368506, @arsenm wrote:

I think this is a dead end approach. I don't see the connection to the original problem you are trying to solve. Can you send me an IR testcase that this is supposed to help?

That's probably commonly known. If we pass an aggregate parameter directly by value and dynamically index it late, that alloca cannot be promoted as that aggregate value in LLVM IR cannot be dynamically indexed. For example,

struct S {
   int a[100];
   int n;
};

int foo(S s) {
  return s.a[s.n];
}

If the underlying ABI chooses to pass s directly by value, we have the following pseudo IR.

%s = alloca S
; store `s` value into %s as the parameter is treated as a local variable by filling its initial value from LLVM IR parameter.
...
; regular parameter access through %s with dynamic indices

that store from the parameter from LLVM IR is an aggregate value store. Later, when %s is to be promoted, as it's once dynamically indexed, we cannot promote it as dynamic index on aggregate values is not representable in LLVM IR.

In contrast, if a parameter is passed by value indirectly, that store is replaced with a memcpy. It's straightforward to promote '%s' as they are all memory operands of the same layout.

If you need detailed IR, I may post here for your reference.

In D89980#2371270, @hliao wrote:
In D89980#2368506, @arsenm wrote:

I think this is a dead end approach. I don't see the connection to the original problem you are trying to solve. Can you send me an IR testcase that this is supposed to help?

That's probably commonly known. If we pass an aggregate parameter directly by value and dynamically index it late, that alloca cannot be promoted as that aggregate value in LLVM IR cannot be dynamically indexed. For example,
struct S {
   int a[100];
   int n;
};

int foo(S s) {
  return s.a[s.n];
}

This example is not a kernel

If the underlying ABI chooses to pass s directly by value, we have the following pseudo IR.
%s = alloca S
; store `s` value into %s as the parameter is treated as a local variable by filling its initial value from LLVM IR parameter.
...
; regular parameter access through %s with dynamic indices
that store from the parameter from LLVM IR is an aggregate value store. Later, when %s is to be promoted, as it's once dynamically indexed, we cannot promote it as dynamic index on aggregate values is not representable in LLVM IR.

In contrast, if a parameter is passed by value indirectly, that store is replaced with a memcpy. It's straightforward to promote '%s' as they are all memory operands of the same layout.

If you need detailed IR, I may post here for your reference.

I need an actual source and IR example. I think you are describing the missing promotion of pointers inside byref arguments. We need better promotion here, not eliminate it. It needs to cast the byref pointer, or cast the pointers inside the struct when accessed

The code could be simply converted to a kernel one following the same pattern:

struct S {
        float *p;
        float a[64];
        int n;
};

__global__ void kernel(S s) {
        *s.p = s.a[s.n];
}

Here's the LLVM IR after frontend

define protected amdgpu_kernel void @_Z6kernel1S(%struct.S.coerce %0) #2 {
  %2 = alloca %struct.S, align 8, addrspace(5)
  %3 = addrspacecast %struct.S addrspace(5)* %2 to %struct.S*
  %4 = bitcast %struct.S* %3 to %struct.S.coerce*
  %5 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 0
  %6 = extractvalue %struct.S.coerce %0, 0
  store float addrspace(1)* %6, float addrspace(1)** %5, align 8
  %7 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 1
  %8 = extractvalue %struct.S.coerce %0, 1
  store [64 x float] %8, [64 x float]* %7, align 8
  %9 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 2
  %10 = extractvalue %struct.S.coerce %0, 2
  store i32 %10, i32* %9, align 8
  %11 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 1
  %12 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 2
  %13 = load i32, i32* %12, align 8, !tbaa !12
  %14 = sext i32 %13 to i64
  %15 = getelementptr inbounds [64 x float], [64 x float]* %11, i64 0, i64 %14
  %16 = load float, float* %15, align 4, !tbaa !14
  %17 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 0
  %18 = load float*, float** %17, align 8, !tbaa !16
  store float %16, float* %18, align 4, !tbaa !14
  ret void
}

and here's the optimized IR before codegen

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
target triple = "amdgcn-amd-amdhsa"

%struct.S.coerce = type { float addrspace(1)*, [64 x float], i32 }
%struct.S = type { float*, [64 x float], i32 }

; Function Attrs: nofree norecurse nounwind writeonly
define protected amdgpu_kernel void @_Z6kernel1S(%struct.S.coerce %0) local_unnamed_addr #0 {
  %2 = alloca %struct.S, align 8, addrspace(5)
  %3 = bitcast %struct.S addrspace(5)* %2 to float addrspace(1)* addrspace(5)*
  %4 = extractvalue %struct.S.coerce %0, 0
  store float addrspace(1)* %4, float addrspace(1)* addrspace(5)* %3, align 8
  %5 = extractvalue %struct.S.coerce %0, 1
  %6 = extractvalue [64 x float] %5, 0
  %7 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 0
  store float %6, float addrspace(5)* %7, align 8
  %8 = extractvalue [64 x float] %5, 1
  %9 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 1
  store float %8, float addrspace(5)* %9, align 4
  %10 = extractvalue [64 x float] %5, 2
  %11 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 2
  store float %10, float addrspace(5)* %11, align 8
  %12 = extractvalue [64 x float] %5, 3
  %13 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 3
  store float %12, float addrspace(5)* %13, align 4
  %14 = extractvalue [64 x float] %5, 4
  %15 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 4
  store float %14, float addrspace(5)* %15, align 8
  %16 = extractvalue [64 x float] %5, 5
  %17 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 5
  store float %16, float addrspace(5)* %17, align 4
  %18 = extractvalue [64 x float] %5, 6
  %19 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 6
  store float %18, float addrspace(5)* %19, align 8
  %20 = extractvalue [64 x float] %5, 7
  %21 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 7
  store float %20, float addrspace(5)* %21, align 4
  %22 = extractvalue [64 x float] %5, 8
  %23 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 8
  store float %22, float addrspace(5)* %23, align 8
  %24 = extractvalue [64 x float] %5, 9
  %25 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 9
  store float %24, float addrspace(5)* %25, align 4
  %26 = extractvalue [64 x float] %5, 10
  %27 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 10
  store float %26, float addrspace(5)* %27, align 8
  %28 = extractvalue [64 x float] %5, 11
  %29 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 11
  store float %28, float addrspace(5)* %29, align 4
  %30 = extractvalue [64 x float] %5, 12
  %31 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 12
  store float %30, float addrspace(5)* %31, align 8
  %32 = extractvalue [64 x float] %5, 13
  %33 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 13
  store float %32, float addrspace(5)* %33, align 4
  %34 = extractvalue [64 x float] %5, 14
  %35 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 14
  store float %34, float addrspace(5)* %35, align 8
  %36 = extractvalue [64 x float] %5, 15
  %37 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 15
  store float %36, float addrspace(5)* %37, align 4
  %38 = extractvalue [64 x float] %5, 16
  %39 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 16
  store float %38, float addrspace(5)* %39, align 8
  %40 = extractvalue [64 x float] %5, 17
  %41 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 17
  store float %40, float addrspace(5)* %41, align 4
  %42 = extractvalue [64 x float] %5, 18
  %43 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 18
  store float %42, float addrspace(5)* %43, align 8
  %44 = extractvalue [64 x float] %5, 19
  %45 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 19
  store float %44, float addrspace(5)* %45, align 4
  %46 = extractvalue [64 x float] %5, 20
  %47 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 20
  store float %46, float addrspace(5)* %47, align 8
  %48 = extractvalue [64 x float] %5, 21
  %49 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 21
  store float %48, float addrspace(5)* %49, align 4
  %50 = extractvalue [64 x float] %5, 22
  %51 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 22
  store float %50, float addrspace(5)* %51, align 8
  %52 = extractvalue [64 x float] %5, 23
  %53 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 23
  store float %52, float addrspace(5)* %53, align 4
  %54 = extractvalue [64 x float] %5, 24
  %55 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 24
  store float %54, float addrspace(5)* %55, align 8
  %56 = extractvalue [64 x float] %5, 25
  %57 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 25
  store float %56, float addrspace(5)* %57, align 4
  %58 = extractvalue [64 x float] %5, 26
  %59 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 26
  store float %58, float addrspace(5)* %59, align 8
  %60 = extractvalue [64 x float] %5, 27
  %61 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 27
  store float %60, float addrspace(5)* %61, align 4
  %62 = extractvalue [64 x float] %5, 28
  %63 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 28
  store float %62, float addrspace(5)* %63, align 8
  %64 = extractvalue [64 x float] %5, 29
  %65 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 29
  store float %64, float addrspace(5)* %65, align 4
  %66 = extractvalue [64 x float] %5, 30
  %67 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 30
  store float %66, float addrspace(5)* %67, align 8
  %68 = extractvalue [64 x float] %5, 31
  %69 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 31
  store float %68, float addrspace(5)* %69, align 4
  %70 = extractvalue [64 x float] %5, 32
  %71 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 32
  store float %70, float addrspace(5)* %71, align 8
  %72 = extractvalue [64 x float] %5, 33
  %73 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 33
  store float %72, float addrspace(5)* %73, align 4
  %74 = extractvalue [64 x float] %5, 34
  %75 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 34
  store float %74, float addrspace(5)* %75, align 8
  %76 = extractvalue [64 x float] %5, 35
  %77 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 35
  store float %76, float addrspace(5)* %77, align 4
  %78 = extractvalue [64 x float] %5, 36
  %79 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 36
  store float %78, float addrspace(5)* %79, align 8
  %80 = extractvalue [64 x float] %5, 37
  %81 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 37
  store float %80, float addrspace(5)* %81, align 4
  %82 = extractvalue [64 x float] %5, 38
  %83 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 38
  store float %82, float addrspace(5)* %83, align 8
  %84 = extractvalue [64 x float] %5, 39
  %85 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 39
  store float %84, float addrspace(5)* %85, align 4
  %86 = extractvalue [64 x float] %5, 40
  %87 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 40
  store float %86, float addrspace(5)* %87, align 8
  %88 = extractvalue [64 x float] %5, 41
  %89 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 41
  store float %88, float addrspace(5)* %89, align 4
  %90 = extractvalue [64 x float] %5, 42
  %91 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 42
  store float %90, float addrspace(5)* %91, align 8
  %92 = extractvalue [64 x float] %5, 43
  %93 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 43
  store float %92, float addrspace(5)* %93, align 4
  %94 = extractvalue [64 x float] %5, 44
  %95 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 44
  store float %94, float addrspace(5)* %95, align 8
  %96 = extractvalue [64 x float] %5, 45
  %97 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 45
  store float %96, float addrspace(5)* %97, align 4
  %98 = extractvalue [64 x float] %5, 46
  %99 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 46
  store float %98, float addrspace(5)* %99, align 8
  %100 = extractvalue [64 x float] %5, 47
  %101 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 47
  store float %100, float addrspace(5)* %101, align 4
  %102 = extractvalue [64 x float] %5, 48
  %103 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 48
  store float %102, float addrspace(5)* %103, align 8
  %104 = extractvalue [64 x float] %5, 49
  %105 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 49
  store float %104, float addrspace(5)* %105, align 4
  %106 = extractvalue [64 x float] %5, 50
  %107 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 50
  store float %106, float addrspace(5)* %107, align 8
  %108 = extractvalue [64 x float] %5, 51
  %109 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 51
  store float %108, float addrspace(5)* %109, align 4
  %110 = extractvalue [64 x float] %5, 52
  %111 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 52
  store float %110, float addrspace(5)* %111, align 8
  %112 = extractvalue [64 x float] %5, 53
  %113 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 53
  store float %112, float addrspace(5)* %113, align 4
  %114 = extractvalue [64 x float] %5, 54
  %115 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 54
  store float %114, float addrspace(5)* %115, align 8
  %116 = extractvalue [64 x float] %5, 55
  %117 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 55
  store float %116, float addrspace(5)* %117, align 4
  %118 = extractvalue [64 x float] %5, 56
  %119 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 56
  store float %118, float addrspace(5)* %119, align 8
  %120 = extractvalue [64 x float] %5, 57
  %121 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 57
  store float %120, float addrspace(5)* %121, align 4
  %122 = extractvalue [64 x float] %5, 58
  %123 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 58
  store float %122, float addrspace(5)* %123, align 8
  %124 = extractvalue [64 x float] %5, 59
  %125 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 59
  store float %124, float addrspace(5)* %125, align 4
  %126 = extractvalue [64 x float] %5, 60
  %127 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 60
  store float %126, float addrspace(5)* %127, align 8
  %128 = extractvalue [64 x float] %5, 61
  %129 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 61
  store float %128, float addrspace(5)* %129, align 4
  %130 = extractvalue [64 x float] %5, 62
  %131 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 62
  store float %130, float addrspace(5)* %131, align 8
  %132 = extractvalue [64 x float] %5, 63
  %133 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 63
  store float %132, float addrspace(5)* %133, align 4
  %134 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 2
  %135 = extractvalue %struct.S.coerce %0, 2
  store i32 %135, i32 addrspace(5)* %134, align 8
  %136 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 %135
  %137 = bitcast float addrspace(5)* %136 to i32 addrspace(5)*
  %138 = load i32, i32 addrspace(5)* %137, align 4, !tbaa !4
  %139 = bitcast %struct.S addrspace(5)* %2 to i32* addrspace(5)*
  %140 = load i32*, i32* addrspace(5)* %139, align 8, !tbaa !8
  store i32 %138, i32* %140, align 4, !tbaa !4
  ret void
}

and here's the optimized after this patch, the alloca is eliminated.

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
target triple = "amdgcn-amd-amdhsa"

%struct.S = type { float*, [64 x float], i32 }

; Function Attrs: nofree norecurse nounwind writeonly
define protected amdgpu_kernel void @_Z6kernel1S(%struct.S addrspace(4)* nocapture readonly byref(%struct.S) align 8 %0) local_unnamed_addr #0 {
  %2 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 2
  %3 = load i32, i32 addrspace(4)* %2, align 8, !tbaa !5
  %4 = sext i32 %3 to i64
  %5 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 1, i64 %4
  %6 = load float, float addrspace(4)* %5, align 4, !tbaa !11
  %7 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 0
  %8 = load float*, float* addrspace(4)* %7, align 8, !tbaa !13
  store float %6, float* %8, align 4, !tbaa !11
  ret void
}

In D89980#2371290, @arsenm wrote:
In D89980#2371270, @hliao wrote:
In D89980#2368506, @arsenm wrote:

I think this is a dead end approach. I don't see the connection to the original problem you are trying to solve. Can you send me an IR testcase that this is supposed to help?

That's probably commonly known. If we pass an aggregate parameter directly by value and dynamically index it late, that alloca cannot be promoted as that aggregate value in LLVM IR cannot be dynamically indexed. For example,
struct S {
   int a[100];
   int n;
};

int foo(S s) {
  return s.a[s.n];
}
This example is not a kernel
If the underlying ABI chooses to pass s directly by value, we have the following pseudo IR.
%s = alloca S
; store `s` value into %s as the parameter is treated as a local variable by filling its initial value from LLVM IR parameter.
...
; regular parameter access through %s with dynamic indices
that store from the parameter from LLVM IR is an aggregate value store. Later, when %s is to be promoted, as it's once dynamically indexed, we cannot promote it as dynamic index on aggregate values is not representable in LLVM IR.

In contrast, if a parameter is passed by value indirectly, that store is replaced with a memcpy. It's straightforward to promote '%s' as they are all memory operands of the same layout.

If you need detailed IR, I may post here for your reference.
I need an actual source and IR example. I think you are describing the missing promotion of pointers inside byref arguments. We need better promotion here, not eliminate it. It needs to cast the byref pointer, or cast the pointers inside the struct when accessed

@jlebar -- FYI. This looks pretty similar to the issue you've reported recently for NVPTX.

In D89980#2371526, @tra wrote:

@jlebar -- FYI. This looks pretty similar to the issue you've reported recently for NVPTX.

is that reported in bugs.llvm.org?

In D89980#2371580, @hliao wrote:

is that reported in bugs.llvm.org?

It was exposed in our internal code, but the situation is almost identical to your IR example. https://godbolt.org/z/EPPn6h
For NVPTX we lower byval arguments as alloca because we must guarantee that params are never accessed via generic AS pointer.
Ideally we would want to avoid copying the argument into the local space and access the arg via parameter AS when we can.

This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type

In D89980#2371850, @arsenm wrote:

This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type

Once we use byref, that in-memory byref type has no way to be preserved based on C model as it will be treated as a local variable. The initial value with the coerced type won't be preserved after that. That happens to the case with static index as well, but the promotion helps to build the chain from the initial value to the final use. But, if we cannot promote alloca finally, we lost that information or cannot assume that.

In D89980#2371952, @hliao wrote:

In D89980#2371850, @arsenm wrote:

This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type

Once we use byref, that in-memory byref type has no way to be preserved based on C model as it will be treated as a local variable. The initial value with the coerced type won't be preserved after that. That happens to the case with static index as well, but the promotion helps to build the chain from the initial value to the final use. But, if we cannot promote alloca finally, we lost that information or cannot assume that.

Then the promotion can also be applied to the temporary argument slot

In D89980#2371966, @arsenm wrote:

In D89980#2371952, @hliao wrote:

In D89980#2371850, @arsenm wrote:

This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type

Once we use byref, that in-memory byref type has no way to be preserved based on C model as it will be treated as a local variable. The initial value with the coerced type won't be preserved after that. That happens to the case with static index as well, but the promotion helps to build the chain from the initial value to the final use. But, if we cannot promote alloca finally, we lost that information or cannot assume that.

Then the promotion can also be applied to the temporary argument slot

That's not safe to do that in the frontend as all arguments are local variables as well. They may be modified later. Such an assumption (pointers from argument is GLOBAL or CONSTANT) won't hold anymore.

In D89980#2372102, @hliao wrote:

In D89980#2371966, @arsenm wrote:

In D89980#2371952, @hliao wrote:

In D89980#2371850, @arsenm wrote:

This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type

Once we use byref, that in-memory byref type has no way to be preserved based on C model as it will be treated as a local variable. The initial value with the coerced type won't be preserved after that. That happens to the case with static index as well, but the promotion helps to build the chain from the initial value to the final use. But, if we cannot promote alloca finally, we lost that information or cannot assume that.

Then the promotion can also be applied to the temporary argument slot

That's not safe to do that in the frontend as all arguments are local variables as well. They may be modified later. Such an assumption (pointers from argument is GLOBAL or CONSTANT) won't hold anymore.

The type in the kernel argument byref does not need to match the alloca's type. The coercions can be inserted when initializing the argument alloca

Remove aggregate kernel argument coercion only.

Revise the commit message.

hliao retitled this revision from [hip] Remove kernel argument coercion. to [hip] Remove the coercion on aggregate kernel arguments..Nov 10 2020, 7:24 AM

hliao edited the summary of this revision. (Show Details)

Harbormaster completed remote builds in B78294: Diff 304182.Nov 10 2020, 7:56 AM

Harbormaster completed remote builds in B78293: Diff 304181.Nov 10 2020, 8:00 AM

hliao added a reviewer: msearles.Nov 10 2020, 9:42 PM

PING for review

arsenm added inline comments.Nov 12 2020, 11:25 AM

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
55	Should also have a case with a single element struct, which will be treated like a scalar

Add a test case for the single element struct.

hliao marked an inline comment as done.Nov 12 2020, 1:54 PM

hliao added inline comments.

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
55	in test kernel8

Harbormaster completed remote builds in B78689: Diff 304969.Nov 12 2020, 2:38 PM

arsenm accepted this revision.Nov 12 2020, 3:08 PM

This revision is now accepted and ready to land.Nov 12 2020, 3:08 PM

This revision was landed with ongoing or failed builds.Nov 12 2020, 6:19 PM

Closed by commit rG8920ef06a138: [hip] Remove the coercion on aggregate kernel arguments. (authored by hliao). · Explain Why

This revision was automatically updated to reflect the committed changes.

hliao marked an inline comment as done.

hliao added a commit: rG8920ef06a138: [hip] Remove the coercion on aggregate kernel arguments..

Revision Contents

Path

Size

clang/

lib/

CodeGen/

TargetInfo.cpp

28 lines

test/

CodeGenCUDA/

amdgpu-kernel-arg-pointer-type.cu

51 lines

kernel-args.cu

21 lines

Diff 305008

clang/lib/CodeGen/TargetInfo.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 8,706 Lines • ▼ Show 20 Lines	private:
static const unsigned MaxNumRegsForArgsRet = 16;		static const unsigned MaxNumRegsForArgsRet = 16;

unsigned numRegsForType(QualType Ty) const;		unsigned numRegsForType(QualType Ty) const;

bool isHomogeneousAggregateBaseType(QualType Ty) const override;		bool isHomogeneousAggregateBaseType(QualType Ty) const override;
bool isHomogeneousAggregateSmallEnough(const Type *Base,		bool isHomogeneousAggregateSmallEnough(const Type *Base,
uint64_t Members) const override;		uint64_t Members) const override;

// Coerce HIP pointer arguments from generic pointers to global ones.		// Coerce HIP scalar pointer arguments from generic pointers to global ones.
llvm::Type coerceKernelArgumentType(llvm::Type Ty, unsigned FromAS,		llvm::Type coerceKernelArgumentType(llvm::Type Ty, unsigned FromAS,
unsigned ToAS) const {		unsigned ToAS) const {
// Structure types.
if (auto STy = dyn_cast<llvm::StructType>(Ty)) {
SmallVector<llvm::Type *, 8> EltTys;
bool Changed = false;
for (auto T : STy->elements()) {
auto NT = coerceKernelArgumentType(T, FromAS, ToAS);
EltTys.push_back(NT);
Changed \|= (NT != T);
}
// Skip if there is no change in element types.
if (!Changed)
return STy;
if (STy->hasName())
return llvm::StructType::create(
EltTys, (STy->getName() + ".coerce").str(), STy->isPacked());
return llvm::StructType::get(getVMContext(), EltTys, STy->isPacked());
}
// Array types.
if (auto ATy = dyn_cast<llvm::ArrayType>(Ty)) {
auto T = ATy->getElementType();
auto NT = coerceKernelArgumentType(T, FromAS, ToAS);
// Skip if there is no change in that element type.
if (NT == T)
return ATy;
return llvm::ArrayType::get(NT, ATy->getNumElements());
}
// Single value types.		// Single value types.
if (Ty->isPointerTy() && Ty->getPointerAddressSpace() == FromAS)		if (Ty->isPointerTy() && Ty->getPointerAddressSpace() == FromAS)
return llvm::PointerType::get(		return llvm::PointerType::get(
cast<llvm::PointerType>(Ty)->getElementType(), ToAS);		cast<llvm::PointerType>(Ty)->getElementType(), ToAS);
return Ty;		return Ty;
}		}

public:		public:
▲ Show 20 Lines • Show All 2,437 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu

	// REQUIRES: x86-registered-target			// REQUIRES: x86-registered-target
	// REQUIRES: amdgpu-registered-target			// REQUIRES: amdgpu-registered-target

	// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -x hip %s -o - \| FileCheck --check-prefixes=COMMON,CHECK %s			// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -x hip %s -o - \| FileCheck --check-prefixes=COMMON,CHECK %s
	// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -x hip %s -disable-O0-optnone -o - \| opt -S -O2 \| FileCheck %s --check-prefixes=COMMON,OPT			// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -x hip %s -disable-O0-optnone -o - \| opt -S -O2 \| FileCheck %s --check-prefixes=COMMON,OPT
	// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm -x hip %s -o - \| FileCheck -check-prefix=HOST %s			// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm -x hip %s -o - \| FileCheck -check-prefix=HOST %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	// Coerced struct from `struct S` without all generic pointers lowered into			// Coerced struct from `struct S` without all generic pointers lowered into
	// global ones.			// global ones.
	// COMMON: %struct.S.coerce = type { i32 addrspace(1), float addrspace(1) }
	// COMMON: %struct.T.coerce = type { [2 x float addrspace(1)*] }

	// On the host-side compilation, generic pointer won't be coerced.			// On the host-side compilation, generic pointer won't be coerced.
	// HOST-NOT: %struct.S.coerce			// HOST-NOT: %struct.S.coerce
	// HOST-NOT: %struct.T.coerce			// HOST-NOT: %struct.T.coerce

	// HOST: define void @_Z22__device_stub__kernel1Pi(i32* %x)			// HOST: define void @_Z22__device_stub__kernel1Pi(i32* %x)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel1Pi(i32 addrspace(1){{.}} %x.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel1Pi(i32 addrspace(1){{.}} %x.coerce)
	// CHECK: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]			// CHECK: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
	// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]			// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
	// OPT: [[VAL:%.]] = load i32, i32 addrspace(1) %x.coerce, align 4			// OPT: [[VAL:%.]] = load i32, i32 addrspace(1) %x.coerce, align 4
	// OPT: [[INC:%.*]] = add nsw i32 [[VAL]], 1			// OPT: [[INC:%.*]] = add nsw i32 [[VAL]], 1
				arsenmUnsubmitted Not Done Reply Inline Actions This is still a regression. Fixing up AA does not solve the problem this promotions this is intended to solve. Generic accesses are worse independently of the aliasing properties arsenm: This is still a regression. Fixing up AA does not solve the problem this promotions this is…
				hliaoAuthorUnsubmitted Done Reply Inline Actions Do you mean FLAT load/store has worse addressing mode than GLOBAL ones? hliao: Do you mean FLAT load/store has worse addressing mode than GLOBAL ones?
				arsenmUnsubmitted Not Done Reply Inline Actions Yes. The flat offsets have a smaller range, and do not have the saddr mode. Flat accesses also won't avoid the extra lgmkcnt wait arsenm: Yes. The flat offsets have a smaller range, and do not have the saddr mode. Flat accesses also…
				hliaoAuthorUnsubmitted Done Reply Inline Actions I plan to add support to select GLOBAL ones once we could confirm that pointer could only point to GLOBAL/CONSTANT address spaces. Do you think that's a reasonable solution? hliao: I plan to add support to select GLOBAL ones once we could confirm that pointer could only point…
				arsenmUnsubmitted Not Done Reply Inline Actions I would much rather have the IR express the address space rather than fixing it up later. IR passes are aware of the addressing mode differences. Relying on AA for basic selection would also be worse for compile time arsenm: I would much rather have the IR express the address space rather than fixing it up later. IR…
				hliaoAuthorUnsubmitted Done Reply Inline Actions LLVM IR is agnostic to the underlying addressing mode. In code selection, we won't use AA but just needs to check IR. hliao: LLVM IR is agnostic to the underlying addressing mode. In code selection, we won't use AA but…
				arsenmUnsubmitted Not Done Reply Inline Actions It's not really. The codegen IR passes check the addressing mode based on the indexed address space. Using a flat pointer is strictly worse than an addrspace(1) pointer arsenm: It's not really. The codegen IR passes check the addressing mode based on the indexed address…
	// OPT: store i32 [[INC]], i32 addrspace(1)* %x.coerce, align 4			// OPT: store i32 [[INC]], i32 addrspace(1)* %x.coerce, align 4
	// OPT: ret void			// OPT: ret void
	__global__ void kernel1(int *x) {			__global__ void kernel1(int *x) {
	x[0]++;			x[0]++;
	}			}

	arsenmUnsubmitted Not Done Reply Inline Actions This test should not be deleted. I want to see the diff in the IR produced here. I think this still be worse, since AA does not answer all of the problems of using generic pointers arsenm: This test should not be deleted. I want to see the diff in the IR produced here. I think this…
	hliaoAuthorUnsubmitted Done Reply Inline Actions Could you elaborate on problems using generic pointers? hliao: Could you elaborate on problems using generic pointers?
	// HOST: define void @_Z22__device_stub__kernel2Ri(i32* nonnull align 4 dereferenceable(4) %x)			// HOST: define void @_Z22__device_stub__kernel2Ri(i32* nonnull align 4 dereferenceable(4) %x)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel2Ri(i32 addrspace(1){{.}} nonnull align 4 dereferenceable(4) %x.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel2Ri(i32 addrspace(1){{.}} nonnull align 4 dereferenceable(4) %x.coerce)
	// CHECK: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]			// CHECK: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
	// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]			// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
	// OPT: [[VAL:%.]] = load i32, i32 addrspace(1) %x.coerce, align 4			// OPT: [[VAL:%.]] = load i32, i32 addrspace(1) %x.coerce, align 4
	// OPT: [[INC:%.*]] = add nsw i32 [[VAL]], 1			// OPT: [[INC:%.*]] = add nsw i32 [[VAL]], 1
	// OPT: store i32 [[INC]], i32 addrspace(1)* %x.coerce, align 4			// OPT: store i32 [[INC]], i32 addrspace(1)* %x.coerce, align 4
	// OPT: ret void			// OPT: ret void
	Show All 10 Lines
	}			}

	// COMMON-LABEL: define void @_Z4funcPi(i32{{.}} %x)			// COMMON-LABEL: define void @_Z4funcPi(i32{{.}} %x)
	// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]			// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
	__device__ void func(int *x) {			__device__ void func(int *x) {
	x[0]++;			x[0]++;
	}			}

	struct S {			struct S {
				arsenmUnsubmitted Done Reply Inline Actions Should also have a case with a single element struct, which will be treated like a scalar arsenm: Should also have a case with a single element struct, which will be treated like a scalar
				hliaoAuthorUnsubmitted Done Reply Inline Actions in test kernel8 hliao: in test kernel8
	int *x;			int *x;
	float *y;			float *y;
	};			};
	// `by-val` struct will be coerced into a similar struct with all generic			// `by-val` struct will be coerced into a similar struct with all generic
	// pointers lowerd into global ones.			// pointers lowerd into global ones.
	// HOST: define void @_Z22__device_stub__kernel41S(i32* %s.coerce0, float* %s.coerce1)			// HOST: define void @_Z22__device_stub__kernel41S(i32* %s.coerce0, float* %s.coerce1)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel41S(%struct.S.coerce %s.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel41S(%struct.S addrspace(4){{.}} byref(%struct.S) align 8 %0)
	// OPT: [[P0:%.*]] = extractvalue %struct.S.coerce %s.coerce, 0			// OPT: [[R0:%.]] = getelementptr inbounds %struct.S, %struct.S addrspace(4) %0, i64 0, i32 0
	// OPT: [[P1:%.*]] = extractvalue %struct.S.coerce %s.coerce, 1			// OPT: [[P0:%.]] = load i32, i32* addrspace(4)* [[R0]], align 8
	// OPT: [[V0:%.]] = load i32, i32 addrspace(1) [[P0]], align 4			// OPT: [[R1:%.]] = getelementptr inbounds %struct.S, %struct.S addrspace(4) %0, i64 0, i32 1
				// OPT: [[P1:%.]] = load float, float* addrspace(4)* [[R1]], align 8
				// OPT: [[V0:%.]] = load i32, i32 [[P0]], align 4
	// OPT: [[INC:%.*]] = add nsw i32 [[V0]], 1			// OPT: [[INC:%.*]] = add nsw i32 [[V0]], 1
	// OPT: store i32 [[INC]], i32 addrspace(1)* [[P0]], align 4			// OPT: store i32 [[INC]], i32* [[P0]], align 4
	// OPT: [[V1:%.]] = load float, float addrspace(1) [[P1]], align 4			// OPT: [[V1:%.]] = load float, float [[P1]], align 4
	// OPT: [[ADD:%.*]] = fadd contract float [[V1]], 1.000000e+00			// OPT: [[ADD:%.*]] = fadd contract float [[V1]], 1.000000e+00
	// OPT: store float [[ADD]], float addrspace(1)* [[P1]], align 4			// OPT: store float [[ADD]], float* [[P1]], align 4
	// OPT: ret void			// OPT: ret void
	__global__ void kernel4(struct S s) {			__global__ void kernel4(struct S s) {
	s.x[0]++;			s.x[0]++;
	s.y[0] += 1.f;			s.y[0] += 1.f;
	}			}

	// If a pointer to struct is passed, only the pointer itself is coerced into the global one.			// If a pointer to struct is passed, only the pointer itself is coerced into the global one.
	// HOST: define void @_Z22__device_stub__kernel5P1S(%struct.S* %s)			// HOST: define void @_Z22__device_stub__kernel5P1S(%struct.S* %s)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel5P1S(%struct.S addrspace(1){{.}} %s.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel5P1S(%struct.S addrspace(1){{.}} %s.coerce)
	__global__ void kernel5(struct S *s) {			__global__ void kernel5(struct S *s) {
	s->x[0]++;			s->x[0]++;
	s->y[0] += 1.f;			s->y[0] += 1.f;
	}			}

	struct T {			struct T {
	float *x[2];			float *x[2];
	};			};
	// `by-val` array is also coerced.			// `by-val` array is also coerced.
	// HOST: define void @_Z22__device_stub__kernel61T(float* %t.coerce0, float* %t.coerce1)			// HOST: define void @_Z22__device_stub__kernel61T(float* %t.coerce0, float* %t.coerce1)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel61T(%struct.T.coerce %t.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel61T(%struct.T addrspace(4){{.}} byref(%struct.T) align 8 %0)
	// OPT: [[ARR:%.*]] = extractvalue %struct.T.coerce %t.coerce, 0			// OPT: [[R0:%.]] = getelementptr inbounds %struct.T, %struct.T addrspace(4) %0, i64 0, i32 0, i64 0
	// OPT: [[P0:%.]] = extractvalue [2 x float addrspace(1)] [[ARR]], 0			// OPT: [[P0:%.]] = load float, float* addrspace(4)* [[R0]], align 8
	// OPT: [[P1:%.]] = extractvalue [2 x float addrspace(1)] [[ARR]], 1			// OPT: [[R1:%.]] = getelementptr inbounds %struct.T, %struct.T addrspace(4) %0, i64 0, i32 0, i64 1
	// OPT: [[V0:%.]] = load float, float addrspace(1) [[P0]], align 4			// OPT: [[P1:%.]] = load float, float* addrspace(4)* [[R1]], align 8
				// OPT: [[V0:%.]] = load float, float [[P0]], align 4
	// OPT: [[ADD0:%.*]] = fadd contract float [[V0]], 1.000000e+00			// OPT: [[ADD0:%.*]] = fadd contract float [[V0]], 1.000000e+00
	// OPT: store float [[ADD0]], float addrspace(1)* [[P0]], align 4			// OPT: store float [[ADD0]], float* [[P0]], align 4
	// OPT: [[V1:%.]] = load float, float addrspace(1) [[P1]], align 4			// OPT: [[V1:%.]] = load float, float [[P1]], align 4
	// OPT: [[ADD1:%.*]] = fadd contract float [[V1]], 2.000000e+00			// OPT: [[ADD1:%.*]] = fadd contract float [[V1]], 2.000000e+00
	// OPT: store float [[ADD1]], float addrspace(1)* [[P1]], align 4			// OPT: store float [[ADD1]], float* [[P1]], align 4
	// OPT: ret void			// OPT: ret void
	__global__ void kernel6(struct T t) {			__global__ void kernel6(struct T t) {
	t.x[0][0] += 1.f;			t.x[0][0] += 1.f;
	t.x[1][0] += 2.f;			t.x[1][0] += 2.f;
	}			}

	// Check that coerced pointers retain the noalias attribute when qualified with __restrict.			// Check that coerced pointers retain the noalias attribute when qualified with __restrict.
	// HOST: define void @_Z22__device_stub__kernel7Pi(i32* noalias %x)			// HOST: define void @_Z22__device_stub__kernel7Pi(i32* noalias %x)
	// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel7Pi(i32 addrspace(1)* noalias{{.*}} %x.coerce)			// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel7Pi(i32 addrspace(1)* noalias{{.*}} %x.coerce)
	__global__ void kernel7(int *__restrict x) {			__global__ void kernel7(int *__restrict x) {
	x[0]++;			x[0]++;
	}			}

				// Single element struct.
				struct SS {
				float *x;
				};
				// HOST: define void @_Z22__device_stub__kernel82SS(float* %a.coerce)
				// COMMON-LABEL: define amdgpu_kernel void @_Z7kernel82SS(float addrspace(1){{.}} %a.coerce)
				// CHECK: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
				// CHECK-NOT: = addrspacecast [[TYPE:.]] addrspace(1) %{{.}} to [[TYPE]]
				// OPT: [[VAL:%.]] = load float, float addrspace(1) %a.coerce, align 4
				// OPT: [[INC:%.*]] = fadd contract float [[VAL]], 3.000000e+00
				// OPT: store float [[INC]], float addrspace(1)* %a.coerce, align 4
				// OPT: ret void
				__global__ void kernel8(struct SS a) {
				*a.x += 3.f;
				}

clang/test/CodeGenCUDA/kernel-args.cu

	// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device \			// RUN: %clang_cc1 -x hip -triple amdgcn-amd-amdhsa -fcuda-is-device \
	// RUN: -emit-llvm %s -o - \| FileCheck -check-prefix=AMDGCN %s			// RUN: -emit-llvm %s -o - \| FileCheck -check-prefix=AMDGCN %s
	// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda- -fcuda-is-device \			// RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
	// RUN: -emit-llvm %s -o - \| FileCheck -check-prefix=NVPTX %s			// RUN: -emit-llvm %s -o - \| FileCheck -check-prefix=NVPTX %s
	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	struct A {			struct A {
	int a[32];			int a[32];
				float *p;
	};			};

	// AMDGCN: define amdgpu_kernel void @_Z6kernel1A(%struct.A addrspace(4)* byref(%struct.A) align 4 %{{.+}})			// AMDGCN: define amdgpu_kernel void @_Z6kernel1A(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}})
	// NVPTX: define void @_Z6kernel1A(%struct.A* byval(%struct.A) align 4 %x)			// NVPTX: define void @_Z6kernel1A(%struct.A* byval(%struct.A) align 8 %x)
	__global__ void kernel(A x) {			__global__ void kernel(A x) {
	}			}

	class Kernel {			class Kernel {
	public:			public:
	// AMDGCN: define amdgpu_kernel void @_ZN6Kernel12memberKernelE1A(%struct.A addrspace(4)* byref(%struct.A) align 4 %{{.+}})			// AMDGCN: define amdgpu_kernel void @_ZN6Kernel12memberKernelE1A(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}})
	// NVPTX: define void @_ZN6Kernel12memberKernelE1A(%struct.A* byval(%struct.A) align 4 %x)			// NVPTX: define void @_ZN6Kernel12memberKernelE1A(%struct.A* byval(%struct.A) align 8 %x)
	static __global__ void memberKernel(A x){}			static __global__ void memberKernel(A x){}
	template<typename T> static __global__ void templateMemberKernel(T x) {}			template<typename T> static __global__ void templateMemberKernel(T x) {}
	};			};


	template <typename T>			template <typename T>
	__global__ void templateKernel(T x) {}			__global__ void templateKernel(T x) {}

	void launch(void*);			void launch(void*);

	void test() {			void test() {
	Kernel K;			Kernel K;
	// AMDGCN: define amdgpu_kernel void @_Z14templateKernelI1AEvT_(%struct.A addrspace(4)* byref(%struct.A) align 4 %{{.+}}			// AMDGCN: define amdgpu_kernel void @_Z14templateKernelI1AEvT_(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}}
	// NVPTX: define void @_Z14templateKernelI1AEvT_(%struct.A* byval(%struct.A) align 4 %x)			// NVPTX: define void @_Z14templateKernelI1AEvT_(%struct.A* byval(%struct.A) align 8 %x)
	launch((void*)templateKernel<A>);			launch((void*)templateKernel<A>);

	// AMDGCN: define amdgpu_kernel void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A addrspace(4)* byref(%struct.A) align 4 %{{.+}}			// AMDGCN: define amdgpu_kernel void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}}
	// NVPTX: define void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A* byval(%struct.A) align 4 %x)			// NVPTX: define void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A* byval(%struct.A) align 8 %x)
	launch((void*)Kernel::templateMemberKernel<A>);			launch((void*)Kernel::templateMemberKernel<A>);
	}			}