- If an aggregate argument is indirectly accessed within a kernel, passing it directly by value results in an unpromotable alloca, which degrades performance significantly. The InferAddressSpaces pass is enhanced in D91121 to assume that generic pointers loaded from constant memory can be regarded as global ones, which mitigates the need to coerce pointers inside aggregate arguments.
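For concreteness, here is a minimal sketch (not taken from the patch; the struct and kernel names are illustrative) of the kind of aggregate kernel argument being discussed, one that is accessed with a runtime index and also carries a generic pointer:

struct Payload {              // illustrative type, not from the patch
  float data[64];
  int idx;                    // runtime index into `data`
  float *out;                 // generic pointer; per D91121, a pointer loaded
                              // from constant memory may be assumed global
};

__global__ void consume(Payload p) {
  // The dynamic index below is the "indirect access": if `p` is passed
  // directly by value, the frontend stores it into a private alloca, and the
  // dynamic index prevents that alloca from being promoted.
  *p.out = p.data[p.idx];
}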
The existing test kernel-args.cu is enhanced by adding a pointer member to the aggregate kernel argument. Previously, because that pointer is used inside the aggregate kernel argument, it would be coerced into a global pointer. Now it is left unchanged.
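As a hedged sketch (the member names here are made up and may not match the actual test contents), the enhanced aggregate looks roughly like this:

// Illustrative only: an aggregate kernel argument containing a pointer member.
struct Args {
  int vals[8];
  int *ptr;   // the newly added pointer member inside the aggregate
};

// Before this change, `ptr` would be coerced to a global pointer
// (int addrspace(1)*) in the kernel's IR signature; now the aggregate is
// passed as-is and the pointer member keeps its generic type.
__global__ void kern(Args a) {
  *a.ptr = a.vals[0];
}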
Revise the comment and point out the safety issue of coercing the kernel argument from a generic pointer to a global one.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:30
This test should not be deleted. I want to see the diff in the IR produced here. I think this will still be worse, since AA does not answer all of the problems of using generic pointers.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:30
Could you elaborate on the problems of using generic pointers?
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
This is still a regression. Fixing up AA does not solve the problem this promotion is intended to solve. Generic accesses are worse independently of the aliasing properties.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
Do you mean FLAT load/store has a worse addressing mode than GLOBAL ones?
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
Yes. The flat offsets have a smaller range and do not have the saddr mode. Flat accesses also won't avoid the extra lgkmcnt wait.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
I plan to add support to select GLOBAL ones once we can confirm that the pointer could only point to the GLOBAL/CONSTANT address spaces. Do you think that's a reasonable solution?
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
I would much rather have the IR express the address space than fix it up later. IR passes are aware of the addressing mode differences. Relying on AA for basic selection would also be worse for compile time.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
LLVM IR is agnostic to the underlying addressing mode. In code selection, we won't use AA but just need to check the IR.
Besides the unpromotable alloca issue due to indirect accesses, such direct coercion to a GLOBAL pointer is not safe because, in HIP/CUDA, both CONSTANT and GLOBAL pointers may be passed as kernel arguments. Without introducing a new address space combining GLOBAL and CONSTANT, such coercion would be unsafe.
I'm not sure what you are saying here. Constant address space doesn't really exist and we could eliminate it. It is always valid to use global in place of constant.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:1
It's not really. The codegen IR passes check the addressing mode based on the indexed address space. Using a flat pointer is strictly worse than an addrspace(1) pointer.
Even though GLOBAL may have a better addressing mode, the unpromotable alloca resolved in this change causes an even more significant performance issue. We could favor GLOBAL LOAD/STORE for kernel functions as I proposed in other threads but, considering that an aggregate argument may be accessed indirectly, we need to pass it indirectly.
I think this is a dead end approach. I don't see the connection to the original problem you are trying to solve. Can you send me an IR testcase that this is supposed to help?
That's probably commonly known. If we pass an aggregate parameter directly by value and dynamically index it later, the corresponding alloca cannot be promoted, because an aggregate value in LLVM IR cannot be dynamically indexed. For example,
struct S { int a[100]; int n; };
int foo(S s) { return s.a[s.n]; }
If the underlying ABI chooses to pass s directly by value, we have the following pseudo IR.
%s = alloca S   ; store `s` value into %s as the parameter is treated as a local variable by filling its initial value from LLVM IR parameter.
...             ; regular parameter access through %s with dynamic indices
That store of the parameter value from LLVM IR is an aggregate value store. Later, when %s is to be promoted, since it is dynamically indexed, we cannot promote it, as a dynamic index on aggregate values is not representable in LLVM IR.
In contrast, if the parameter is passed by value indirectly, that store is replaced with a memcpy. It is then straightforward to promote %s, as all its accesses are memory operands of the same layout.
If you need detailed IR, I can post it here for your reference.
This example is not a kernel
I need an actual source and IR example. I think you are describing the missing promotion of pointers inside byref arguments. We need better promotion here, not to eliminate it. It needs to cast the byref pointer, or cast the pointers inside the struct when accessed.
The code can simply be converted to a kernel following the same pattern:
struct S { float *p; float a[64]; int n; };
__global__ void kernel(S s) { *s.p = s.a[s.n]; }
Here's the LLVM IR after the frontend:
define protected amdgpu_kernel void @_Z6kernel1S(%struct.S.coerce %0) #2 {
  %2 = alloca %struct.S, align 8, addrspace(5)
  %3 = addrspacecast %struct.S addrspace(5)* %2 to %struct.S*
  %4 = bitcast %struct.S* %3 to %struct.S.coerce*
  %5 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 0
  %6 = extractvalue %struct.S.coerce %0, 0
  store float addrspace(1)* %6, float addrspace(1)** %5, align 8
  %7 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 1
  %8 = extractvalue %struct.S.coerce %0, 1
  store [64 x float] %8, [64 x float]* %7, align 8
  %9 = getelementptr inbounds %struct.S.coerce, %struct.S.coerce* %4, i32 0, i32 2
  %10 = extractvalue %struct.S.coerce %0, 2
  store i32 %10, i32* %9, align 8
  %11 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 1
  %12 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 2
  %13 = load i32, i32* %12, align 8, !tbaa !12
  %14 = sext i32 %13 to i64
  %15 = getelementptr inbounds [64 x float], [64 x float]* %11, i64 0, i64 %14
  %16 = load float, float* %15, align 4, !tbaa !14
  %17 = getelementptr inbounds %struct.S, %struct.S* %3, i32 0, i32 0
  %18 = load float*, float** %17, align 8, !tbaa !16
  store float %16, float* %18, align 4, !tbaa !14
  ret void
}
and here's the optimized IR before codegen
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7" target triple = "amdgcn-amd-amdhsa" %struct.S.coerce = type { float addrspace(1)*, [64 x float], i32 } %struct.S = type { float*, [64 x float], i32 } ; Function Attrs: nofree norecurse nounwind writeonly define protected amdgpu_kernel void @_Z6kernel1S(%struct.S.coerce %0) local_unnamed_addr #0 { %2 = alloca %struct.S, align 8, addrspace(5) %3 = bitcast %struct.S addrspace(5)* %2 to float addrspace(1)* addrspace(5)* %4 = extractvalue %struct.S.coerce %0, 0 store float addrspace(1)* %4, float addrspace(1)* addrspace(5)* %3, align 8 %5 = extractvalue %struct.S.coerce %0, 1 %6 = extractvalue [64 x float] %5, 0 %7 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 0 store float %6, float addrspace(5)* %7, align 8 %8 = extractvalue [64 x float] %5, 1 %9 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 1 store float %8, float addrspace(5)* %9, align 4 %10 = extractvalue [64 x float] %5, 2 %11 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 2 store float %10, float addrspace(5)* %11, align 8 %12 = extractvalue [64 x float] %5, 3 %13 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 3 store float %12, float addrspace(5)* %13, align 4 %14 = extractvalue [64 x float] %5, 4 %15 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 4 store float %14, float addrspace(5)* %15, align 8 %16 = extractvalue [64 x float] %5, 5 %17 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 5 store float %16, float addrspace(5)* %17, align 4 %18 = extractvalue [64 x float] %5, 6 %19 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 6 store float %18, float addrspace(5)* %19, align 8 %20 = extractvalue [64 x float] %5, 7 %21 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 7 store float %20, float addrspace(5)* %21, align 4 %22 = extractvalue [64 x float] %5, 8 %23 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 8 store float %22, float addrspace(5)* %23, align 8 %24 = extractvalue [64 x float] %5, 9 %25 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 9 store float %24, float addrspace(5)* %25, align 4 %26 = extractvalue [64 x float] %5, 10 %27 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 10 store float %26, float addrspace(5)* %27, align 8 %28 = extractvalue [64 x float] %5, 11 %29 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 11 store float %28, float addrspace(5)* %29, align 4 %30 = extractvalue [64 x float] %5, 12 %31 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 12 store float %30, float addrspace(5)* %31, align 8 %32 = extractvalue [64 x float] %5, 13 %33 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 13 store float %32, float addrspace(5)* %33, align 4 %34 = extractvalue [64 x float] %5, 14 %35 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 14 store float %34, float addrspace(5)* %35, align 8 %36 = extractvalue [64 x float] %5, 15 %37 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 15 store 
float %36, float addrspace(5)* %37, align 4 %38 = extractvalue [64 x float] %5, 16 %39 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 16 store float %38, float addrspace(5)* %39, align 8 %40 = extractvalue [64 x float] %5, 17 %41 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 17 store float %40, float addrspace(5)* %41, align 4 %42 = extractvalue [64 x float] %5, 18 %43 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 18 store float %42, float addrspace(5)* %43, align 8 %44 = extractvalue [64 x float] %5, 19 %45 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 19 store float %44, float addrspace(5)* %45, align 4 %46 = extractvalue [64 x float] %5, 20 %47 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 20 store float %46, float addrspace(5)* %47, align 8 %48 = extractvalue [64 x float] %5, 21 %49 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 21 store float %48, float addrspace(5)* %49, align 4 %50 = extractvalue [64 x float] %5, 22 %51 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 22 store float %50, float addrspace(5)* %51, align 8 %52 = extractvalue [64 x float] %5, 23 %53 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 23 store float %52, float addrspace(5)* %53, align 4 %54 = extractvalue [64 x float] %5, 24 %55 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 24 store float %54, float addrspace(5)* %55, align 8 %56 = extractvalue [64 x float] %5, 25 %57 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 25 store float %56, float addrspace(5)* %57, align 4 %58 = extractvalue [64 x float] %5, 26 %59 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 26 store float %58, float addrspace(5)* %59, align 8 %60 = extractvalue [64 x float] %5, 27 %61 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 27 store float %60, float addrspace(5)* %61, align 4 %62 = extractvalue [64 x float] %5, 28 %63 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 28 store float %62, float addrspace(5)* %63, align 8 %64 = extractvalue [64 x float] %5, 29 %65 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 29 store float %64, float addrspace(5)* %65, align 4 %66 = extractvalue [64 x float] %5, 30 %67 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 30 store float %66, float addrspace(5)* %67, align 8 %68 = extractvalue [64 x float] %5, 31 %69 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 31 store float %68, float addrspace(5)* %69, align 4 %70 = extractvalue [64 x float] %5, 32 %71 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 32 store float %70, float addrspace(5)* %71, align 8 %72 = extractvalue [64 x float] %5, 33 %73 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 33 store float %72, float addrspace(5)* %73, align 4 %74 = extractvalue [64 x float] %5, 34 %75 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 34 store float %74, float addrspace(5)* %75, align 8 %76 = extractvalue [64 x float] %5, 35 %77 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 35 
store float %76, float addrspace(5)* %77, align 4 %78 = extractvalue [64 x float] %5, 36 %79 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 36 store float %78, float addrspace(5)* %79, align 8 %80 = extractvalue [64 x float] %5, 37 %81 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 37 store float %80, float addrspace(5)* %81, align 4 %82 = extractvalue [64 x float] %5, 38 %83 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 38 store float %82, float addrspace(5)* %83, align 8 %84 = extractvalue [64 x float] %5, 39 %85 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 39 store float %84, float addrspace(5)* %85, align 4 %86 = extractvalue [64 x float] %5, 40 %87 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 40 store float %86, float addrspace(5)* %87, align 8 %88 = extractvalue [64 x float] %5, 41 %89 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 41 store float %88, float addrspace(5)* %89, align 4 %90 = extractvalue [64 x float] %5, 42 %91 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 42 store float %90, float addrspace(5)* %91, align 8 %92 = extractvalue [64 x float] %5, 43 %93 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 43 store float %92, float addrspace(5)* %93, align 4 %94 = extractvalue [64 x float] %5, 44 %95 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 44 store float %94, float addrspace(5)* %95, align 8 %96 = extractvalue [64 x float] %5, 45 %97 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 45 store float %96, float addrspace(5)* %97, align 4 %98 = extractvalue [64 x float] %5, 46 %99 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 46 store float %98, float addrspace(5)* %99, align 8 %100 = extractvalue [64 x float] %5, 47 %101 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 47 store float %100, float addrspace(5)* %101, align 4 %102 = extractvalue [64 x float] %5, 48 %103 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 48 store float %102, float addrspace(5)* %103, align 8 %104 = extractvalue [64 x float] %5, 49 %105 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 49 store float %104, float addrspace(5)* %105, align 4 %106 = extractvalue [64 x float] %5, 50 %107 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 50 store float %106, float addrspace(5)* %107, align 8 %108 = extractvalue [64 x float] %5, 51 %109 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 51 store float %108, float addrspace(5)* %109, align 4 %110 = extractvalue [64 x float] %5, 52 %111 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 52 store float %110, float addrspace(5)* %111, align 8 %112 = extractvalue [64 x float] %5, 53 %113 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 53 store float %112, float addrspace(5)* %113, align 4 %114 = extractvalue [64 x float] %5, 54 %115 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 54 store float %114, float addrspace(5)* %115, align 8 %116 = extractvalue [64 x float] %5, 55 %117 = getelementptr inbounds %struct.S, %struct.S 
addrspace(5)* %2, i32 0, i32 1, i32 55 store float %116, float addrspace(5)* %117, align 4 %118 = extractvalue [64 x float] %5, 56 %119 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 56 store float %118, float addrspace(5)* %119, align 8 %120 = extractvalue [64 x float] %5, 57 %121 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 57 store float %120, float addrspace(5)* %121, align 4 %122 = extractvalue [64 x float] %5, 58 %123 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 58 store float %122, float addrspace(5)* %123, align 8 %124 = extractvalue [64 x float] %5, 59 %125 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 59 store float %124, float addrspace(5)* %125, align 4 %126 = extractvalue [64 x float] %5, 60 %127 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 60 store float %126, float addrspace(5)* %127, align 8 %128 = extractvalue [64 x float] %5, 61 %129 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 61 store float %128, float addrspace(5)* %129, align 4 %130 = extractvalue [64 x float] %5, 62 %131 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 62 store float %130, float addrspace(5)* %131, align 8 %132 = extractvalue [64 x float] %5, 63 %133 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 63 store float %132, float addrspace(5)* %133, align 4 %134 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 2 %135 = extractvalue %struct.S.coerce %0, 2 store i32 %135, i32 addrspace(5)* %134, align 8 %136 = getelementptr inbounds %struct.S, %struct.S addrspace(5)* %2, i32 0, i32 1, i32 %135 %137 = bitcast float addrspace(5)* %136 to i32 addrspace(5)* %138 = load i32, i32 addrspace(5)* %137, align 4, !tbaa !4 %139 = bitcast %struct.S addrspace(5)* %2 to i32* addrspace(5)* %140 = load i32*, i32* addrspace(5)* %139, align 8, !tbaa !8 store i32 %138, i32* %140, align 4, !tbaa !4 ret void }
and here's the optimized IR after this patch; the alloca is eliminated:
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
target triple = "amdgcn-amd-amdhsa"

%struct.S = type { float*, [64 x float], i32 }

; Function Attrs: nofree norecurse nounwind writeonly
define protected amdgpu_kernel void @_Z6kernel1S(%struct.S addrspace(4)* nocapture readonly byref(%struct.S) align 8 %0) local_unnamed_addr #0 {
  %2 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 2
  %3 = load i32, i32 addrspace(4)* %2, align 8, !tbaa !5
  %4 = sext i32 %3 to i64
  %5 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 1, i64 %4
  %6 = load float, float addrspace(4)* %5, align 4, !tbaa !11
  %7 = getelementptr %struct.S, %struct.S addrspace(4)* %0, i64 0, i32 0
  %8 = load float*, float* addrspace(4)* %7, align 8, !tbaa !13
  store float %6, float* %8, align 4, !tbaa !11
  ret void
}
@jlebar -- FYI. This looks pretty similar to the issue you've reported recently for NVPTX.
It was exposed in our internal code, but the situation is almost identical to your IR example. https://godbolt.org/z/EPPn6h
For NVPTX we lower byval arguments as alloca because we must guarantee that params are never accessed via generic AS pointer.
Ideally we would want to avoid copying the argument into the local space and access the arg via parameter AS when we can.
This should use byref, but I don't think this should come at the cost of the promotion. I would still like to see this promotion occur for the in-memory byref type
Once we use byref, there is no way to preserve that in-memory byref type under the C model, as the argument will be treated as a local variable. The initial value with the coerced type won't be preserved after that. The same happens with statically indexed cases, but there the promotion helps build the chain from the initial value to the final use. If we ultimately cannot promote the alloca, we lose that information or cannot assume it.
It's not safe to do that in the frontend, as all arguments are also local variables and may be modified later. The assumption that pointers from arguments are GLOBAL or CONSTANT would no longer hold.
The type in the kernel argument byref does not need to match the alloca's type. The coercions can be inserted when initializing the argument alloca
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:0
Should also have a case with a single-element struct, which will be treated like a scalar.
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu:0
in test kernel8
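A hedged sketch of what such a single-element-struct case (along the lines of kernel8) might look like; the struct and member names here are illustrative and the actual test may differ:

// Illustrative only: a struct with a single pointer member is passed like a
// scalar by the argument-coercion logic, so it exercises a different lowering
// path than the larger aggregates discussed above.
struct SinglePtr {
  float *x;
};

__global__ void kernel8(SinglePtr a) {
  *a.x += 1.0f;
}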