Event Timeline
It happens that Sam has a similar patch to this one. After discussion, we agreed that this patch addresses more of the cases found in the workloads. Thanks to Sam for the test case.
We need a test for byval struct and array. It would be better to have a struct containing an array which contains another struct which contains a pointer. Thanks.
clang/lib/CodeGen/CGCall.cpp
1308–1310 | I would somewhat prefer 2 dyn_casts and getAddressSpace; this is essentially an isa + cast combo.
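For illustration, a minimal sketch of the suggested pattern, assuming the current code does an isa<> check followed by a cast<> on the source and destination pointer types. The variable names here are invented for the sketch and are not the actual CGCall.cpp code:

    // Hypothetical sketch only: use dyn_cast<> once per type instead of
    // isa<> followed by cast<>, then compare the address spaces.
    if (auto *SrcPtrTy = llvm::dyn_cast<llvm::PointerType>(SrcTy)) {
      if (auto *DstPtrTy = llvm::dyn_cast<llvm::PointerType>(DstTy)) {
        if (SrcPtrTy->getAddressSpace() != DstPtrTy->getAddressSpace()) {
          // ... handle the address-space mismatch (e.g. addrspacecast) ...
        }
      }
    }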
clang/lib/CodeGen/TargetInfo.cpp
7719 | No tests with arrays or structs? It's also not immediately obvious to me that this optimization is still valid if the pointer is buried in a struct.
clang/lib/CodeGen/TargetInfo.cpp
7719 | The original promotion of generic kernel pointers to global ones only handles pointers passed directly. In a critical workload, I found quite a few cases where global pointers are passed through a by-val struct, which we did not handle yet. With this change, we can start to handle that (see the sketch below).
7719 | Struct tests are added. From the test cases, it seems to me that an array is not passed by value; I need to double-confirm.
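For context, a minimal HIP/CUDA sketch of the kind of case described above. The struct and kernel names are made up for illustration and are not the actual workload or test code:

    // A global (device-memory) pointer reaches the kernel only through a
    // by-value struct argument, so the old direct-pointer promotion missed it.
    struct Wrapper {
      float *data; // points to memory allocated by the host in global memory
      int n;
    };

    __global__ void scale(Wrapper w, float factor) {
      for (int i = 0; i < w.n; ++i)
        w.data[i] *= factor; // this patch lets w.data be coerced to a global pointer
    }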
clang/lib/CodeGen/TargetInfo.cpp
7689 | Now it could use a more descriptive name, too. :-) You can now also make DefaultAS/GlobalAS into local variables, as you have access to getContext() here.
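For illustration, a hedged sketch of what making DefaultAS/GlobalAS local variables could look like. It assumes the surrounding function is a member of the AMDGPU ABI info class in TargetInfo.cpp with access to getContext(); the exact surrounding code is not shown here and may differ from the actual patch:

    // Sketch only, not the actual patch. On AMDGPU, LangAS::opencl_global and
    // LangAS::cuda_device map to the same target address space.
    unsigned DefaultAS = getContext().getTargetAddressSpace(LangAS::Default);
    unsigned GlobalAS = getContext().getTargetAddressSpace(LangAS::cuda_device);
    if (auto *PtrTy = llvm::dyn_cast<llvm::PointerType>(Ty))
      if (PtrTy->getAddressSpace() == DefaultAS)
        Ty = llvm::PointerType::get(PtrTy->getElementType(), GlobalAS);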
clang/lib/CodeGen/TargetInfo.cpp
7689 | You may not ever need it, and it would be easy to add later, but I'll leave it up to you. If you do want to keep them as parameters, you may want to consider renaming them to FromAS/ToAS.
clang/lib/CodeGen/TargetInfo.cpp
7689 | On the target device side, we have generic and global addresses. But at the language level, we have opencl_global and cuda_device. Even though they map to the same address space, it would be very confusing if they were misused to initialize those address space numbers. That's why the original helper makes more sense to me and makes the code more readable. Anyway, I changed the parameter names to give a clearer indication.
Thank you!
clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
2–3 | Perhaps we should add a host-side test, too, to make sure the pointers there do remain generic.
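A hedged sketch of what such a host-side check could look like in the test file; the exact RUN line, language mode, and check prefixes in the actual test may differ:

    // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm -x hip %s -o - | FileCheck --check-prefix=HOST %s
    // On the host side, the kernel pointer arguments should stay generic, i.e.
    // no addrspace(1) pointers should appear in the host IR for these kernels.
    // HOST-NOT: addrspace(1)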
I am a little bit concerned that users may have code such as:

    struct A { int *p; };
    __global__ void kernel(A a) {
      int x;
      a.p = &x;
      f(a);
    }
@arsenm what happens if a private pointer is misused as a global pointer?
I am wondering if we should coerce a byval struct kernel arg to global only if it is const, e.g.

    __global__ void kernel(const A a);

I understand this may lose performance. Or should we introduce an option to let users disable coercion of non-const struct kernel args to global?
This should not be a concern. The coercion is only applied to the parameter itself. Within the function body, we still use the original struct A. The preparation code in the function prologue copies the coerced argument into the original (alloca'ed) one, so any later modification of that parameter is applied to the original due to its by-value nature.
A modified version of your code compiles to the following IR at -O0:
    define protected amdgpu_kernel void @_Z3foo1A(%struct.A.coerce %a.coerce) #0 {
    entry:
      %a = alloca %struct.A, align 8, addrspace(5)
      %a1 = addrspacecast %struct.A addrspace(5)* %a to %struct.A*
      %x = alloca i32, align 4, addrspace(5)
      %x.ascast = addrspacecast i32 addrspace(5)* %x to i32*
      %agg.tmp = alloca %struct.A, align 8, addrspace(5)
      %agg.tmp.ascast = addrspacecast %struct.A addrspace(5)* %agg.tmp to %struct.A*
      %0 = bitcast %struct.A* %a1 to %struct.A.coerce*
      %1 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 0
      %2 = extractvalue %struct.A.coerce %a.coerce, 0
      store i32 addrspace(1)* %2, i32 addrspace(1)** %1, align 8
      %3 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 1
      %4 = extractvalue %struct.A.coerce %a.coerce, 1
      store i32 addrspace(1)* %4, i32 addrspace(1)** %3, align 8
      %p = getelementptr inbounds %struct.A, %struct.A* %a1, i32 0, i32 0
      store i32* %x.ascast, i32** %p, align 8
      %5 = bitcast %struct.A* %agg.tmp.ascast to i8*
      %6 = bitcast %struct.A* %a1 to i8*
      call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 16, i1 false)
      %7 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 0
      %8 = load i32*, i32** %7, align 8
      %9 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 1
      %10 = load i32*, i32** %9, align 8
      call void @_Z1f1A(i32* %8, i32* %10) #3
      ret void
    }
The modification of parameter a is applied to the alloca'ed one.
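For reference, a source-level sketch that is consistent with the IR above (a hedged reconstruction of the "modified version" of the code; the names and second member are assumptions, not the actual test source):

    // struct A has two pointer members, matching the two fields of %struct.A.
    struct A { int *p; int *q; };

    __device__ void f(A a);

    __global__ void foo(A a) {
      int x;
      a.p = &x; // writes to the local alloca'ed copy, not the coerced argument
      f(a);     // passes the (possibly modified) local copy by value
    }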