This is an archive of the discontinued LLVM Phabricator instance.

[hip] Enable pointer argument lowering through coercing type.
ClosedPublic

Authored by hliao on Nov 4 2019, 2:05 PM.

Event Timeline

hliao created this revision.Nov 4 2019, 2:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2019, 2:05 PM
hliao added a comment.Nov 4 2019, 2:07 PM

It happens that Sam has a similar patch of this one. After discussion, we agreed that this patch addresses more cases found in the workloads. Thank Sam for the test case.

yaxunl added a comment.Nov 4 2019, 2:16 PM

we need a test for byval struct and array. better to have a struct containing an array which contain another struct which contains a pointer. thanks.

tra added inline comments.Nov 4 2019, 2:26 PM
clang/lib/CodeGen/TargetInfo.cpp
7688

Nit: for lower -> for lowering or that lowers

7690

I don't think you need a class here -- it just complicates calling of coerce().
I'd just make coerce() a member function.

7696

Nit: VM in VMCtx is not useful. Ctx or LLVMCtx would be better, IMO.

arsenm added a subscriber: arsenm.Nov 4 2019, 2:27 PM
arsenm added inline comments.
clang/lib/CodeGen/CGCall.cpp
1308–1310

I would somewhat prefer 2 dyn_cast and getAddressSpace, this is essentially isa + cast combo

clang/lib/CodeGen/TargetInfo.cpp
7719

No tests with arrays or structs?

It's also not immediately obvious to me that this optimization is still valid if the pointer is buried in a struct

hliao updated this revision to Diff 227784.Nov 4 2019, 2:31 PM

add the test case for struct.

hliao marked 2 inline comments as done.Nov 4 2019, 2:37 PM
hliao added inline comments.
clang/lib/CodeGen/TargetInfo.cpp
7719

the original generic kernel pointer promotion to a global one only handles the pointer directly passed. From a critical workload, I found quite a few cases where the global pointers are passed through a by-val struct. We didn't handle that yet. With this case, we could start to handle that.

7719

struct tests are added. From test cases, it seems to me that arry is not passed by value. I need to double-confirm.

hliao updated this revision to Diff 227787.Nov 4 2019, 2:56 PM

revise code following reviwers' comments.

hliao marked 4 inline comments as done.Nov 4 2019, 2:57 PM
tra added inline comments.Nov 4 2019, 3:04 PM
clang/lib/CodeGen/TargetInfo.cpp
7689

Now it could use a more descriptive name, too. :-)

You can now also make DefaultAS/GlobalAS into local variables as you have access to getContext() here.

hliao updated this revision to Diff 227795.Nov 4 2019, 3:52 PM
  • revise member function name.
  • add the test case for by-val array types.
hliao marked 3 inline comments as done.Nov 4 2019, 3:55 PM
hliao added inline comments.
clang/lib/CodeGen/TargetInfo.cpp
7689

name is changed but I want to leave DefaultAS and GlobalAS as parameters as they may vary from HIP to OpenCL and different targets. Even though it may be rare case, I want to avoid careless errors.

7719

a test case for arrary types is added.

tra added inline comments.Nov 4 2019, 4:23 PM
clang/lib/CodeGen/TargetInfo.cpp
7689

You may not need it, ever and it would be easy to add, but I'll leave it up to you.

If you do want to keep them as parameters you may want to consider renaming them to FromAS/ToAS.
There's nothing in the code that has anything to do with whether they are for generic/specific address space and the function name does not indicate the direction of coercion between them. It's very easy to pass them in the wrong order and not notice it. Making them local variables would avoid it. Giving names some sort of 'directionality' would at least give user a hint what goes where, even if it would not prevent making the error.

hliao updated this revision to Diff 227864.Nov 5 2019, 6:58 AM
hliao marked an inline comment as done.

revise parameter names

hliao marked 2 inline comments as done.Nov 5 2019, 7:04 AM
hliao added inline comments.
clang/lib/CodeGen/TargetInfo.cpp
7689

From the target device side, we have generic and global addresses. But, at the language level, we have opencl_global and cuda_device. Even though they map into the same address space, it would be very confusing if they are misused to initialize that address space numbers. That's why the original helper makes more sense to me and makes the code more readable. Anyway, I change the parameter names to give a clear direction.

tra accepted this revision.Nov 5 2019, 9:39 AM

Thank you!

clang/test/CodeGenCUDA/amdgpu-kernel-arg-pointer-type.cu
2–3

Perhaps we should add host-side test, too to make sure the pointers there do remain generic.

This revision is now accepted and ready to land.Nov 5 2019, 9:39 AM
hliao updated this revision to Diff 227908.Nov 5 2019, 10:02 AM
hliao marked an inline comment as done.

Add host-side checks.

hliao marked an inline comment as done.Nov 5 2019, 10:02 AM
This revision was automatically updated to reflect the committed changes.

I am a little bit concerned that user may have such code:

struct A { int *p; }
__global__ kernel(A a) {
  int x;
  a.p = &x;
  f(a);
}

@arsenm what happens if a private pointer is mis-used as a global pointer?

I am wondering if we should coerce byval struct kernel arg to global only if they are const, e.g.

__global__ kernel(const A a);

I understand this may lose performance. Or should we introduce an option to let user disable coerce of non-const struct kernel arg to global.

hliao added a comment.Nov 5 2019, 10:30 AM

I am a little bit concerned that user may have such code:

struct A { int *p; }
__global__ kernel(A a) {
  int x;
  a.p = &x;
  f(a);
}

@arsenm what happens if a private pointer is mis-used as a global pointer?

I am wondering if we should coerce byval struct kernel arg to global only if they are const, e.g.

__global__ kernel(const A a);

I understand this may lose performance. Or should we introduce an option to let user disable coerce of non-const struct kernel arg to global.

This should not be a concern. The coercing is only applied to the parameter itself. Within the function body, we still use the original struct A. The preparation in function prolog will copy that coerced argument into the original one (alloca-ed.) The modification of that parameter later will be applied to the original one due to the by-val nature.

A modified version of your code is compiled into the following code at O0:

define protected amdgpu_kernel void @_Z3foo1A(%struct.A.coerce %a.coerce) #0 {
entry:
  %a = alloca %struct.A, align 8, addrspace(5)
  %a1 = addrspacecast %struct.A addrspace(5)* %a to %struct.A*
  %x = alloca i32, align 4, addrspace(5)
  %x.ascast = addrspacecast i32 addrspace(5)* %x to i32*
  %agg.tmp = alloca %struct.A, align 8, addrspace(5)
  %agg.tmp.ascast = addrspacecast %struct.A addrspace(5)* %agg.tmp to %struct.A*
  %0 = bitcast %struct.A* %a1 to %struct.A.coerce*
  %1 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 0
  %2 = extractvalue %struct.A.coerce %a.coerce, 0
  store i32 addrspace(1)* %2, i32 addrspace(1)** %1, align 8
  %3 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 1
  %4 = extractvalue %struct.A.coerce %a.coerce, 1
  store i32 addrspace(1)* %4, i32 addrspace(1)** %3, align 8
  %p = getelementptr inbounds %struct.A, %struct.A* %a1, i32 0, i32 0
  store i32* %x.ascast, i32** %p, align 8
  %5 = bitcast %struct.A* %agg.tmp.ascast to i8*
  %6 = bitcast %struct.A* %a1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 16, i1 false)
  %7 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 0
  %8 = load i32*, i32** %7, align 8
  %9 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 1
  %10 = load i32*, i32** %9, align 8
  call void @_Z1f1A(i32* %8, i32* %10) #3
  ret void
}

The modification of parameter a is applied the alloca-ed one.

I am a little bit concerned that user may have such code:

struct A { int *p; }
__global__ kernel(A a) {
  int x;
  a.p = &x;
  f(a);
}

@arsenm what happens if a private pointer is mis-used as a global pointer?

I am wondering if we should coerce byval struct kernel arg to global only if they are const, e.g.

__global__ kernel(const A a);

I understand this may lose performance. Or should we introduce an option to let user disable coerce of non-const struct kernel arg to global.

This should not be a concern. The coercing is only applied to the parameter itself. Within the function body, we still use the original struct A. The preparation in function prolog will copy that coerced argument into the original one (alloca-ed.) The modification of that parameter later will be applied to the original one due to the by-val nature.

A modified version of your code is compiled into the following code at O0:

define protected amdgpu_kernel void @_Z3foo1A(%struct.A.coerce %a.coerce) #0 {
entry:
  %a = alloca %struct.A, align 8, addrspace(5)
  %a1 = addrspacecast %struct.A addrspace(5)* %a to %struct.A*
  %x = alloca i32, align 4, addrspace(5)
  %x.ascast = addrspacecast i32 addrspace(5)* %x to i32*
  %agg.tmp = alloca %struct.A, align 8, addrspace(5)
  %agg.tmp.ascast = addrspacecast %struct.A addrspace(5)* %agg.tmp to %struct.A*
  %0 = bitcast %struct.A* %a1 to %struct.A.coerce*
  %1 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 0
  %2 = extractvalue %struct.A.coerce %a.coerce, 0
  store i32 addrspace(1)* %2, i32 addrspace(1)** %1, align 8
  %3 = getelementptr inbounds %struct.A.coerce, %struct.A.coerce* %0, i32 0, i32 1
  %4 = extractvalue %struct.A.coerce %a.coerce, 1
  store i32 addrspace(1)* %4, i32 addrspace(1)** %3, align 8
  %p = getelementptr inbounds %struct.A, %struct.A* %a1, i32 0, i32 0
  store i32* %x.ascast, i32** %p, align 8
  %5 = bitcast %struct.A* %agg.tmp.ascast to i8*
  %6 = bitcast %struct.A* %a1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %5, i8* align 8 %6, i64 16, i1 false)
  %7 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 0
  %8 = load i32*, i32** %7, align 8
  %9 = getelementptr inbounds %struct.A, %struct.A* %agg.tmp.ascast, i32 0, i32 1
  %10 = load i32*, i32** %9, align 8
  call void @_Z1f1A(i32* %8, i32* %10) #3
  ret void
}

The modification of parameter a is applied the alloca-ed one.

OK. Thanks for clarification.