This is an archive of the discontinued LLVM Phabricator instance.

[nvptx] Skip alloca for read-only byval arguments.
AbandonedPublic

Authored by hliao on Nov 21 2020, 11:51 PM.

Details

Reviewers
tra
jlebar
Summary
  • Once a byval argument is attributed with read-only, there's no store into that argument and it's safe to skip generating alloca to match read-only input parameter space property. Cast that generic pointer to the parameter space and back so that the address space inference pass could infer the correct parameter space.

Diff Detail

Event Timeline

hliao created this revision.Nov 21 2020, 11:51 PM
hliao requested review of this revision.Nov 21 2020, 11:51 PM

It turns out that the simplest way is to skip generating alloca once that byval argument is readonly. As readonly will be attributed once there's no write to that argument, it's safe to just cast that pointer to the parameter space if it has readonly. Basically, that argument lowering pass does a similar to D91590 but, instead, applies that in the backend. I verified that, for that simple test CUDA code, it would generate the same SASS.

This looks really simple, which is awesome. I am enthusiastic. But I am worried it may not be correct.

AIUI params are special in that they *must* be read from the param address space. It is illegal to do a generic load of a param.

So this change is correct only if we can guarantee that address space inference will infer the specific address space for all uses of the pointer.

But address space inference is not guaranteed. For example, you could select on two pointers of two different address spaces. So long as you only ever read from these pointers, the arg can still be marked as ReadOnly. But with this patch, we'd end up doing a generic load from the param space, which would be illegal.

Take it all with a grain of salt since I've also been out of the game for a while.

llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
166

nit: s/could/can/

hliao added a comment.Nov 22 2020, 9:20 AM

This looks really simple, which is awesome. I am enthusiastic. But I am worried it may not be correct.

AIUI params are special in that they *must* be read from the param address space. It is illegal to do a generic load of a param.

So this change is correct only if we can guarantee that address space inference will infer the specific address space for all uses of the pointer.

But address space inference is not guaranteed. For example, you could select on two pointers of two different address spaces. So long as you only ever read from these pointers, the arg can still be marked as ReadOnly. But with this patch, we'd end up doing a generic load from the param space, which would be illegal.

Take it all with a grain of salt since I've also been out of the game for a while.

readonly is marked in the middle-end (function and argument attribute deduction) and way before the backend. That deduction looks through all relevant users including PHI, addrspacecast and calls. I don't believe there's any exception to prove that deduction wrong.
The address space inference here only refers to the one in the backend directly after this argument lowering gpass. Thep helps translate that argument loading into the correct one using parameter space and doesn't help the aforementioned argument attribute deduction.

I don't believe there's any exception to prove deduction [of the readonly attribute] wrong.

Understood.

The address space inference here only refers to the one in the backend directly after this argument lowering gpass.

Also understood.

This isn't speaking to my concern, though.

Suppose we have

__global__ void foo(int x, const int* y, int* out, bool flag) {
  int* ptr = flag ? &x : y;
  *out = *ptr;
}

In this case we can say with confidence that x is readonly.

But address space inference cannot infer the address space of ptr (how could it?). Therefore we will do a generic load, which is wrong.

I don't believe there's any exception to prove deduction [of the readonly attribute] wrong.

Understood.

The address space inference here only refers to the one in the backend directly after this argument lowering gpass.

Also understood.

This isn't speaking to my concern, though.

Suppose we have

__global__ void foo(int x, const int* y, int* out, bool flag) {
  int* ptr = flag ? &x : y;
  *out = *ptr;
}

In this case we can say with confidence that x is readonly.

But address space inference cannot infer the address space of ptr (how could it?). Therefore we will do a generic load, which is wrong.

I see your point. PTX doesn't state the generic addressing could be performed on that parameter space. But, that case could be excluded with the extra check on how that parameter space pointer is used. In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

I think an allowlist might be more appropriate than a denylist. Rather than, anything other than PHI and SELECT, could it be, if it's only transitively used by gep and load we're good?

I am not 100% sure even that works, though. The real problem is that this pass is trying to reason about what the addrspace inference pass is capable of. We can only do the transformation if here if we're positive that addrspace inference will eliminate all generic loads from the arg. That's a layering violation and ultimately is fragile.

hliao abandoned this revision.Nov 22 2020, 3:02 PM

In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

I think an allowlist might be more appropriate than a denylist. Rather than, anything other than PHI and SELECT, could it be, if it's only transitively used by gep and load we're good?

I am not 100% sure even that works, though. The real problem is that this pass is trying to reason about what the addrspace inference pass is capable of. We can only do the transformation if here if we're positive that addrspace inference will eliminate all generic loads from the arg. That's a layering violation and ultimately is fragile.

yeah, it seems the other approach is more appropriate to place the alloca in the frontend and that explicitly copy from the parameter space to the private space.

tra added a comment.Nov 30 2020, 10:43 AM

yeah, it seems the other approach is more appropriate to place the alloca in the frontend and that explicitly copy from the parameter space to the private space.

+1. Inserting alloca+copy early would be beneficial in general -- it will face more optimization opportunities which should be possible to see-through the copy in some cases.
Adding readonly on the original argument would probably be good, too.