Page MenuHomePhabricator

clang: Use byref for aggregate kernel arguments
ClosedPublic

Authored by arsenm on May 11 2020, 1:55 PM.

Details

Summary

Add address space to indirect abi info and use it for kernels.

Previously, indirect arguments assumed assumed a stack passed object
in the alloca address space using byval. A stack pointer is unsuitable
for kernel arguments, which are passed in a separate, constant buffer
with a different address space.

Start using the new byref for aggregate kernel arguments. Previously
these were emitted as raw struct arguments, and turned into loads in
the backend. These will lower identically, although with byref you now
have the option of applying an explicit alignment. In the future, a
reasonable implementation would use byref for all kernel arguments
(this would be a practical problem at the moment due to losing things
like noalias on pointer arguments).

This is mostly to avoid fighting the optimizer's treatment of
aggregate load/store. SROA and instcombine both turn aggregate loads
and stores into a long sequence of element loads and stores, rather
than the optimizable memcpy I would expect in this situation. Now an
explicit memcpy will be introduced up-front which is better understood
and helps eliminate the alloca in more situations.

This skips using byref in the case where HIP kernel pointer arguments
in structs are promoted to global pointers. At minimum an additional
patch is needed to allow coercion with indirect arguments.

Diff Detail

Event Timeline

arsenm created this revision.May 11 2020, 1:55 PM
arsenm updated this revision to Diff 263274.May 11 2020, 2:02 PM

Forgot to commit a new test

Typo in commit message: "Previously, indirect arguments assumed assumed".

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

Kernel arguments aren't directly visible to the user program, so this is an implementation detail. The user variable is the alloca that we need to explicitly copy to as you mentioned, which this patch does. It's possible to poke at these through intrinsics, but the kernel address space isn't part of the language. POD type or not, we're going to have to do something to unload values from the constant buffer onto the stack.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

In this case there is never a call. Only the callee read exists as this is the entry point. What would the alternative be? Add another flavor of byval attribute that's nearly identical?

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

Kernel arguments aren't directly visible to the user program, so this is an implementation detail. The user variable is the alloca that we need to explicitly copy to as you mentioned, which this patch does.

It's usually a goal of indirect arguments to not need this copy. We usually bind parameters directly to the pointer that was passed in.

It's possible to poke at these through intrinsics, but the kernel address space isn't part of the language. POD type or not, we're going to have to do something to unload values from the constant buffer onto the stack.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

In this case there is never a call. Only the callee read exists as this is the entry point. What would the alternative be? Add another flavor of byval attribute that's nearly identical?

It's hard to answer this because I'm not sure what you're trying to accomplish by marking the arguments byval.

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

Kernel arguments aren't directly visible to the user program, so this is an implementation detail. The user variable is the alloca that we need to explicitly copy to as you mentioned, which this patch does.

It's usually a goal of indirect arguments to not need this copy. We usually bind parameters directly to the pointer that was passed in.

It's possible to poke at these through intrinsics, but the kernel address space isn't part of the language. POD type or not, we're going to have to do something to unload values from the constant buffer onto the stack.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

In this case there is never a call. Only the callee read exists as this is the entry point. What would the alternative be? Add another flavor of byval attribute that's nearly identical?

It's hard to answer this because I'm not sure what you're trying to accomplish by marking the arguments byval.

The short answer is I'm trying to avoid fighting the optimizer's handling of aggregate load and stores. Passes already understand byval, but are actively harmful to aggregate loads and stores. Having explicit loads in the IR also brings it closer to what these are ultimately lowered to. Currently we emit a raw struct argument, which produces an aggregate store to alloca. Both SROA and instcombine unhelpfully split up the aggregate store, which doesn't optimize nicely like the memcpy from constant memory to alloca. The end result is we end up with more allocas and copies from constant memory that could have just read directly from the constant pointer.

The most honest calling convention handling would be to have an explicit constant pointer that all arguments are read from, and the function would have an empty argument list. This is somewhat impractical, since we still need to track the argument sizes and offsets somehow in the IR. It also becomes much more difficult to emit nicely annotated IR, since things like noalias still largely expect to be attached to a function argument. We do have an optimization pass that rewrites the arguments to look like this to enable vectorization later in the backend, but it suffers from losing useful alias annotations.

Okay. So the only real ABI here is the layout of the memory that the arguments are actually written into? And that memory needs to be treated as constant?

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

Okay. So the only real ABI here is the layout of the memory that the arguments are actually written into? And that memory needs to be treated as constant?

Yes, the actual kernel ABI is supposed to invisible.

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

You are allowed to have readonly on a byval pointer argument, in which case optimizations wouldn't be allowed to write into it. Is just adding readonly parameter attributes sufficient? It would be somewhat contrived, but could also define byval as constant if it's not in the alloca address space.

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

You are allowed to have readonly on a byval pointer argument, in which case optimizations wouldn't be allowed to write into it. Is just adding readonly parameter attributes sufficient? It would be somewhat contrived, but could also define byval as constant if it's not in the alloca address space.

Another option is to add a form of indirect argument passing that just from a constant offset from an intrinsic call. We would still need to leave an unused struct argument in the function argument list for size keeping, and lose the ability to add an explicit alignment. We would also be mixing multiple ways of accessing arguments which is gross but survivable.

Okay. So the only real ABI here is the layout of the memory that the arguments are actually written into? And that memory needs to be treated as constant?

Yes, the actual kernel ABI is supposed to invisible.

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

You are allowed to have readonly on a byval pointer argument, in which case optimizations wouldn't be allowed to write into it. Is just adding readonly parameter attributes sufficient? It would be somewhat contrived, but could also define byval as constant if it's not in the alloca address space.

byval is fundamentally about expressing that something is passed on the stack. If you want an indirect readonly noalias argument, you can make one; however, to convince IRGen to do it, you really need to think of this as a new kind of argument-passing, because "pass the address of an immutable object that the callee can't modify" is not something that we normally need to do in calling-convention lowering. That feels like a lot of complexity to solve a rather narrow problem.

A completely different approach: OpenMP has to solve some very similar problems and just lowers them completely in the frontend; have you considered just doing that? Kernels need a ton of special-case handling anyway, and IIUC you can never optimize over the boundary anyway.

A completely different approach: OpenMP has to solve some very similar problems and just lowers them completely in the frontend; have you considered just doing that? Kernels need a ton of special-case handling anyway, and IIUC you can never optimize over the boundary anyway.

Yes, this is what I was describing. The problem is this ends up hurting optimizations because now we end up losing things like noalias on the pointer arguments. I also would still need to track the argument size/offset/align information in the IR, but for that we could keep on doing what we're doing and just never use the kernel arguments. One of the advantages of byval was an explicit align field to track this

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory. Hence, the readonly is pretty useless *in the callee* if other accesses can write the memory. The readonly is useful in the caller though, usually if we have basically noalias information there (e.g., it is an alloca). Noalias is useful in the callee regardless of readonly but even better with.

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readone on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. That are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree but except address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we talk about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year, it was a big one and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readone on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. That are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree but except address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we talk about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year, it was a big one and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?

I thought I remembered Hal doing a lot of work on local restrict a few years ago. I'm probably just misremembering, or I didn't realize that the work never landed or got pulled out later.

Okay. So where we're at is that you'd like to add a new argument-passing convention that's basically "passed indirect but immutable", implying that the frontend has to copy it in order to create the mutable local parameter. That's not actually a totally ridiculous convention in principle, although it has poor worst-case behavior (copies on both sides), and that happens to be what the frontend will often have to conservatively emit. I would still prefer not to add new argument-passing conventions just to satisfy short-term optimization needs, though. Are there any other reasonable options?

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readone on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. That are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree but except address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we talk about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year, it was a big one and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?

I thought I remembered Hal doing a lot of work on local restrict a few years ago. I'm probably just misremembering, or I didn't realize that the work never landed or got pulled out later.

Okay. So where we're at is that you'd like to add a new argument-passing convention that's basically "passed indirect but immutable", implying that the frontend has to copy it in order to create the mutable local parameter. That's not actually a totally ridiculous convention in principle, although it has poor worst-case behavior (copies on both sides), and that happens to be what the frontend will often have to conservatively emit. I would still prefer not to add new argument-passing conventions just to satisfy short-term optimization needs, though. Are there any other reasonable options?

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program. There is no caller function, and in the future I would like to make a call to amdgpu_kernel an IR verifier error (technically OpenCL device enqueue is an exception to this, but we don't treat this as a call. Instead there's a lot of library magic to invoke the kernel. From the perspective of the callee nothing changes, since it's still not allowed to modify the incoming argument buffer or aware it was called this way).

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.
  2. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information)
  3. Keep clang IR generation unchanged, but move the pass that lowers arguments to loads earlier and hack out aggregate IR loads before SROA makes things worse. This is really just a kludgier version of option 2. We do ultimately do this late in the backend to enable vectorization, but it does seem to make the middle end optimizer unhappy

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program. There is no caller function, and in the future I would like to make a call to amdgpu_kernel an IR verifier error (technically OpenCL device enqueue is an exception to this, but we don't treat this as a call. Instead there's a lot of library magic to invoke the kernel. From the perspective of the callee nothing changes, since it's still not allowed to modify the incoming argument buffer or aware it was called this way).

Did you consider callback annotation for the device enqueue call? While that might not change anything *now*, I'm expecting interesting optimization opportunities there at some point "soon".

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

And, as was noted in the byval lang ref patch (D79636), there is a reasonable argument to be made to phase-out byval in favor of some explicit copying. If that happens, this solution should not be "the last byval user". Also, byval arguments could be used as scratchpad by smart optimizations. I somehow don't believe this is a great choice but I can by now see that the others are neither.

  1. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information)
  2. Keep clang IR generation unchanged, but move the pass that lowers arguments to loads earlier and hack out aggregate IR loads before SROA makes things worse. This is really just a kludgier version of option 2. We do ultimately do this late in the backend to enable vectorization, but it does seem to make the middle end optimizer unhappy

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readone on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. That are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree but except address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we talk about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year, it was a big one and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?

I thought I remembered Hal doing a lot of work on local restrict a few years ago. I'm probably just misremembering, or I didn't realize that the work never landed or got pulled out later.

Okay. So where we're at is that you'd like to add a new argument-passing convention that's basically "passed indirect but immutable", implying that the frontend has to copy it in order to create the mutable local parameter. That's not actually a totally ridiculous convention in principle, although it has poor worst-case behavior (copies on both sides), and that happens to be what the frontend will often have to conservatively emit. I would still prefer not to add new argument-passing conventions just to satisfy short-term optimization needs, though. Are there any other reasonable options?

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program.

I'm definitely not going to let you add a new "generic" argument-passing convention and then only implement exactly the parts you need; that's just adding technical debt.

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

I assume you generate a byval caller in some later pass, which will then implicitly copy. Or do you lower byval in a nonstandard way in your backend?

  1. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information)

I guess passing aggregate arguments using a normal indirect convention has this same tracking problem. You'd also have to copy on the caller side to maintain semantics, which is probably hard to eliminate, but I would guess byval has the same problem?

  1. Keep clang IR generation unchanged, but move the pass that lowers arguments to loads earlier and hack out aggregate IR loads before SROA makes things worse. This is really just a kludgier version of option 2. We do ultimately do this late in the backend to enable vectorization, but it does seem to make the middle end optimizer unhappy

Yeah, Clang tries to avoid first-class aggregates for exactly this reason, LLVM does a terrible job generating code for them.

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program. There is no caller function, and in the future I would like to make a call to amdgpu_kernel an IR verifier error (technically OpenCL device enqueue is an exception to this, but we don't treat this as a call. Instead there's a lot of library magic to invoke the kernel. From the perspective of the callee nothing changes, since it's still not allowed to modify the incoming argument buffer or aware it was called this way).

Did you consider callback annotation for the device enqueue call? While that might not change anything *now*, I'm expecting interesting optimization opportunities there at some point "soon".

I'm not sure what you mean by this exactly. I'm assuming this means "move device enqueue implementation into the backend". I don't know all the details of how device enqueue is implemented, but there's so much code in the library to support this, I don't think this would end up looking like a normal calling convention lowering. It would guess it would end up looking like an IR pass that would need a calling convention with this restriction, and then a pass to insert the code the library uses now.

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

And, as was noted in the byval lang ref patch (D79636), there is a reasonable argument to be made to phase-out byval in favor of some explicit copying. If that happens, this solution should not be "the last byval user". Also, byval arguments could be used as scratchpad by smart optimizations. I somehow don't believe this is a great choice but I can by now see that the others are neither.

I assume this would need replacement with a slew of other attributes to capture the type/size/alignment/dereferencability, or a new attribute entirely?

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program.

I'm definitely not going to let you add a new "generic" argument-passing convention and then only implement exactly the parts you need; that's just adding technical debt.

I'm not sure what you mean here. I don't really want or need a totally new generic argument passing convention. Not every IR feature is expected to be implemented by every backend. For instance, inalloca/preallocated exist solely to implement the x86 windows ABI and no other target tries to handle them. This is just another flavor of the same fundamental concept of a parameter attribute needed for a target specific calling convention.

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

I assume you generate a byval caller in some later pass, which will then implicitly copy. Or do you lower byval in a nonstandard way in your backend?

No, the caller is only an external driver of some kind. Since the address spaces don't match (and the source address space is read only), anything that behaves like a stack byval doesn't really make sense from the beginning which is why this patch changes the address space. If we were to leave this as a stack address space, we would have to add an alloca and insert a memcpy anyway. This would still leave an unusable byval argument around a pass could still theoretically try to write into.

D79630 adds the lowering that uses this. Because reasons (some of which I referenced in the previous comments), we have 3 different implementations of the ABI. The byval pointer value is simply replaced with a new pointer derived from a constant offset from an intrinsic call (this is most obvious in the AMDGPULowerKernelArguments IR version)

  1. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information)

I guess passing aggregate arguments using a normal indirect convention has this same tracking problem. You'd also have to copy on the caller side to maintain semantics, which is probably hard to eliminate, but I would guess byval has the same problem?

Isn't using byval the normal indirect convention? Really the only properties I want that byval gives me is a pointee size and alignment. The most abstract attribute I could probably come up with is a pointee(type) annotation that I could use, without the stack copying implications of byval

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program.

I'm definitely not going to let you add a new "generic" argument-passing convention and then only implement exactly the parts you need; that's just adding technical debt.

I'm not sure what you mean here. I don't really want or need a totally new generic argument passing convention. Not every IR feature is expected to be implemented by every backend. For instance, inalloca/preallocated exist solely to implement the x86 windows ABI and no other target tries to handle them. This is just another flavor of the same fundamental concept of a parameter attribute needed for a target specific calling convention.

I mean the Clang code for supporting this new convention, not the IR support. Of course LLVM has target-specific attributes.

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize the 2.5 options I've come up with are

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

I assume you generate a byval caller in some later pass, which will then implicitly copy. Or do you lower byval in a nonstandard way in your backend?

No, the caller is only an external driver of some kind. Since the address spaces don't match (and the source address space is read only), anything that behaves like a stack byval doesn't really make sense from the beginning which is why this patch changes the address space. If we were to leave this as a stack address space, we would have to add an alloca and insert a memcpy anyway. This would still leave an unusable byval argument around a pass could still theoretically try to write into.

D79630 adds the lowering that uses this. Because reasons (some of which I referenced in the previous comments), we have 3 different implementations of the ABI. The byval pointer value is simply replaced with a new pointer derived from a constant offset from an intrinsic call (this is most obvious in the AMDGPULowerKernelArguments IR version)

  1. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information)

I guess passing aggregate arguments using a normal indirect convention has this same tracking problem. You'd also have to copy on the caller side to maintain semantics, which is probably hard to eliminate, but I would guess byval has the same problem?

Isn't using byval the normal indirect convention? Really the only properties I want that byval gives me is a pointee size and alignment. The most abstract attribute I could probably come up with is a pointee(type) annotation that I could use, without the stack copying implications of byval

byval is not an indirect convention. It looks like one in IR, but it means "it's passed in the argument area of the stack", which is essentially more like being passed directly than otherwise.

In IR, the normal indirect convention is just to pass a pointer without any extra treatment. Clang does set nonnull dereferenceable(size) align 4 for optimization purposes, though.

arsenm added a comment.Jun 2 2020, 3:12 PM

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program.

I'm definitely not going to let you add a new "generic" argument-passing convention and then only implement exactly the parts you need; that's just adding technical debt.

I'm not sure what you mean here. I don't really want or need a totally new generic argument passing convention. Not every IR feature is expected to be implemented by every backend. For instance, inalloca/preallocated exist solely to implement the x86 windows ABI and no other target tries to handle them. This is just another flavor of the same fundamental concept of a parameter attribute needed for a target specific calling convention.

I mean the Clang code for supporting this new convention, not the IR support. Of course LLVM has target-specific attributes.

I think this is converging to adding a new IR attribute that essentially just provides the pointee type for ABI purposes. I guess my name ideas for this would be "indirect", "value", "memoryvalue", "abitype"?

I forget that frontends exist sometimes, so I'm not sure I understand your clang side objection.

arsenm added a comment.Jun 2 2020, 3:17 PM

I think this is converging to adding a new IR attribute that essentially just provides the pointee type for ABI purposes. I guess my name ideas for this would be "indirect", "value", "memoryvalue", "abitype"?

My main question for a new attribute is whether it needs to be explicitly read only. I think the answer is no, since you can't ordinarily write to any random pointer argument

I think this is converging to adding a new IR attribute that essentially just provides the pointee type for ABI purposes. I guess my name ideas for this would be "indirect", "value", "memoryvalue", "abitype"?

My main question for a new attribute is whether it needs to be explicitly read only. I think the answer is no, since you can't ordinarily write to any random pointer argument

I think this is "in practice" correct (assuming you don't see a write to that location in which case you can do some crazy stuff).

I think this is converging to adding a new IR attribute that essentially just provides the pointee type for ABI purposes.

If you don't feel like you can rely on the dereferenceable attributes, then I guess so, yes.

arsenm updated this revision to Diff 279635.Jul 21 2020, 2:48 PM
arsenm retitled this revision from clang: Add address space to indirect abi info and use it for kernels to clang: Use byref for kernel arguments.
arsenm edited the summary of this revision. (Show Details)

Switch to byref. Doesn't handle the HIP kernel argument promotion case, since that requires more work

arsenm retitled this revision from clang: Use byref for kernel arguments to clang: Use byref for aggregate kernel arguments.Jul 21 2020, 2:49 PM

Arguably we should add this attribute to all indirect arguments. I can understand not wanting to update all the test cases, but you could probably avoid adding a new IndirectByRef kind of ABIArgInfo by treating kernels specially in ConstructAttributeList.

Or, sorry, I forget — is this semantically necessary because the argument is to constant memory and the callee has to copy it to form the mutable local? If so, I think (1) the above statement about theoretically using byref on all arguments still applies and (2) we do need a new ABIArgInfo kind, but we should name it something like IndirectAliased.

Arguably we should add this attribute to all indirect arguments. I can understand not wanting to update all the test cases, but you could probably avoid adding a new IndirectByRef kind of ABIArgInfo by treating kernels specially in ConstructAttributeList.

Or, sorry, I forget — is this semantically necessary because the argument is to constant memory and the callee has to copy it to form the mutable local? If so, I think (1) the above statement about theoretically using byref on all arguments still applies and (2) we do need a new ABIArgInfo kind, but we should name it something like IndirectAliased.

Yes, it's semantically needed to insert the copy from constant memory. I originally interpreted a copy as necessary if the indirect addrspace did not match the stack address space, which is a sort of roundabout way of achieving the same thing

arsenm updated this revision to Diff 280607.Jul 24 2020, 3:32 PM

Use distinct ABIArgInfo::Kind. Also don't enable this for OpenCL yet, since that requires fixing the callable kernel workaround

rjmccall added inline comments.Jul 24 2020, 9:30 PM
clang/include/clang/CodeGen/CGFunctionInfo.h
52

Hmm. I guess there's actually two different potential conventions here:

  • The caller can provide the address of a known-immutable object that has the right value, and the callee has to copy it if it needs the object to have a unique address or wants to mutate it.
  • The caller can provide the address of *any* object that has the right value, and the callee has to copy it if it needs the object to have a unique address, wants to mutate it, or needs the value to stick around across call boundaries.

The advantage of the second is that IRGen could avoid some copies on the caller side, which could be advantageous when the callee is sufficiently trivial. The disadvantage is that the callee would have to defensively copy in more situations.

Probably we should use the former. Please be explicit about it, though:

Similar to Indirect, but the pointer may be to an object that is otherwise
referenced.  The object is known to not be modified through any other
references for the duration of the call, and the callee must not itself
modify the object.  Because C allows parameter variables to be modified
and guarantees that they have unique addresses, the callee must
defensively copy the object into a local variable if it might be modified or
its address might be compared.  Since those are uncommon, in principle
this convention allows programs to avoid copies in more situations.
However, it may introduce *extra* copies if the callee fails to prove that
a copy is unnecessary and the caller naturally produces an unaliased
object for the argument.
clang/lib/CodeGen/CGCall.cpp
2216

Please add a TODO here that we could add the byref attribute if we're willing to update the test cases. Maybe whoever does that can add alignments at the same time.

2465

"copy it to ensure that the parameter variable is mutable and has a unique address, as C requires".

I've wanted Sema to track whether local variables are mutated or have their address taken for a long time; maybe someday we can do that and then take advantage of it here. Just a random thought, sorry.

4700

Please just make this use the Indirect code. If we gave it special attention, we could optimize it better, but conservatively doing what Indirect does should still work.

clang/lib/CodeGen/TargetInfo.cpp
1997

In principle, this can be inreg just as much as Indirect can.

8814

I don't see why you'd use byref when promoting pointers in structs. Maybe it works as a hack with your backend, but it seems *extremely* special-case and should not be hacked into the general infrastructure.

9382

No reason not to use the Indirect code here.

9752

Same.

arsenm updated this revision to Diff 281629.Jul 29 2020, 9:08 AM
arsenm marked 5 inline comments as done.

Address comments

clang/lib/CodeGen/TargetInfo.cpp
1997

The IR verifier currently will reject byref + inreg

8814

The whole point is to reinterpret the address space of the pointers in memory since we know if it's a kernel argument it has to be an addrspace(1) pointer or garbage. We can't infer the address space of a generic pointer loaded from memory.

byref doesn't change that, it just makes the fact that these are passed in memory explicit

9382

I generally don't like speculatively adding handling for features I can't write a testcase for, but I've moved these

rjmccall added inline comments.Aug 3 2020, 10:31 AM
clang/lib/CodeGen/TargetInfo.cpp
1997

Why? inreg is essentially orthogonal.

arsenm added inline comments.Aug 4 2020, 11:49 AM
clang/lib/CodeGen/TargetInfo.cpp
1997

Mostly inherited from the other similar attribute handling. It can be lifted if there's a use

arsenm added inline comments.Aug 5 2020, 6:03 AM
clang/lib/CodeGen/TargetInfo.cpp
1997

Plus the name here is isArgInAlloca; this is not necessarily passed in an alloca

rjmccall added inline comments.Aug 5 2020, 11:25 AM
clang/lib/CodeGen/TargetInfo.cpp
1997

I agree that we don't need to update this.

8814

byref is interpreted by your backend passes as an instruction that the argument value is actually the address of an object that's passed to the kernel by value, so you need to expand the memory in the kernel argument marshalling. Why would that be something you'd want to trigger when passing a struct with a pointer in it? You're not going to recursively copy and pass down the pointee values of those pointers.

arsenm added inline comments.Aug 5 2020, 11:36 AM
clang/lib/CodeGen/TargetInfo.cpp
8814

Because all arguments are really passed byref, we're just not at the point yet where we can switch all IR arguments to use byref for all arguments. All of the relevant properties are really always on the in-memory value.

The promotion this is talking about is really orthogonal to the IR mechanism used for passing kernel arguments. This promotion is because the language only exposes generic pointers. In the context of a pointer inside a struct passed as a kernel argument, we semantically know the address space of any valid pointers must be global. You could not produce a valid generic pointer from another address space here. The pointers/structs are still the same size and layout, but coercing the in-memory address space is semantically more useful to the optimizer

rjmccall added inline comments.Aug 5 2020, 11:46 AM
clang/lib/CodeGen/TargetInfo.cpp
8814

I understand that the promotion is orthogonal to the IR mechanism used for passing kernel arguments, which is exactly why I'm asking why there's a comment saying that we should "use byref when promoting pointers in struct", because I have no idea what that's supposed to mean when the pointer is just a part of the value being passed.

It sounds like what you want is to maybe customize the code that's emitted to copy a byref parameter into a parameter variable when the parameter type is a struct containing a pointer you want to promote. But that doesn't really have anything to do with byref; if you weren't using byref, you'd still want a similar customization when creating the parameter variable. So it seems to me that the comment is still off-target.

arsenm updated this revision to Diff 283345.Aug 5 2020, 12:32 PM

Reword comment

rjmccall accepted this revision.Aug 6 2020, 12:26 PM

Thanks, LGTM.

This revision is now accepted and ready to land.Aug 6 2020, 12:26 PM