clang: Add address space to indirect abi info and use it for kernels
Needs Review (Public)

Authored by arsenm on Mon, May 11, 1:55 PM.

Details

Summary

Previously, indirect arguments assumed assumed a stack passed object
in the alloca address space. A stack pointer is unsuitable for kernel
arguments, which are passed in a separate, constant buffer with a
different address space.

Start using byval for aggregate kernel arguments. Previously these
were emitted as raw struct arguments, and turned into loads in the
backend. These will lower identically, although with byval you now
have the option of applying an explicit alignment. In the future, a
reasonable implementation would use byval for all kernel arguments
(this would be a practical problem at the moment due to losing things
like noalias on pointer arguments).

This is mostly to avoid fighting the optimizer's treatment of
aggregate load/store. SROA and instcombine both turn aggregate loads
and stores into a long sequence of element loads and stores, rather
than the optimizable memcpy I would expect in this situation. Now an
explicit memcpy will be introduced up-front which is better understood
and helps eliminate the alloca in more situations.

Most of the language surrounding byval involves the stack; however, I
can't find any real reason this needs to be limited to a stack address
space. My main concern is that nothing seems to explicitly disallow
writing to a byval address, but it is illegal to write to this read
only pointer. A theoretical stack space reusing pass might try to
reuse some byval bytes. The only one we have now is the stack coloring
machine pass, which would never see a frame index for these byval
arguments. I could go a step further and add the readonly attribute to
these to make sure writes should never be introduced, although this
patch doesn't do that yet. There is code in clang that explicitly
removes readnone from byval arguments for some reason.

Event Timeline

arsenm created this revision. Mon, May 11, 1:55 PM
arsenm updated this revision to Diff 263274. Mon, May 11, 2:02 PM

Forgot to commit a new test

Typo in commit message: "Previously, indirect arguments assumed assumed".

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

The parameter variable is formally considered to be in a particular address space, and taking the address of it yields a pointer in that address space. That can only work for an indirect argument if either (1) the address space of the pointer that's actually passed can be freely converted to that formal address space (because it's a subspace, or because it's a superspace but known to be in the right subspace) or (2) we're willing to copy the object into the right address space on the callee side (and able to — obviously this only works for POD types). I do see the merit of allowing an address space to be specified for targets that consider locals to be in a different formal address space than the stack would naturally be (e.g. a generic address space); I don't remember if that applies to AMDGPU.

Kernel arguments aren't directly visible to the user program, so this is an implementation detail. The user variable is the alloca that we need to explicitly copy to as you mentioned, which this patch does. It's possible to poke at these through intrinsics, but the kernel address space isn't part of the language. POD type or not, we're going to have to do something to unload values from the constant buffer onto the stack.

byval in LLVM is not a generic "this is a by-value argument" annotation; it specifically means that the value is actually passed on the stack despite the formally indirect convention, and therefore there's an implicit memcpy on the caller side. That is why byval is inherently tied to the alloca address space: there's no actual pointer being passed, so it doesn't make sense to pretend it might have been promoted into a different address space. That is also why there is no restriction to writing into the pointer: there's no reason to prevent writing to the argument slot.

In this case there is never a call. Only the callee read exists as this is the entry point. What would the alternative be? Add another flavor of byval attribute that's nearly identical?

Kernel arguments aren't directly visible to the user program, so this is an implementation detail. The user variable is the alloca that we need to explicitly copy to as you mentioned, which this patch does.

It's usually a goal of indirect arguments to not need this copy. We usually bind parameters directly to the pointer that was passed in.

In this case there is never a call. Only the callee read exists as this is the entry point. What would the alternative be? Add another flavor of byval attribute that's nearly identical?

It's hard to answer this because I'm not sure what you're trying to accomplish by marking the arguments byval.

The short answer is I'm trying to avoid fighting the optimizer's handling of aggregate loads and stores. Passes already understand byval, but are actively harmful to aggregate loads and stores. Having explicit loads in the IR also brings it closer to what these are ultimately lowered to. Currently we emit a raw struct argument, which produces an aggregate store to alloca. Both SROA and instcombine unhelpfully split up the aggregate store, which doesn't optimize nicely like the memcpy from constant memory to alloca. The end result is that we end up with more allocas and copies from constant memory when the code could have just read directly from the constant pointer.

The most honest calling convention handling would be to have an explicit constant pointer that all arguments are read from, and the function would have an empty argument list. This is somewhat impractical, since we still need to track the argument sizes and offsets somehow in the IR. It also becomes much more difficult to emit nicely annotated IR, since things like noalias still largely expect to be attached to a function argument. We do have an optimization pass that rewrites the arguments to look like this to enable vectorization later in the backend, but it suffers from losing useful alias annotations.
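As a rough illustration of the "explicit constant pointer" convention described above, here is a host-side C++ sketch that models the kernel argument buffer as read-only bytes and reads each argument with one memcpy from a fixed offset. All names (`KernArgs`, `load_arg`, `run_kernel`, `Params`) and the offsets are hypothetical; this is a model of the idea under those assumptions, not AMDGPU's actual lowering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical aggregate kernel argument.
struct Params {
    int   count;
    float scale;
};

// Host-side model of the argument buffer. Assumption: the real buffer lives
// in a constant address space; here it is simply a span of const bytes.
struct KernArgs {
    const unsigned char* base;
    std::size_t          size;
};

// Read one argument at a fixed byte offset. The single memcpy stands in for
// the up-front copy from constant memory into a mutable local.
template <typename T>
T load_arg(const KernArgs& args, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "arguments are copied bytewise");
    assert(offset + sizeof(T) <= args.size);
    T local;
    std::memcpy(&local, args.base + offset, sizeof(T));
    return local;
}

// The "kernel" has no formal arguments of its own: everything is read from
// the buffer, and only the local copies are ever modified.
float run_kernel(const KernArgs& args) {
    Params p    = load_arg<Params>(args, 0);
    int    bias = load_arg<int>(args, sizeof(Params));
    p.count += bias;  // writes the local copy, never the buffer
    return p.scale * static_cast<float>(p.count);
}
```

The sketch also shows the bookkeeping problem raised above: the offsets and sizes have to be tracked somewhere once the arguments disappear from the signature.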

Okay. So the only real ABI here is the layout of the memory that the arguments are actually written into? And that memory needs to be treated as constant?

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

Okay. So the only real ABI here is the layout of the memory that the arguments are actually written into? And that memory needs to be treated as constant?

Yes, the actual kernel ABI is supposed to be invisible.

Unfortunately, I think byval just doesn't match what you want because of the mutability — the frontend *has* to have a copy into a local to get IR with correct semantics, because byval is assumed to be locally mutable by both IR-generation and (potentially) LLVM optimization. And I don't think you really want non-byval indirect. So I guess the question is what we can do in the frontend to get the optimizer behavior you need.

You are allowed to have readonly on a byval pointer argument, in which case optimizations wouldn't be allowed to write into it. Is just adding readonly parameter attributes sufficient? It would be somewhat contrived, but could also define byval as constant if it's not in the alloca address space.

Another option is to add a form of indirect argument passing that just loads from a constant offset from an intrinsic call. We would still need to leave an unused struct argument in the function argument list to keep track of sizes, and would lose the ability to add an explicit alignment. We would also be mixing multiple ways of accessing arguments, which is gross but survivable.

byval is fundamentally about expressing that something is passed on the stack. If you want an indirect readonly noalias argument, you can make one; however, to convince IRGen to do it, you really need to think of this as a new kind of argument-passing, because "pass the address of an immutable object that the callee can't modify" is not something that we normally need to do in calling-convention lowering. That feels like a lot of complexity to solve a rather narrow problem.

A completely different approach: OpenMP has to solve some very similar problems and just lowers them completely in the frontend; have you considered just doing that? Kernels need a ton of special-case handling anyway, and IIUC you can never optimize over the boundary anyway.

Yes, this is what I was describing. The problem is that this ends up hurting optimizations, because we lose things like noalias on the pointer arguments. I also would still need to track the argument size/offset/align information in the IR, but for that we could keep on doing what we're doing and just never use the kernel arguments. One of the advantages of byval was an explicit align field to track this.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Two argument pointers, one of which is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory. Hence, the readonly is pretty useless *in the callee* if other accesses can write the memory. The readonly is useful in the caller though, usually if we basically have noalias information there (e.g., it is an alloca). Noalias is useful in the callee regardless of readonly, but even better with it.
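A loose C++ analogy for the per-pointer point, assuming `const` on a pointee as a stand-in for readonly (the semantics are not identical to the LLVM attribute, and `read_write_read` is a hypothetical name): the function never writes through `ro`, yet the object it points to can still change under it when `rw` aliases it.

```cpp
#include <cassert>

// `ro` is only ever read through (roughly what readonly promises), but that
// says nothing about writes made through other pointers to the same object.
int read_write_read(const int* ro, int* rw) {
    int before = *ro;
    *rw = before + 1;  // may modify the object `ro` points to
    int after = *ro;   // must be reloaded: the value may have changed
    return after - before;
}
```

With distinct objects the result is 0; with `ro == rw` it is 1. Only a no-alias guarantee lets the second load be folded into the first.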

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

The problem is that it is a "per-pointer" attribute and not "per-object". Two argument pointers, one of which is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readonly on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant, which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. Those are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree, but apart from address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we are talking about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year; it was a big one and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?
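For readers unfamiliar with the construct, a minimal sketch of local restrict-qualified variables follows. The function name `scale_add` is hypothetical, and `__restrict` is a compiler extension in C++ (GCC, Clang, and MSVC accept this spelling). Whether the optimizer actually uses the qualifier on locals is exactly the open question discussed above, so this shows only the source-level form.

```cpp
#include <cassert>
#include <cstddef>

// The parameters carry no aliasing promise; the promise is made on the local
// variables instead. The caller must ensure the two arrays do not overlap.
void scale_add(double* out, const double* in, std::size_t n, double k) {
    double* __restrict       dst = out;  // local restrict-qualified pointers
    const double* __restrict src = in;
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```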

I thought I remembered Hal doing a lot of work on local restrict a few years ago. I'm probably just misremembering, or I didn't realize that the work never landed or got pulled out later.

Okay. So where we're at is that you'd like to add a new argument-passing convention that's basically "passed indirect but immutable", implying that the frontend has to copy it in order to create the mutable local parameter. That's not actually a totally ridiculous convention in principle, although it has poor worst-case behavior (copies on both sides), and that happens to be what the frontend will often have to conservatively emit. I would still prefer not to add new argument-passing conventions just to satisfy short-term optimization needs, though. Are there any other reasonable options?
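The callee side of a "passed indirect but immutable" convention can be pictured with the C++ sketch below. `Config` and `area_after_padding` are hypothetical names, and a `const` reference only approximates an immutable indirect argument. The point is just that the callee needs an explicit copy before it can treat the parameter as a mutable local.

```cpp
#include <cassert>

// Hypothetical aggregate argument.
struct Config {
    int width;
    int height;
};

// The incoming object may not be modified, so the callee copies it first.
// If the caller also had to copy to build `incoming`, that is the
// "copies on both sides" worst case.
int area_after_padding(const Config& incoming, int pad) {
    Config local = incoming;  // explicit copy creates the mutable parameter
    local.width  += pad;
    local.height += pad;
    return local.width * local.height;
}
```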

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program. There is no caller function, and in the future I would like to make a call to amdgpu_kernel an IR verifier error. (Technically OpenCL device enqueue is an exception to this, but we don't treat this as a call. Instead there's a lot of library magic to invoke the kernel. From the perspective of the callee nothing changes, since it's still not allowed to modify the incoming argument buffer and isn't aware it was called this way.)

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize, the 2.5 options I've come up with are:

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.
  2. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information).
  3. Keep clang IR generation unchanged, but move the pass that lowers arguments to loads earlier and hack out aggregate IR loads before SROA makes things worse. This is really just a kludgier version of option 2. We do ultimately do this late in the backend to enable vectorization, but it does seem to make the middle end optimizer unhappy.

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program. There is no caller function, and in the future I would like to make a call to amdgpu_kernel an IR verifier error. (Technically OpenCL device enqueue is an exception to this, but we don't treat this as a call. Instead there's a lot of library magic to invoke the kernel. From the perspective of the callee nothing changes, since it's still not allowed to modify the incoming argument buffer and isn't aware it was called this way.)

Did you consider callback annotation for the device enqueue call? While that might not change anything *now*, I'm expecting interesting optimization opportunities there at some point "soon".

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize, the 2.5 options I've come up with are:

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

And, as was noted in the byval lang ref patch (D79636), there is a reasonable argument to be made to phase out byval in favor of some explicit copying. If that happens, this solution should not be "the last byval user". Also, byval arguments could be used as a scratchpad by smart optimizations. I somehow don't believe this is a great choice, but by now I can see that the others aren't either.

Drive by, I haven't read the entire history.

I don't understand why noalias is even a concern if the whole buffer passed to the kernel is read-only. noalias is primarily about proving non-interference, but if you can tell IR that the buffer is read-only, that's a much more powerful statement.

The problem is that it is a "per-pointer" attribute and not "per-object". Given two argument pointers, where one is marked readonly, may still alias. Similarly, an access to a global, indirect accesses, ... can write the "readonly" memory.

Oh, do we really not have a way to mark that memory is known to be truly immutable for a time? That seems like a really bad limitation. It should be doable with a custom alias analysis at least.

noalias + readonly on an argument basically implies immutable for the function scope. I think we have invariant intrinsics that could do the trick as well, though I'm not too familiar with those. I was eventually hoping for paired/scoped llvm.assumes which would allow noalias + readnone again. Then there is invariant, which can be placed on a load instruction. Finally, TBAA has a "constant memory" tag (or something like that), but again it is a per-access thing. Those are all the in-tree ways I can think of right now.

Custom alias analysis can probably be used to some degree, but apart from address spaces I'm unsure we have much that you can attach to a pointer and that "really sticks".

Regardless, if you do need noalias, you should be able to emit the same IR that we'd emit if pointers to all the fields were assigned into restrict local variables and then only those variables were subsequently used.

We are still debating the local restrict pointer support. Once we merge that in it should be usable here.

I thought that was finished a few years ago; is it really not considered usable yet? Or does "we" not just mean LLVM here?

Yes, I meant "we" = LLVM here. Maybe we are talking about different things. I was referring to local restrict qualified variables, e.g., double * __restrict Ptr = .... The proposal to not just ignore the restrict (see https://godbolt.org/z/jLzjR3) came last year; it was a big one, and progress unfortunately stalled (partly my fault). Now we are just about to see a second push to get it done.
Is that what you meant too?
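For readers unfamiliar with the construct, a minimal sketch of local restrict-qualified variables follows. It assumes a compiler that accepts the `__restrict` extension in C++ (GCC, Clang, and MSVC do); the function `scale` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Scale `n` doubles from `src` into `dst`. The two local pointers are
// __restrict-qualified: a promise by the programmer that accesses through
// `out` and `in` do not overlap within this scope.
void scale(double *dst, const double *src, std::size_t n, double factor) {
    double *__restrict out = dst;
    const double *__restrict in = src;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}
```

The qualifier does not change what the function computes for non-overlapping buffers; the open question in the comments above is only whether the optimizer is allowed to exploit the promise when it appears on a local rather than on a parameter.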

I thought I remembered Hal doing a lot of work on local restrict a few years ago. I'm probably just misremembering, or I didn't realize that the work never landed or got pulled out later.

Okay. So where we're at is that you'd like to add a new argument-passing convention that's basically "passed indirect but immutable", implying that the frontend has to copy it in order to create the mutable local parameter. That's not actually a totally ridiculous convention in principle, although it has poor worst-case behavior (copies on both sides), and that happens to be what the frontend will often have to conservatively emit. I would still prefer not to add new argument-passing conventions just to satisfy short-term optimization needs, though. Are there any other reasonable options?
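A rough C++ sketch of what "passed indirect but immutable" would mean, including the copies-on-both-sides worst case. The names `Params`, `callee`, and `caller` are hypothetical, and the caller-side copy reflects the assumption that the caller cannot prove the original object stays unmodified for the duration of the call.

```cpp
#include <cassert>

struct Params {
    int scale;
    int bias;
};

// Callee under the hypothetical convention: it receives a pointer to
// storage it must not modify, so it copies to get a mutable parameter.
int callee(const Params *incoming) {
    Params p = *incoming;  // callee-side copy creates the mutable local
    p.bias += 1;           // the mutation touches only the copy
    return p.scale * 2 + p.bias;
}

// Conservative caller: it also copies, because it cannot prove that the
// original object is left untouched while the callee runs.
int caller(const Params &original) {
    Params tmp = original;  // caller-side copy
    return callee(&tmp);
}
```

When the callee never mutates its parameter, or the caller can prove the object is not written during the call, one of the two copies is unnecessary; the conservative case keeps both.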

For the purpose here, only the callee exists. This is essentially a freestanding function, the entry point to the program.

I'm definitely not going to let you add a new "generic" argument-passing convention and then only implement exactly the parts you need; that's just adding technical debt.

The load-from-constant nature needs to be exposed earlier, so I think this necessarily involves changing the convention lowering in some way and it's just a question of what it looks like. To summarize, the 2.5 options I've come up with are:

  1. Use constant byval parameters, as this patch does. This requires the least implementation effort but doesn't exactly fit in with byval as defined.

I assume you generate a byval caller in some later pass, which will then implicitly copy. Or do you lower byval in a nonstandard way in your backend?

  2. Replace all IR argument uses with loads from a constant offset from an intrinsic call. This still needs to leave the IR arguments in place because we do need to know the original argument sizes and offsets, but they would never have a use (or I would need to invent some other method of tracking this information).

I guess passing aggregate arguments using a normal indirect convention has this same tracking problem. You'd also have to copy on the caller side to maintain semantics, which is probably hard to eliminate, but I would guess byval has the same problem?

  3. Keep clang IR generation unchanged, but move the pass that lowers arguments to loads earlier and hack out aggregate IR loads before SROA makes things worse. This is really just a kludgier version of option 2. We do ultimately do this late in the backend to enable vectorization, but it does seem to make the middle end optimizer unhappy.

Yeah, Clang tries to avoid first-class aggregates for exactly this reason: LLVM does a terrible job generating code for them.