This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Initialize LDS pointers after alloca, but before call.
Abandoned · Public

Authored by hsmhsm on Sep 10 2021, 4:50 AM.

Details

Summary

The LDS pointers need to be initialized within the entry basic block of
the kernel(s), after all allocas but before any call instruction. If this
is not possible, we skip running this pass for now.

Ideally, an alloca can appear anywhere within the function, and the AMDGPU
backend should be able to handle it, but at the moment it cannot. Once the
AMDGPU backend is able to robustly handle an alloca inserted anywhere, this
hack will no longer be required.
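
As a rough illustration only (the helper name and signature below are invented, not taken from the patch), the insertion-point search described above could look like this in LLVM's C++ API:

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/InstrTypes.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Return the instruction before which the LDS pointer stores can be
  // inserted: after all allocas in the entry block, but before any call.
  // Returns nullptr (skip the pass) if an alloca follows a call.
  static Instruction *findLDSInitInsertionPoint(BasicBlock &Entry) {
    Instruction *LastAlloca = nullptr;
    Instruction *FirstCall = nullptr;
    for (Instruction &I : Entry) {
      if (isa<AllocaInst>(I))
        LastAlloca = &I;
      else if (!FirstCall && isa<CallBase>(I))
        FirstCall = &I;
    }
    if (!LastAlloca)
      return &*Entry.getFirstInsertionPt();
    if (FirstCall && FirstCall->comesBefore(LastAlloca))
      return nullptr; // An alloca lives after a call: give up for now.
    return LastAlloca->getNextNode();
  }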

Diff Detail

Event Timeline

hsmhsm created this revision.Sep 10 2021, 4:50 AM
hsmhsm requested review of this revision.Sep 10 2021, 4:50 AM
Herald added a project: Restricted Project.Sep 10 2021, 4:50 AM
JonChesterfield added a comment.EditedSep 13 2021, 3:37 AM

I don't understand the precondition. Why after every alloca but before every call?

It seems like you want to introduce N stores to addrspace(3) globals, where each store needs to precede any loads from that global. I can see 'call' as a rough approximation for 'might load', but why is alloca involved at all?

edit: commit message should change to state why this precondition is required, so that later people trying to work out why we're transforming variables like this can determine if the constraint still holds. E.g.

  • why are we splitting the basic block instead of inserting a relaxed atomic store
  • if an alloca outside of the entry BB is broken and this transform moves calls out of the entry block, and those calls are then inlined introducing alloca, then this transform is broken, please document whatever makes that OK

It seems likely that allowing alloca outside of the entry bb would be a trivial change (just hoist them to entry), would simplify this work, and would unbreak other things.
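
For illustration, a minimal sketch of that hoisting idea in LLVM's C++ API (the helper name is hypothetical, and lifetime subtleties such as allocas inside loops are deliberately ignored):

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Move every constant-sized alloca into the entry block. The entry block
  // dominates all other blocks, so the move preserves dominance; variable-
  // sized allocas are left alone since their size operand might not
  // dominate the entry block.
  static void hoistAllocasToEntry(Function &F) {
    BasicBlock &Entry = F.getEntryBlock();
    Instruction *InsertPt = &*Entry.getFirstInsertionPt();
    SmallVector<AllocaInst *, 8> ToHoist;
    for (BasicBlock &BB : F) {
      if (&BB == &Entry)
        continue;
      for (Instruction &I : BB)
        if (auto *AI = dyn_cast<AllocaInst>(&I))
          if (isa<ConstantInt>(AI->getArraySize()))
            ToHoist.push_back(AI);
    }
    for (AllocaInst *AI : ToHoist)
      AI->moveBefore(InsertPt);
  }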

foad added a comment.Sep 13 2021, 3:49 AM

I don't understand the precondition. Why after every alloca but before every call?

It seems like you want to introduce N stores to addrspace(3) globals, where each store needs to precede any loads from that global. I can see 'call' as a rough approximation for 'might load', but why is alloca involved at all?

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static (allocated very efficiently on entry to the function) and allocas elsewhere are dynamic (allocated less efficiently, with knock-on effects like forcing you to reserve a frame pointer register, and possibly not even supported by the AMDGPU backend?). See also D108971 for some prior discussion.
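
For what it's worth, the static/dynamic distinction described here is directly queryable: AllocaInst::isStaticAlloca() is true exactly for constant-sized allocas in the entry block. A small hypothetical helper:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // True if any alloca in F is "dynamic" in the sense described above,
  // i.e. not a constant-sized alloca in the entry block.
  static bool hasDynamicAlloca(const Function &F) {
    for (const BasicBlock &BB : F)
      for (const Instruction &I : BB)
        if (const auto *AI = dyn_cast<AllocaInst>(&I))
          if (!AI->isStaticAlloca())
            return true; // Forces frame-pointer reservation etc.
    return false;
  }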

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static...

That sounds right. However, this transform, by moving calls out of the entry block, will itself have that effect if those calls are inlined.

Alternatives are to emit the stores (probably as relaxed atomic) in the entry block, such that every lane executes it but we don't split the CFG, or to add a fairly late pass that hoists alloca into the entry bb.

I'd be inclined to do both. Hoisting 'dynamic' alloca into entry will fix some miscompilation (I haven't looked recently, but ~ 6 months ago alloca outside of entry was an error in the backend) and/or make things faster. Emitting the store from all lanes instead of branching means, well, less branching, but also we don't rearrange the entry block into a CFG.

If an atomic store of a uniform value is better expressed as masking off all lanes but one, I suspect we're better off doing that transform once exec is available for manipulation. Somewhere in MIR.
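
For reference, a sketch of the first alternative (emit the store as a relaxed atomic in the entry block, executed by every lane, with no CFG split); the variable names are placeholders, not code from the patch:

  #include "llvm/IR/IRBuilder.h"

  using namespace llvm;

  static void emitRelaxedInit(BasicBlock &Entry, Value *LDSPointer,
                              Value *InitValue) {
    // All lanes execute the same store of the same value; monotonic is
    // LLVM's spelling of a C++ "relaxed" atomic.
    IRBuilder<> B(&*Entry.getFirstInsertionPt());
    StoreInst *SI = B.CreateStore(InitValue, LDSPointer);
    SI->setAtomic(AtomicOrdering::Monotonic);
  }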

hsmhsm added a comment.EditedSep 13 2021, 9:20 AM

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static...

That sounds right. However, this transform, by moving calls out of the entry block, will itself have that effect if those calls are inlined.

Alternatives are to emit the stores (probably as relaxed atomic) in the entry block, such that every lane executes it but we don't split the CFG, or to add a fairly late pass that hoists alloca into the entry bb.

I'd be inclined to do both. Hoisting 'dynamic' alloca into entry will fix some miscompilation (I haven't looked recently, but ~ 6 months ago alloca outside of entry was an error in the backend) and/or make things faster. Emitting the store from all lanes instead of branching means, well, less branching, but also we don't rearrange the entry block into a CFG.

If an atomic store of a uniform value is better expressed as masking off all lanes but one, I suspect we're better off doing that transform once exec is available for manipulation. Somewhere in MIR.

Often (probably because there are too many things on one's plate) we tend to forget what we discussed and finalized in the past, except for the one who actually works on it.

After a lengthy discussion of the possible ways of initializing the LDS pointers in the entry block of kernels (including the relaxed-atomic one), we settled on having lane 0 of each wave do the initialization (which is exactly the stores you indicated above). So I did not take the relaxed-atomic route.

Yes, the stores to the LDS pointers should happen within the entry basic block of the kernel, before they are loaded back within non-kernel functions (that is, before any call within the kernel).

Now, we need to do these stores after the allocas; otherwise the allocas will be moved out of the entry basic block into the newly split block, which would have adverse side effects.

By the time this pass runs, the inlining pass has already run, but I am still defensively checking for the possibility of an alloca living somewhere after a call (in the entry block), and I avoid running this pass in that case, which is a safe bet from the correctness point of view.

As Jay mentioned, though in theory an alloca can live anywhere within the function, it is better and safer to cluster them at the beginning of the entry block, and that is what usually happens. Please refer to (1) https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html and (2) https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas
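
For context, here is roughly what the lane-0 scheme could look like when sketched with LLVM's C++ API. The helper and its arguments are hypothetical (mbcnt-based lane-index computation is a common AMDGPU pattern, but the actual patch may differ); note the SplitBlockAndInsertIfThen call, which is the CFG split being debated:

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/IntrinsicsAMDGPU.h"
  #include "llvm/Transforms/Utils/BasicBlockUtils.h"

  using namespace llvm;

  static void emitLaneZeroInit(Instruction *InsertBefore, Value *LDSPointer,
                               Value *InitValue) {
    IRBuilder<> B(InsertBefore);
    // Wave-relative lane index: mbcnt counts the active lanes below this one.
    Value *AllOnes = B.getInt32(-1);
    Value *Lo = B.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},
                                  {AllOnes, B.getInt32(0)});
    Value *LaneId = B.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_hi, {},
                                      {AllOnes, Lo});
    Value *IsLane0 = B.CreateICmpEQ(LaneId, B.getInt32(0));
    // Cuts the entry block in two and inserts a guarded "then" block in
    // which only lane 0 performs the store.
    Instruction *ThenTerm = SplitBlockAndInsertIfThen(IsLane0, InsertBefore,
                                                      /*Unreachable=*/false);
    IRBuilder<>(ThenTerm).CreateStore(InitValue, LDSPointer);
  }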

arsenm added inline comments.Sep 13 2021, 10:58 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
443

I don't like this arbitrary condition. Why not just collect all allocas and re-insert them at the start of the entry block?

hsmhsm marked an inline comment as done.Sep 13 2021, 6:01 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
443

I do not like it either.

But I do not know how *neat* or how *quick* an implementation of moving all allocas to the entry block would be. While researching, I found in some LLVM discussion that there was an effort (not sure if it was a pass) to do that generically, but it was later removed for some (genuine) reason.

Considering how important it is, from the LLVM point of view, to keep allocas in the entry block, I think someone would have done it already if it were feasible. So I do not want to deviate again from my main task of the LDS work and put it on hold for an unknown time, as it has already been delayed so long for one reason or another.

That said, I have it on my plate to investigate the alloca handling generically (for LLVM) and the constraints on it in the AMDGPU backend, and to fix it as neatly as possible. But it will not be a one- or two-day task, and I cannot put the LDS work on hold until the alloca task is complete.

Finally, as said, it is a temporary hack, and I will remove it once I fix the alloca handling.

hsmhsm marked an inline comment as done.Sep 13 2021, 7:13 PM

Since this isn’t required for correctness, this should just skip the allocas clustered at the beginning. There’s no need to scan beyond or consider calls

Since this isn’t required for correctness, this should just skip the allocas clustered at the beginning. There’s no need to scan beyond or consider calls

That is what I did in the patch https://reviews.llvm.org/D108971, and then, after enabling this LDS pass in https://reviews.llvm.org/D109062, we got some OpenMP build-bot test failures and I had to revert the patches. So I added code to scan for allocas beyond calls to avoid surprising failures. We really need to take up this alloca issue as a separate task, and I do not think we can do any quick fix for it. Until then, for LDS, we need to live with this hack.

Often (probably because there are too many things on one's plate) we tend to forget what we discussed and finalized in the past, except for the one who actually works on it.

After a lengthy discussion of the possible ways of initializing the LDS pointers in the entry block of kernels (including the relaxed-atomic one), we settled on having lane 0 of each wave do the initialization (which is exactly the stores you indicated above). So I did not take the relaxed-atomic route.

This doesn't sound right. I remember a preference for dropping to lane 0 on power efficiency grounds. I also remember a proposal to do that by modifying the exec mask in IR. I don't recall any discussion of the costs of splitting the entry block, or a discussion of the consequences of that if calls later in the entry block are subsequently inlined. The previous round of apply/fail/revert, and how you're trying to work around it here, suggests that is indeed a hazard.

Instead of adding, in your words, "temporary hacks", let's fix the alloca lowering so it cannot break us here or elsewhere. By iterating over the function and moving all alloca into the entry block, or by doing the call frame setup thing in the back end. Just start from the error message from llc and work backwards.

I'd also prefer we make the branch vs power tradeoff in MIR instead of IR as we generally try to avoid splitting basic blocks (since doing so usually blocks optimisations), but care less about that than about not miscompiling openmp.

hsmhsm added a comment.EditedSep 14 2021, 3:40 AM

Often (probably because there are too many things on one's plate) we tend to forget what we discussed and finalized in the past, except for the one who actually works on it.

After a lengthy discussion of the possible ways of initializing the LDS pointers in the entry block of kernels (including the relaxed-atomic one), we settled on having lane 0 of each wave do the initialization (which is exactly the stores you indicated above). So I did not take the relaxed-atomic route.

This doesn't sound right. I remember a preference for dropping to lane 0 on power efficiency grounds. I also remember a proposal to do that by modifying the exec mask in IR. I don't recall any discussion of the costs of splitting the entry block, or a discussion of the consequences of that if calls later in the entry block are subsequently inlined. The previous round of apply/fail/revert, and how you're trying to work around it here, suggests that is indeed a hazard.

Instead of adding, in your words, "temporary hacks", let's fix the alloca lowering so it cannot break us here or elsewhere. By iterating over the function and moving all alloca into the entry block, or by doing the call frame setup thing in the back end. Just start from the error message from llc and work backwards.

I'd also prefer we make the branch vs power tradeoff in MIR instead of IR as we generally try to avoid splitting basic blocks (since doing so usually blocks optimisations), but care less about that than about not miscompiling openmp.

  • The proposal to do LDS pointer initialization by modifying the exec mask in (LLVM) IR is completely ruled out because it is hacky. At the IR level, there is no clean way of modifying and maintaining the exec mask (please check with Tony); he is clearly opposed to it.
  • I did not understand a priori that splitting the entry basic block (so that only lane 0 does the initialization) would result in surprising failures like this. I realized it only after the implementation and the test failures. So discussing it a priori was ruled out.
  • Fixing the alloca lowering will be a completely separate task, and I do not even know the complexity of that work at the moment. I also have my doubts: if moving all allocas to the beginning of the entry basic block is as simple as you are suggesting here, why has no one from the LLVM community attempted it to date, considering the importance of keeping allocas at the beginning of the entry basic block for better optimization?
  • Moving on to work on the allocas means this LDS pass has to wait until then, and I am not sure we have that much luxury of time considering the importance of this pass. So I came up with this temporary patch until we clean up the alloca handling, which is a valid approach in my opinion. That said, if all of us, including the higher technical management team, are okay with this patch waiting for an arbitrary time, I am fine with that too.
  • This is a late LLVM IR pass; I do not see how it is linked to MIR.

Modifying exec from IR is right out. I was referring back to the internal discussion on how to implement this where that was briefly considered.

Given we now know more than when the original decision was taken, let us revisit that decision and not split the block. MIR is a point where this and all other uniform stores could be optimised for power consumption by masking exec, without changing IR optimisations.

Moving all alloca to entry was proposed on the mailing list some years ago and implemented in at least one out of tree target. I suspect it is not already done in tree because codegen is more efficient on most CPU targets without it. I doubt it is especially complicated to implement - the entry block dominates the other blocks in the function.

I'm not convinced the transform being enabled here is necessary, and have previously outlined a variety of alternatives which you remain unwilling to consider. My interest here is solely in avoiding your "temporary hacks" breaking openmp.

Modifying exec from IR is right out. I was referring back to the internal discussion on how to implement this where that was briefly considered.

Given we now know more than when the original decision was taken, let us revisit that decision and not split the block. MIR is a point where this and all other uniform stores could be optimised for power consumption by masking exec, without changing IR optimisations.

If you think so, then let's have an internal discussion involving all the key folks and take a decision; I am fine with whatever decision is taken.

Moving all alloca to entry was proposed on the mailing list some years ago and implemented in at least one out of tree target. I suspect it is not already done in tree because codegen is more efficient on most CPU targets without it. I doubt it is especially complicated to implement - the entry block dominates the other blocks in the function.

That is only your suspicion, and I have my own. We will not have a clear answer until we ask the community itself; let me try that too. We also do not know how the out-of-tree target implemented it; it could be hacky, with some serious constraints that just happen to work for them.

I'm not convinced the transform being enabled here is necessary, and have previously outlined a variety of alternatives which you remain unwilling to consider. My interest here is solely in avoiding your "temporary hacks" breaking openmp.

I am not sure which of the *variety* of alternatives discussed here are quickly and neatly implementable. Can you outline them again here?

arsenm added inline comments.Sep 14 2021, 5:50 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
195

All this patch should do is skip the insertion point past the allocas clustered at the start of the entry block. If there are further allocas in the program which may be broken, you should not be concerning yourself with them. I don't want to add a workaround for non-entry allocas here
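
A sketch of that simpler scan, with a hypothetical helper name:

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Return the first non-alloca instruction, i.e. the point just past the
  // allocas clustered at the start of the entry block.
  static Instruction *skipClusteredAllocas(BasicBlock &Entry) {
    BasicBlock::iterator It = Entry.begin();
    while (isa<AllocaInst>(&*It))
      ++It;
    return &*It; // Always valid: the terminator is never an alloca.
  }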

hsmhsm abandoned this revision.Sep 15 2021, 10:36 PM

Abandoning this patch, since it is no longer relevant as per the internal email discussion. Submitted a new patch to enable the LDS pointer replacement pass at https://reviews.llvm.org/D109870