This is an archive of the discontinued LLVM Phabricator instance.

[ARM] DLS/LE low-overhead loop code generation
ClosedPublic

Authored by samparker on Jun 18 2019, 12:36 AM.

Details

Summary

Introduce three pseudo instructions to be used during DAG ISel to represent v8.1-m low-overhead loops. One maps to set_loop_iterations while loop_decrement_reg is lowered to two, so that we can separate the decrement and branching operations. The pseudo instructions are expanded pre-emission where we can still decide whether we actually want to generate a low-overhead loop. The pass currently bails, revering to an sub, icmp and br, in the cases where a call or stack spill/restore happens between the decrement and branching instructions.

Diff Detail

Event Timeline

samparker created this revision.Jun 18 2019, 12:36 AM
SjoerdMeijer added inline comments.Jun 20 2019, 6:59 AM
lib/Target/ARM/ARMFinalizeLoops.cpp
139 ↗(On Diff #205264)

if Revert is true here at some point, can we stop iterating over the rest of the blocks/instructions?

lib/Target/ARM/ARMISelDAGToDAG.cpp
2995

nit: can this be simplified using getIntrinsicID()?

3004

another nit: indentation a bit off?

forgot to say: went through this for the first time, some first nits in my previous comment. Will continue looking tomorrow.

SjoerdMeijer added inline comments.Jun 21 2019, 8:37 AM
lib/Target/ARM/ARMFinalizeLoops.cpp
25 ↗(On Diff #205264)

Nit, perhaps: "ARM low-overhead loop..."

175 ↗(On Diff #205264)

This looks fine for now. It might be that WLS/DLS is an expensive MOV instruction, that we possibly don't even need. But I think that's an optimisation that we can worry about later.

lib/Target/ARM/ARMInstrThumb2.td
5194

nit, perhaps } // isNotDuplicable = 1

test/Transforms/HardwareLoops/ARM/massive.mir
2

yes, it is massive! :-) But I think we can simplify this a lot by using intrinsic @llvm.arm.space:

// A space-consuming intrinsic primarily for testing ARMConstantIslands. The
// first argument is the number of bytes this "instruction" takes up, the second
// and return value are essentially chains, used to force ordering during ISel.
def int_arm_space : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], []>;

as mentioned constant island tests are using this, I think you want to do something similar here.

How confident are you that this is always valid? From what I understand it creates an assumption that the code will not grow past this point in the pipeline (technically that the LE will not move further from the loop start). What would stop, for example, constant island pass from putting a constant pool into the loop?

And as Sjoerd says, I found out that DLS doesn't really do anything and can be ignored (all the "smarts" are in the LE). WLS obviously does more, and a DLSTP is important for tail predication. But if the count is already in lr, we needn't emit the DLS.

> How confident are you that this is always valid?

What are our options here?

I guess one of them is making sure literal pools won't get placed inside loops, or reordering this:

addPass(createARMFinalizeLoopsPass());
addPass(createARMConstantIslandPass());

or is there a reason we can't do this?

samparker marked 2 inline comments as done.Jun 24 2019, 12:58 AM
samparker added inline comments.
lib/Target/ARM/ARMFinalizeLoops.cpp
139 ↗(On Diff #205264)

Maybe... We'd still need to find 'End' and that would be the terminator, but I guess it could help prevent visiting other blocks. I reorder the block iterator and that should work.

test/Transforms/HardwareLoops/ARM/massive.mir
2

Bingo! Thanks

I will try to reorder the final passes. I hope that I can change the size of the pseudo instructions to be pessimistically big enough to be a cmp and br. I imagine that TTI will have to try to calculate the size, or at least the amount of live variables, so that these loops don't cause unnecessary register pressure and actually slow things down because of spills.

samparker updated this revision to Diff 206171.Jun 24 2019, 2:20 AM
  • Renamed all the things to low-overhead loops.
  • Used the arm.space intrinsic and added another test for the edge case.
  • Reversed the order of the search for LoopEnd and LoopDec, breaking early if possible.
  • Switched the order of constant island and low-overhead loops.

I don't know of any reason why this way round wouldn't work, so long as we tell constant island pass that the size of the instructions is large enough. It might be a little inefficient at times? I think the version of this on the branch put the handling _into_ constant island pass, but that version worked a little differently.

SjoerdMeijer accepted this revision.Jun 24 2019, 1:12 PM

This looks good to me. I.e., the approach of expanding the pseudos as late as possible, and the availability of size estimation in TTI for these pseudos, seems the right approach to me. I hope that when overestimate the size, if necessary, that this will be always be safe.

So I am happy with this, if Dave is happy with this too.

test/Transforms/HardwareLoops/ARM/massive.mir
47

a proper nit: just a minor clean up, but we don't need these comments on the functions

57

and this stack protector stuff

63

and these attributes

This revision is now accepted and ready to land.Jun 24 2019, 1:12 PM

This looks pretty neat to me. I have some comments that could be done in followups (as in, if its broken, we can fix it later).

  • It may be better to put this into constant island pass instead, to use its iterative nature and have it remain as the only place that deals with sizes.
  • Can you put a nice long comment into somewhere like the start of ARMFinalizeHardwareLoops explaining things like what the pseudos do and why they are needed (they are for registry allocation, essentially?), and what assumptions this is making about them remaining in specific blocks.
lib/Target/ARM/ARMInstrThumb2.td
5193

12 bytes for something that should usually take 4 seems overly pessimistic to me, and may pessimise other optimisations. Can we get this smaller somehow? I would expect the fallback to probably be (subs; bne) ideally.

lib/Target/ARM/ARMLowOverheadLoops.cpp
76

How expensive is this? We might as well not do it for cores that won't have low overhead loops.

96

These don't really seem to add much to me :)

135

Can you explain this a little? Is it for correctness or performance? Because we can't merge the two instructions?

148

I think this should give a better error message in release builds. report_fatal_error or similar. Otherwise It may just miscompile. It should never happen, as far as I understand?

165

Formatting

173

Format

201

This is just the same call, with one parameter different?

samparker marked 6 inline comments as done.Jun 25 2019, 12:35 AM

I will add some comments, but I really don't think this belongs in constant islands. This doesn't have to worry about iterative changes, the loop size may only vary by 8 bytes, which is nothing compared to the 4KB that we need to concern ourselves with. Plus this is a very specific pass, especially once we start having to handle the tail predicated loops!

lib/Target/ARM/ARMInstrThumb2.td
5193

12 is what this implementation will produce though in the fallback case. When we use subs, we can reduce it.

lib/Target/ARM/ARMLowOverheadLoops.cpp
76

Good point! I'll add a check and an exit.

96

All in good time... while and the vector ones will follow. But yes, LoopDec can go.

135

Sure, I'll add a comment. To summarise though, its because the decremented value of LR has probably been stored to the stack - but this decrement isn't really going to happen until the end of the block, where we can't spill. If we want the value of LR to be on the stack, we'd have to perform a manual sub. Added to this that we're probably reloading LR at the bottom of the loop for LE, we'd have to either perform an add before LE or use the other form of LE that doesn't perform the decrement.

148

cheers!

201

Yes, i know it looks awkward... I'll try to see if I can make it nicer.

I will add some comments, but I really don't think this belongs in constant islands. This doesn't have to worry about iterative changes, the loop size may only vary by 8 bytes, which is nothing compared to the 4KB that we need to concern ourselves with. Plus this is a very specific pass, especially once we start having to handle the tail predicated loops!

Its not really about the 4KB range of a LE instruction, I agree that's not super important, but the much smaller range of a cbz for example. If we are over-estimating the size of the loop in constant island pass, we may loose out on other optimisations we would otherwise have performed.

lib/Target/ARM/ARMLowOverheadLoops.cpp
76

Also, I'm not sure it's doing everything it should, and might not be calculating offsets. Can you add a test with multiple loop bbs that together go over the limit?

samparker marked an inline comment as done.Jun 25 2019, 2:43 AM
samparker added inline comments.
lib/Target/ARM/ARMLowOverheadLoops.cpp
76

sounds good.

This revision was automatically updated to reflect the committed changes.