Introduce three pseudo instructions to be used during DAG ISel to represent v8.1-m low-overhead loops. One maps to set_loop_iterations while loop_decrement_reg is lowered to two, so that we can separate the decrement and branching operations. The pseudo instructions are expanded pre-emission where we can still decide whether we actually want to generate a low-overhead loop. The pass currently bails, revering to an sub, icmp and br, in the cases where a call or stack spill/restore happens between the decrement and branching instructions.
Details
Diff Detail
Event Timeline
forgot to say: went through this for the first time, some first nits in my previous comment. Will continue looking tomorrow.
lib/Target/ARM/ARMFinalizeLoops.cpp | ||
---|---|---|
25 ↗ | (On Diff #205264) | Nit, perhaps: "ARM low-overhead loop..." |
175 ↗ | (On Diff #205264) | This looks fine for now. It might be that WLS/DLS is an expensive MOV instruction, that we possibly don't even need. But I think that's an optimisation that we can worry about later. |
lib/Target/ARM/ARMInstrThumb2.td | ||
5194 | nit, perhaps } // isNotDuplicable = 1 | |
test/Transforms/HardwareLoops/ARM/massive.mir | ||
2 | yes, it is massive! :-) But I think we can simplify this a lot by using intrinsic @llvm.arm.space: // A space-consuming intrinsic primarily for testing ARMConstantIslands. The // first argument is the number of bytes this "instruction" takes up, the second // and return value are essentially chains, used to force ordering during ISel. def int_arm_space : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], []>; as mentioned constant island tests are using this, I think you want to do something similar here. |
How confident are you that this is always valid? From what I understand it creates an assumption that the code will not grow past this point in the pipeline (technically that the LE will not move further from the loop start). What would stop, for example, constant island pass from putting a constant pool into the loop?
And as Sjoerd says, I found out that DLS doesn't really do anything and can be ignored (all the "smarts" are in the LE). WLS obviously does more, and a DLSTP is important for tail predication. But if the count is already in lr, we needn't emit the DLS.
> How confident are you that this is always valid?
What are our options here?
I guess one of them is making sure literal pools won't get placed inside loops, or reordering this:
addPass(createARMFinalizeLoopsPass()); addPass(createARMConstantIslandPass());
or is there a reason we can't do this?
I will try to reorder the final passes. I hope that I can change the size of the pseudo instructions to be pessimistically big enough to be a cmp and br. I imagine that TTI will have to try to calculate the size, or at least the amount of live variables, so that these loops don't cause unnecessary register pressure and actually slow things down because of spills.
- Renamed all the things to low-overhead loops.
- Used the arm.space intrinsic and added another test for the edge case.
- Reversed the order of the search for LoopEnd and LoopDec, breaking early if possible.
- Switched the order of constant island and low-overhead loops.
I don't know of any reason why this way round wouldn't work, so long as we tell constant island pass that the size of the instructions is large enough. It might be a little inefficient at times? I think the version of this on the branch put the handling _into_ constant island pass, but that version worked a little differently.
This looks good to me. I.e., the approach of expanding the pseudos as late as possible, and the availability of size estimation in TTI for these pseudos, seems the right approach to me. I hope that when overestimate the size, if necessary, that this will be always be safe.
So I am happy with this, if Dave is happy with this too.
test/Transforms/HardwareLoops/ARM/massive.mir | ||
---|---|---|
47 | a proper nit: just a minor clean up, but we don't need these comments on the functions | |
57 | and this stack protector stuff | |
63 | and these attributes |
This looks pretty neat to me. I have some comments that could be done in followups (as in, if its broken, we can fix it later).
- It may be better to put this into constant island pass instead, to use its iterative nature and have it remain as the only place that deals with sizes.
- Can you put a nice long comment into somewhere like the start of ARMFinalizeHardwareLoops explaining things like what the pseudos do and why they are needed (they are for registry allocation, essentially?), and what assumptions this is making about them remaining in specific blocks.
lib/Target/ARM/ARMInstrThumb2.td | ||
---|---|---|
5193 | 12 bytes for something that should usually take 4 seems overly pessimistic to me, and may pessimise other optimisations. Can we get this smaller somehow? I would expect the fallback to probably be (subs; bne) ideally. | |
lib/Target/ARM/ARMLowOverheadLoops.cpp | ||
76 | How expensive is this? We might as well not do it for cores that won't have low overhead loops. | |
96 | These don't really seem to add much to me :) | |
135 | Can you explain this a little? Is it for correctness or performance? Because we can't merge the two instructions? | |
148 | I think this should give a better error message in release builds. report_fatal_error or similar. Otherwise It may just miscompile. It should never happen, as far as I understand? | |
165 | Formatting | |
173 | Format | |
201 | This is just the same call, with one parameter different? |
I will add some comments, but I really don't think this belongs in constant islands. This doesn't have to worry about iterative changes, the loop size may only vary by 8 bytes, which is nothing compared to the 4KB that we need to concern ourselves with. Plus this is a very specific pass, especially once we start having to handle the tail predicated loops!
lib/Target/ARM/ARMInstrThumb2.td | ||
---|---|---|
5193 | 12 is what this implementation will produce though in the fallback case. When we use subs, we can reduce it. | |
lib/Target/ARM/ARMLowOverheadLoops.cpp | ||
76 | Good point! I'll add a check and an exit. | |
96 | All in good time... while and the vector ones will follow. But yes, LoopDec can go. | |
135 | Sure, I'll add a comment. To summarise though, its because the decremented value of LR has probably been stored to the stack - but this decrement isn't really going to happen until the end of the block, where we can't spill. If we want the value of LR to be on the stack, we'd have to perform a manual sub. Added to this that we're probably reloading LR at the bottom of the loop for LE, we'd have to either perform an add before LE or use the other form of LE that doesn't perform the decrement. | |
148 | cheers! | |
201 | Yes, i know it looks awkward... I'll try to see if I can make it nicer. |
Its not really about the 4KB range of a LE instruction, I agree that's not super important, but the much smaller range of a cbz for example. If we are over-estimating the size of the loop in constant island pass, we may loose out on other optimisations we would otherwise have performed.
lib/Target/ARM/ARMLowOverheadLoops.cpp | ||
---|---|---|
76 | Also, I'm not sure it's doing everything it should, and might not be calculating offsets. Can you add a test with multiple loop bbs that together go over the limit? |
lib/Target/ARM/ARMLowOverheadLoops.cpp | ||
---|---|---|
76 | sounds good. |
nit: can this be simplified using getIntrinsicID()?