This patch inserts the loop start instruction at the end of the block as long as the count register would have the same value and the insertion point kills it. This in turn allows tail-predication of more loops.
Insert loop start at end of block in more cases
Hmm. Just a quick check - do we want that? I can see it improves some tail predication cases, that's good. But do we want that in general? The DLS instructions have a latency like any other, and earlier is better from that perspective. Or are we assuming that that latency will never matter into the LE instruction?
|146–147 ↗||(On Diff #296983)|
These tests are hard to be sure, but it looks like the old version was doing something with lr but that might now be a different value?
I wasn't so much concerned about latencies, but I was guessing that by moving the instruction, the goal is to get some more tail-predication "for free". That is suggesting to me that it is a work-around for the analysis of tail-predication not recognising that this is safe?
I wasn't so much concerned about latencies, but I was guessing that by moving the instruction, the goal is to get some more tail-predication "for free".
Indeed, this greatly simplifies the effort required to ensure that we can generate the LSTP version. And I can't see the latency of DLS having any real affect on the performance of a loop, especially not compared to all the other things that can go bad!
Can this logic be simplified? I don't think we should need to be concerned with LR, or at least the top-level conditional block should be guard with:
if (FirstNonTerminator == MBB->end() && RDA.isReachingDefLiveOut(Start, ARM::LR)
Then it looks like what we just need to check whether InsertPt kills CountReg?
Sorry, I didn't realise this had been updated.
I seem to remember being concerned that though we've already chosen an InsertPt, we then compare reaching defs against Start instead... In the case where InsertPt != Start, it means there's a mov lr, after Start and so RDA.isReachingDefLiveOut(Start, ARM::LR) shouldn't be true and RDA.hasSameReachingDef(Start, &*FirstNonTerminator, ARM::LR) may also not be true. Would you mind making another patch to compare against InsertPt and then rebase this on top of that?
nit: your indentations are too wide.
And I can't see the latency of DLS having any real affect on the performance of a loop, ..
Unfortunately I'm not sure that is always true exactly, but I'm not against the general idea of moving the loop start closer loop, especially for DLSTP. If it simplifies tail predication and makes it more reliable, then that's certainly a good thing.
I have reverted the patch that this depends on (38f625d0d1360b0) because it's had issues for a while now and we could do with it being correct so that we have a better place work work from. That way we can re-do this whilst having a higher confidence that we are not introducing bugs.