This is an archive of the discontinued LLVM Phabricator instance.

[ARM][LowOverheadLoops] Insert loop start at end of block in more cases
AbandonedPublic

Authored by samtebbs on Oct 8 2020, 7:53 AM.

Details

Summary

This patch inserts the loop start instruction at the end of the block as long as the count register would have the same value and the insertion point kills it. This in turn allows tail-predication of more loops.

Diff Detail

Event Timeline

samtebbs created this revision.Oct 8 2020, 7:53 AM
samtebbs requested review of this revision.Oct 8 2020, 7:53 AM

Insert loop start at end of block in more cases

Hmm. Just a quick check - do we want that? I can see it improves some tail predication cases, that's good. But do we want that in general? The DLS instructions have a latency like any other, and earlier is better from that perspective. Or are we assuming that that latency will never matter into the LE instruction?

llvm/test/CodeGen/Thumb2/LowOverheadLoops/it-block-chain-store.mir
146–147 ↗(On Diff #296983)

These tests are hard to be sure, but it looks like the old version was doing something with lr but that might now be a different value?

Insert loop start at end of block in more cases

Hmm. Just a quick check - do we want that? I can see it improves some tail predication cases, that's good. But do we want that in general? The DLS instructions have a latency like any other, and earlier is better from that perspective. Or are we assuming that that latency will never matter into the LE instruction?

I wasn't so much concerned about latencies, but I was guessing that by moving the instruction, the goal is to get some more tail-predication "for free". That is suggesting to me that it is a work-around for the analysis of tail-predication not recognising that this is safe?

I wasn't so much concerned about latencies, but I was guessing that by moving the instruction, the goal is to get some more tail-predication "for free".

Indeed, this greatly simplifies the effort required to ensure that we can generate the LSTP version. And I can't see the latency of DLS having any real affect on the performance of a loop, especially not compared to all the other things that can go bad!

samparker added inline comments.Oct 9 2020, 1:59 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
1089

Can this logic be simplified? I don't think we should need to be concerned with LR, or at least the top-level conditional block should be guard with:

if (FirstNonTerminator == MBB->end() && RDA.isReachingDefLiveOut(Start, ARM::LR)

Then it looks like what we just need to check whether InsertPt kills CountReg?

samparker added inline comments.Oct 9 2020, 2:01 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
1089

... along with your existing check for RDA.hasSameReachingDef(LastMI, &*InsertPt, CountReg).

samtebbs updated this revision to Diff 297237.Oct 9 2020, 7:58 AM
samtebbs edited the summary of this revision. (Show Details)

Simplify logic and add check for LR being live-out.

samtebbs marked 3 inline comments as done.Oct 9 2020, 8:02 AM
samtebbs added inline comments.
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
1089

Agreed that it can be simplified. Thanks for pointing that out!

llvm/test/CodeGen/Thumb2/LowOverheadLoops/it-block-chain-store.mir
146–147 ↗(On Diff #296983)

Indeed, I missing an important check. Thanks.

Sorry, I didn't realise this had been updated.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
1083

I seem to remember being concerned that though we've already chosen an InsertPt, we then compare reaching defs against Start instead... In the case where InsertPt != Start, it means there's a mov lr, after Start and so RDA.isReachingDefLiveOut(Start, ARM::LR) shouldn't be true and RDA.hasSameReachingDef(Start, &*FirstNonTerminator, ARM::LR) may also not be true. Would you mind making another patch to compare against InsertPt and then rebase this on top of that?

1092

nit: your indentations are too wide.

samtebbs updated this revision to Diff 298617.Oct 16 2020, 6:31 AM
samtebbs marked 4 inline comments as done.

Rebase on top of https://reviews.llvm.org/D89549 and format code.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
1083

Good idea. done.

1092

Cheers

And I can't see the latency of DLS having any real affect on the performance of a loop, ..

Unfortunately I'm not sure that is always true exactly, but I'm not against the general idea of moving the loop start closer loop, especially for DLSTP. If it simplifies tail predication and makes it more reliable, then that's certainly a good thing.

I have reverted the patch that this depends on (38f625d0d1360b0) because it's had issues for a while now and we could do with it being correct so that we have a better place work work from. That way we can re-do this whilst having a higher confidence that we are not introducing bugs.

samtebbs abandoned this revision.Oct 30 2020, 8:25 AM

The code that this change depends on has been reverted so I will close this and re-visit it once those changes have been re-worked.