This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add a pass that re-arranges blocks when there is a backwards WLS branch
ClosedPublic

Authored by samtebbs on Dec 1 2020, 5:21 AM.

Details

Summary

Blocks can be laid out such that a t2WhileLoopStart branches backwards. This is forbidden by the architecture and so it fails to be converted into a low-overhead loop. This new pass checks for these cases and moves the target block, fixing any fall-through that would then be broken.

This change improves the iterations per megahertz in some widely-used industrial benchmarks, with no regressions.

Diff Detail

Event Timeline

samtebbs created this revision.Dec 1 2020, 5:21 AM
samtebbs requested review of this revision.Dec 1 2020, 5:21 AM
samtebbs edited the summary of this revision. (Show Details)Dec 1 2020, 5:23 AM
samparker added inline comments.Dec 1 2020, 5:53 AM
llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
18 ↗(On Diff #308621)

left over from prototyping?

llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
29 ↗(On Diff #308621)

I would hope that MachineFunction provides a nicer way to iterate through the blocks? for (auto MBB : MF)?

33 ↗(On Diff #308621)

Visiting these in reverse order will be faster as we're looking at the terminators... so I guess you could just iterate using the terminators() method :)

37 ↗(On Diff #308621)

I'm not sure this is what you want to be checking... We need to ensure that Target has a positive branch offset from WhileLoopStart, but that doesn't mean it needs to be the layout successor - I would have even expected that the loop preheader would be the layout successor in many cases? Am I missing something here?

llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.h
7 ↗(On Diff #308621)

Do you need to include ARMSubtarget.h and Pass.h here?

samtebbs added inline comments.Dec 1 2020, 9:07 AM
llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
18 ↗(On Diff #308621)

Yes, thanks. I originally had the constant island pass depend on the new pass.

llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
29 ↗(On Diff #308621)

That does indeed work :)

33 ↗(On Diff #308621)

Ah yes, good idea.

37 ↗(On Diff #308621)

Aren't being a layout successor and positive offset linked? Is is possible to have one but not the other? E.g if the layout is A, B, C, D and the WLS is in B then C and D would have positive branch offsets and are also layout successors. I can't think of any situations in which the branch offset is positive but the target basic block comes before the WLS in the layout.

llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.h
7 ↗(On Diff #308621)

I think Pass.h is left over from the guide I followed and Subtarget.h is probably left over from prototyping as well. Thanks for spotting these.

samparker added inline comments.Dec 2 2020, 12:57 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
37 ↗(On Diff #308621)

Yes, there are linked but is that what you want? isLayoutSuccessor simply compares the iterators for the list of blocks, so in your example B->isLayoutSuccessor(D) == false. In basic block land, successor and predecessor will always means immediate child and parent in the list, whereas it sounds like you're expecting it to work like dominator information but what also wouldn't be suitable for explicitly checking target offsets.

samparker added inline comments.Dec 2 2020, 12:58 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
37 ↗(On Diff #308621)

I can't seem to edit my above comment...

successor and predecessor will always means immediate child and parent in the list

I meant CFG, not list :)

I have some high-level questions:

  • Are we fighting another optimisations here, some sort of loop-rotate or is this just MBP reshulffling blocks in a way that is not good for us?
  • I know we don't nest WLSTPs for profitability reasons, but in theory we could. Not sure we need to check this though. But in general my impression is that some more test can be added, but perhaps you were still working on that.
  • I am wondering if some cost-modeling is required. For example, if the iteration count is very low, would that change things?

I have some high-level questions:

  • Are we fighting another optimisations here, some sort of loop-rotate or is this just MBP reshulffling blocks in a way that is not good for us?

Yeah MBP moves the loop blocks (preheader, body etc.) closer together but unfortunately moves the WLSs branch target above the WLS. So it does some good things but also some bad things. After looking at MBP I thought it would be simpler to make a target pass rather than juggle things around in MBP that could end up affecting other targets.

  • I know we don't nest WLSTPs for profitability reasons, but in theory we could. Not sure we need to check this though. But in general my impression is that some more test can be added, but perhaps you were still working on that.

I wasn't sure if it was possible or not, but thought I'd add the checks in there just in case. I'm OK with removing them if they definitely are unnecessary. I'm also working on testing a nested while loop, can you think of any other tests I should add?

  • I am wondering if some cost-modeling is required. For example, if the iteration count is very low, would that change things?

That is a good idea. It would be good to get an idea of the cycles saved by converting the while loop into a LOL compared to those needed by the branches that replace fallthrough.

llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
37 ↗(On Diff #308621)

I had definitely misinterpreted what being a layout successor means then! Thanks for raising this, I'll have a go at using offsets.

samtebbs updated this revision to Diff 308934.Dec 2 2020, 4:29 AM

Use BBUtils to get BB offsets, loop over terminators instead of instructions and other clean-up.

samparker added inline comments.Dec 2 2020, 5:21 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
49 ↗(On Diff #308934)

I think moving a block with invalidate your BBUtils info, but I'm hoping adjustBBOffsetsAfter can be used instead of recomputing everthing. We should also have a test which exercises multiple block rearrangements.

61 ↗(On Diff #308934)

Looks like you're duplicating work here:

  • iterating through BB twice to find t2WhileLoopStart
  • extra call to blockIsBefore?
88 ↗(On Diff #308934)

Keep the code together and have fixFallThrough as a lambda?

samtebbs added inline comments.Dec 2 2020, 7:00 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
61 ↗(On Diff #308934)

Looks like you're duplicating work here:

  • iterating through BB twice to find t2WhileLoopStart

Reckon I should be caching where I find a t2WhileLoopStart to avoid looping where possible? I assume you're referring to the single loop in blockIsBefore and the single loop in runOnMachineFunction.

  • extra call to blockIsBefore?

The reason why there are two calls to blockIsBefore is because we don't care if moving the block results in a backwards branch if it is a backwards branch even without the move.

88 ↗(On Diff #308934)

Sounds good to me!

samparker added inline comments.Dec 2 2020, 7:18 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
61 ↗(On Diff #308934)

I'd just try moving this logic into the main loop in runOnMachineFunction, should still be perfectly readable.

samtebbs updated this revision to Diff 308977.Dec 2 2020, 8:40 AM

Inline canMoveBasicBlock, update BB offsets and make fixFallthrough a lambda.

samparker added inline comments.Dec 3 2020, 2:49 AM
llvm/lib/Target/ARM/ARMLowOverheadLoopBlockPlacement.cpp
45 ↗(On Diff #308977)

Thanks for inlining this, I can now read and understand it better. I'm wondering whether it would it be simpler to move BB instead of Target? Or at least we should try to figure out which one is best to move as ideally we want the WhileLoopStart parent block to fallthrough to the preheader without the need for a branch. If the most of the preheaders already fallthrough to the loop header, then I think it makes sense to move the WhileLoopStart block to just before the preheader.

samtebbs updated this revision to Diff 309289.Dec 3 2020, 9:46 AM

Fix tests

Hello

Can we rename this pass? ARMLowOverheadLoopBlockPlacement is very long and like we discussed previously the idea was not to just handle low overhead loop instructions but other arm specific branches like CBZ. (The original idea was to add this to MachineBlockPlacement, but it doesn't know about branch ranges for one, making it difficult to add there). I think a fixup pass makes sense so long as we don't break the optimizations it has done. How about naming it something like ARMBlockPlacementPass or ARMBlockAjustmentPass or something like that?

Can you also try and make this look more like the other passes in the arm backend? I don't think that the header file is needed for example, and the initialize.. method is usually added in other places.

I would also put this before the outliner in the pass pipeline. Sometime soon after the existing MachineBlockPlacement.

samtebbs updated this revision to Diff 310166.Dec 8 2020, 6:10 AM

Rename and move earlier in the pipeline.

Looking good in general. I think we would need some perf numbers to see if we haven't missed something.
Some more remarks inlined.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
2

Think we need the LLVM copyright header here.

66

I think we are duplicating work here. I.e., we are visiting a block and terminators that we will visit later or have seen already. It shouldn't be a big deal, but the code is also a bit of a copy-paste of the loop starting at line 53.
Can we not do one scan of the blocks and record a map of WLSTP to blocks or something like that?

106

Can there be multiple terminators?

samtebbs edited the summary of this revision. (Show Details)Dec 10 2020, 2:41 AM
samtebbs edited the summary of this revision. (Show Details)
dmgreen added inline comments.Dec 14 2020, 2:32 AM
llvm/lib/Target/ARM/ARM.h
63

Add one of these for the new pass, and call it in LLVMInitializeARMTarget.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
37

I've not seen many things use RegisterPass before. Can this use the INITIALIZE_PASS like most other passes do?

llvm/lib/Target/ARM/ARMTargetMachine.cpp
551–553

Add the pass here I think. That way we will have better size estimates from shrinking T2 instructions (and IT blocks should not be a problem for the pass, as far as I understand).

samtebbs updated this revision to Diff 311583.Dec 14 2020, 7:16 AM
samtebbs marked 2 inline comments as done.

Move pass order and add copyright header

samtebbs added inline comments.Dec 14 2020, 7:44 AM
llvm/lib/Target/ARM/ARM.h
63

What is this needed for? It works as-is.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
37

The skeleton I followed is for the new pass manager.

66

I did think about that, but didn't think that looping over a list that I've gotten from a map would be more efficient than looping over a list that I've gotten from a MachineBasicBlock (i.e. WLSMap.get(Target) vs Target->getTerminators()). Perhaps I'm not understanding exactly what you want the map to represent.

106

There can indeed. It's a bit of a misnomer in my opinion. As far as I know an unconditional branch and conditional branch could both be terminators in a block.

llvm/lib/Target/ARM/ARMTargetMachine.cpp
551–553

Good idea.

samtebbs updated this revision to Diff 311604.Dec 14 2020, 8:32 AM

Fix incorrect branch opcodes

SjoerdMeijer added inline comments.Dec 14 2020, 8:36 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
66

Okay, yeah, I tried to read this code a bit too quick and was vague.
Now reading this a bit more carefully, let me ask a more concrete question on line 77.

78

Are we here actually not analysing a loop nest? I.e., look for t2WhileLoopStart within a t2WhileLoopStart?

If so, I am not sure that is necessary, because as far as I know that's not something we support for performance reasons? Then, we could perhaps just add an assert on getLoopDepth on the MachineLoop, and just don't need the loop on 79. Maybe I'm overthinking this, but it's just the repetition that makes me wonder. Also, this case doesn't seem to be covered in the tests?

samtebbs added inline comments.Dec 14 2020, 9:51 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
78

Indeed it is checking for a nested while loop. If that situation won't occur then I don't mind removing it, or replacing it with an assertion.

It is tested by the backwards_branch_nested function in the mir test file.

dmgreen added inline comments.Dec 15 2020, 12:52 AM
llvm/lib/Target/ARM/ARM.h
63

It's for when the pass has dependencies upon other passes. My understanding is that If this pass needs to know about things like MachineLoop or dominator trees (which is likely in the long run), it will need to have an initialize method that gets called in the right places.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
37

RegisterPass doesn't seem to be the new pass manager (and the backend does not support the new pass manager yet). It seems very old and isn't used in a lot of other places.

SjoerdMeijer added inline comments.Dec 15 2020, 1:23 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
78

Ah yeah, thanks, I missed that test but see it now.
Keeping the test is a good thing, but looks like you need to think how you want to support nesting.
The current implementation is incomplete/not very generic as it only looks 2-deep.
So, all together, best and easiest to get rid of it I think (and probably a return is nicer than an assert).

If I am not mistaken I think some more tests are required with different top level loop nests (1-deep), testing the different combinations of forward/backward branches to see how reshuffling blocks of 2 loop nests interact.

samtebbs added inline comments.Dec 15 2020, 3:24 AM
llvm/lib/Target/ARM/ARM.h
63

Sounds good, I'll change that.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
37

Oh I see, I followed the pass tutorial for LLVM 12 so this should be up to date, but I can change things if that would be better.

78

That's a good idea. I can use machine loop info to get some more generic information about the loop structure.

samtebbs updated this revision to Diff 311879.Dec 15 2020, 5:43 AM

Use INITIALIZE_PASS technique

samtebbs marked 4 inline comments as done.Dec 15 2020, 5:44 AM
samtebbs updated this revision to Diff 314927.Jan 6 2021, 9:46 AM

Use MachineLoopInfo. I have removed the fallthrough fixing test since it seems to be impossible to create the needed fallthrough and keep it as a valid MachineLoop.

SjoerdMeijer added inline comments.Jan 6 2021, 12:35 PM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
75

yeah, what the linter says.

76

I forgot what the exact difference is between getLoopPredecessor and getLoopPreheader, but I guess calling BB Preheader will help a bit here.

90

Sorry if I have missed something, but I am afraid I am still / again confused by this "nested loop analysis" business here.
I think one of my previous remarks still holds? I.e., this lacks some generality because we look only at the next loop down in the loop hierarchy? So, if we support nested hardware-loops, and let's assume a 3-deep loop nest and then this will fail. But again, now I am unsure again if we supported nested loops.
But if I assume we do support nested loops, then I don't see the problem, why can't we actually move some blocks? In other words, I think I am missing a proper problem description, perhaps because I haven't thought long enough about this.

llvm/test/CodeGen/Thumb2/block-placement.mir
64

Correct me if I am wrong, but the loops in this test are not nested? Looks like 2 loops at the same loop nest level?

samtebbs added inline comments.Jan 7 2021, 6:09 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
76

Good idea.

SjoerdMeijer added a comment.EditedJan 11 2021, 2:35 AM

Here's my "pen-and-paper exercise" of the problem that we are solving here, and lets start with some examples.
In these examples, empty blocks means there are no instructions in it that are relevant to this problem.

bb1:
bb2:
  WLS bb1
bb3:

In this first example above, we want to reorder blocks so that we get bb2 -> bb1 -> bb3. This is the easiest case, because there's nothing in bb1 that interacts.
Interacting here means moving a block that also contains a WLS:

bb1:
  WLS xyz
bb2:
  WLS bb1
bb3:

If we want to move `bb1, we now find a WLS in the block that we are moving:

  • the WLS can be forward branch:
    • if it branches to bb2, then we would create a backward branch. So, we would fix one, and regress one, and don't really win anything, unless the one that we fix has is "more important to fix".
    • if it branches to some block after bb3, then moving bb1 to after bb2, we create 2 forward branches, and all is okay.
  • the WLS can be a backward branch:
    • if we move bb1, its backward branch remains backward, so that doesn't regress and we fix another and is thus a win.

If we increase the first example:

bb1:
bb2
bb3:
  WLS bb1
bb4:

then nothing changes and we can safely move bb1.

bb1:
bb2:
  WLS xyz
bb3:
  WLS bb1
bb4:

If we move bb1, but there is a forward/backward branch in next block, bb2 in this case:

  • if it branches back to bb1, we create a forward one, so we improve,
  • if branches forward, it will remain a forward branch.

So, I don't think we need consider WLS in blocks between the block that we are moving and its new place.

I think/hope this captures the decision making we need to do regarding the WLS instructions.

But I think this is not yet the complete picture; WLS is not the only instruction that need to check, as we have also the LE.
As a first example:

bb1:
bb2:
  LE bb1
bb3:
  WLS bb1
bb4:

If we would like move bb1 to create a forward branch for the WLS, but bb1 is a target for a LE in a successor block, then is that what we want? What are the restrictions for the LE instructions?

Summarising:

  • I think we need to analyse all blocks between the block that we would like to move to its new place.
  • We need to identify and analyse instructions in them that interacts with this, which are the WLS, and the LE?
  • Looks like there are cases where we can improve one WLS, and regress one, so that we need some decision making to choose (could be done in a follow up).
  • Terminology:
    • is "nesting" important? Or can we just forget it?
    • I had to remind myself again about the WLS semantics, i.e. WLS bb1 means something like: CMP, BCC, B, and the bb1 is the loop exit block. So, instead of calling this target, let's call this something with "exit".

Adding to my previous comment:

if it helps and triggers on your motivating example, a very first MVP of this new pass moves a block if i) the block that we are moving, and ii) any laid out succours blocks, don't contain WLS (or LE?).

samtebbs updated this revision to Diff 315754.Jan 11 2021, 4:20 AM

Change the validation approach back to what it was before

Thanks for that Sjoerd, I agree with your examples and comments. I have changed the approach back to check for new backwards WLSs as before.

SjoerdMeijer added inline comments.Jan 11 2021, 6:23 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
74

Can you describe the algorithm here in comments?

81

Nit: think my preference would be to call this "LoopExit", just to to avoid any potential confusion that this jumps to the loop header.

96

Nit: would be best if we could rename Target2 to something more descriptive.
But perhaps LoopExit2 is good enough, but then I think an example would help me here:

bb1:       // LoopExit
  WLS xyz  // LoopExit2
bb2:       // Preheader
  WLS bb1
bb3:        // Header
97

To understand all cases for LoopExit2, can we first list them here:

  • the WLS can be forward branch:
    • if it branches to bb2, then we would create a backward branch. So, we would fix one, and regress one, and don't really win anything, unless the one that we fix has is "more important to fix".
    • if it branches to some block after bb3, then moving bb1 to after bb2, we create 2 forward branches, and all is okay.
  • the WLS can be a backward branch:
    • if we move bb1, its backward branch remains backward, so that doesn't regress and we fix another and is thus a win.

So, the one case that we recognise here is the first one. We could support this, i.e. perhaps could make a decision if this is a good thing to do, but don't at the moment. So we should put a TODO here explaining that.

106

And after this, we need to check for other instructions in blocks in between?

samtebbs updated this revision to Diff 315844.Jan 11 2021, 9:46 AM

Check for new forwards LE branches and address feedback

samtebbs marked 5 inline comments as done.Jan 11 2021, 9:49 AM
samtebbs added inline comments.
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
96

Good idea 👍🏻

97

I've added explanations like what you wrote. Let me know what you think.

106

Indeed. I forgot to add checks for LEs in between the loop exit and preheader, but have added them in the latest updated.

SjoerdMeijer added inline comments.Jan 11 2021, 12:01 PM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
124

I am a bit struggling with reading the indentation, but was wondering if we could return false here? In that case, do we need CanMove?

131

nit: newline after bb3 and LoopExit? I guess you want to have:

// bb1:  -LoopExit
// bb2:
//   LE bb1 - Terminator
// bb3:
// ...
137

Have we checked that LoopExit < Preheader?

samtebbs marked 3 inline comments as done.Jan 12 2021, 5:59 AM
samtebbs added inline comments.
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
124

I don't think we can return as that would stop looking at other loops in the function.

131

Ah yes. I blame clang-format.

137

Yep, on line 85.

samtebbs updated this revision to Diff 316083.Jan 12 2021, 7:06 AM

Fix erroneous format of a comment

SjoerdMeijer added inline comments.Jan 12 2021, 9:02 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
129

Nit: some punctuation required here, and also below.

213

Nit: some punctuation required in this comment and also the ones below.

214

I guess BBPrevious is a nullptr when the WLS is in the Entry block? Is this covered with a test? I guess that will never happen, so we will always fix this fall through, and never skip over which is then not covered by a test.
I am not even sure that makes sense, i.e. the WLS in the entry block, because then we would move the entry block. So, does this need to be an assert that BBprevious is not a nullptr?

samtebbs added inline comments.Jan 13 2021, 3:12 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
214

I could make it an assertion and make sure it isn't the entry block in the main loop.

samtebbs updated this revision to Diff 316401.Jan 13 2021, 7:56 AM

Punctuation and check for entry block

SjoerdMeijer accepted this revision.Jan 13 2021, 8:00 AM

Nice one, thanks, LGTM.

llvm/lib/Target/ARM/ARMBlockPlacement.cpp
218

Nit: we can remove BBPrevious from this check because we have asserted this is set?

This revision is now accepted and ready to land.Jan 13 2021, 8:00 AM
samtebbs added inline comments.Jan 13 2021, 8:25 AM
llvm/lib/Target/ARM/ARMBlockPlacement.cpp
218

Ah yes, well spotted :D

This revision was landed with ongoing or failed builds.Jan 13 2021, 9:33 AM
This revision was automatically updated to reflect the committed changes.