Page MenuHomePhabricator

[MBP] Move a latch block with conditional exit and multi predecessors to top of loop
ClosedPublic

Authored by Carrot on Feb 13 2018, 2:15 PM.

Details

Summary

Current findBestLoopTop can find and move one kind of block to top, a latch block has one successor. Another common case is:

  • a latch block
  • it has two successors, one is loop header, another is exit
  • it has more than one predecessors

If it is below one of its predecessors P, only P can fall through to it, all other predecessors need a jump to it, and another conditional jump to loop header. If it is moved before loop header, all its predecessors jump to it, then fall through to loop header. So all its predecessors except P can reduce one taken branch.

Diff Detail

Repository
rL LLVM

Event Timeline

Carrot created this revision.Feb 13 2018, 2:15 PM

Please add a few new test cases first.

Carrot updated this revision to Diff 134325.Feb 14 2018, 2:56 PM

Add a new test case.

davidxl added inline comments.Feb 15 2018, 9:52 AM
test/CodeGen/X86/move_latch_to_loop_top.ll
38 ↗(On Diff #134325)

This is not the optimal rotation for the loop. The optimal rotation should be

true
latch
header
false

in which case there is one more fall through from true to latch.

I think the enhancement should shoot for getting the optimal layout.

Carrot updated this revision to Diff 135512.Feb 22 2018, 1:53 PM

Add more code to iteratively find new block that can be moved before old loop top block, and reduce taken branches. Now it can layout the test case optimally as David commented.

Carrot marked an inline comment as done.Feb 22 2018, 1:54 PM
davidxl added inline comments.Feb 26 2018, 9:20 AM
lib/CodeGen/MachineBlockPlacement.cpp
1758 ↗(On Diff #135512)

BottomBlock --> bottom block BB

1759 ↗(On Diff #135512)

the following case.

1768 ↗(On Diff #135512)

Add some arrows in the graph and explain more in text why it is not beneficial

1792 ↗(On Diff #135512)

Add more comment about the intention of this method:

The method checks if the reduced taken branches is less than increased taken branch (to the exit block when rotation happens). If yes, it returns true.

1822 ↗(On Diff #135512)

--> and it has more than one predecessors.

1828 ↗(On Diff #135512)

Add comment that the reduced taken branches will be compared with the increased taken branch to the loop exit block.

1864 ↗(On Diff #135512)

Why is this check needed?

1880 ↗(On Diff #135512)

It makes assumption that there is an existing fall through to the exit bb. If not, it is always beneficial to rotate.

test/CodeGen/X86/move_latch_to_loop_top.ll
74 ↗(On Diff #135512)

perhaps draw a diagram here.

Carrot updated this revision to Diff 136143.Feb 27 2018, 1:26 PM
Carrot marked 7 inline comments as done.
Carrot added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1864 ↗(On Diff #135512)

Only two patterns are handled, none of the cases has more than 2 successors.

1880 ↗(On Diff #135512)

Yes.
But layout the loop body and rotate loop occur after this function, it is difficult to guess which BB will be put at the bottom of the loop. So to be conservative, assume the current candidate BB can be layout at the bottom, and fall through to the exit bb.

Carrot updated this revision to Diff 201075.May 23 2019, 2:50 PM

Update the patch to the current code base, also make two improvements:

  • Do more precise cost analysis based on the increased number of fallthrough.
  • Also enable findBestLoopTop when profile information is available.
davidxl added inline comments.May 24 2019, 4:19 PM
lib/CodeGen/MachineBlockPlacement.cpp
1959 ↗(On Diff #201075)

Move this checked into the caller function

1991 ↗(On Diff #201075)

This check is not necessary. See example at https://reviews.llvm.org/F8921829

If B6 is selected as the new loop top, the fall through frequencies can be increased from 99 to 150.

Carrot updated this revision to Diff 202299.May 30 2019, 2:04 PM
Carrot marked 2 inline comments as done.
Carrot edited the summary of this revision. (Show Details)

For all the test cases update, please also validate if they make sense or not if possible.

lib/CodeGen/MachineBlockPlacement.cpp
1808 ↗(On Diff #202299)

This part looks complete. Are all the paths covered by some tests?

1918 ↗(On Diff #202299)

The logic looks correct. Are all cases covered by tests?

1977 ↗(On Diff #202299)

The comment should be fixed.

1983 ↗(On Diff #202299)

more generally, you want largest pred edge frequency to be smaller than the new back edge frequency.

1985 ↗(On Diff #202299)

The 'else' here seems wrong. It needs to fall through to do the same check.

Carrot updated this revision to Diff 203015.Jun 4 2019, 1:54 PM
Carrot marked 6 inline comments as done.
Carrot added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1808 ↗(On Diff #202299)

New tests added.

1918 ↗(On Diff #202299)

A lot of existing test cases are impacted by this patch, and this function is intensively tested by those test cases.

1983 ↗(On Diff #202299)

Again it is a heuristic before FallThroughGains is implemented. Now it can be removed.

1991 ↗(On Diff #201075)

You are right. This code was used as heuristic when I didn't quantitatively compute the number of fallthrough. Now that we have more precise cost model, it should be removed.

Carrot added a comment.Jun 7 2019, 3:31 PM

Some analysis of the test case changes.

test/CodeGen/AArch64/neg-imm.ll
The control flow graph looks like:

    entry
      |
      V
-->for.body
|     |\
|     | \
|     |  \
|     | if.then3
|     |  /
|     | /
|     |/
---for.inc
      |
      V
for.cond.cleanup

The original layout is:

entry
for.body
if.then3
for.inc
for.cond.cleanup

For each loop iteration there are two taken branches.

The new layout is:

entry
for.inc
for.body
if.then3
for.cond.cleanup

For each loop iterations there is one taken branch.

test/CodeGen/AMDGPU/optimize-negated-cond.ll
The control flow of function @negated_cond_dominated_blocks is:

  bb
   |
   V
  bb4 <--
  /\    |
 /  \   |
bb5 bb6 |
 \  /   |
  \/    |
  bb7 ---
   |
   V
  bb3

The original layout is:

bb
bb4
bb6
bb5
bb7
bb3

For each loop iteration there are two taken branches.

New layout is

bb
bb6
bb7
bb4
bb5
bb3

For each loop iterations there is one taken branch.

test/CodeGen/Hexagon/redundant-branching2.ll
It is also diamond shaped loop,

   |
  b3 <--
  /\   |
 /  \  |
b4  b5 |
 \  /  |
  \/   |
  b6----
   |

Original layout is

b3
b4
b5
b6

New layout is

b5
b6
b3
b4

The new layout can reduce 1 taken branch per iteration.

test/CodeGen/X86/widen_arith-*.ll
The control flow graph is:

    entry
      |
      V
-->forcond ---
|     |      |
|     V      |
---forbody   |
             |
             V
         afterfor

Original layout is:

entry
forbody
forcond
afterfor

New layout is:

entry
forcond
forbody
afterfor

It shouldn't have performance impact, but the new layout is more natrual, more readable.

davidxl accepted this revision.Jun 10 2019, 11:05 AM

lgtm.

Warning. With static profiling, the layout strategy based on the 'precise' cost model may be off. If for some reason this causes issues later, the change should be guarded with 'hasProfile' check.

This revision is now accepted and ready to land.Jun 10 2019, 11:05 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2019, 4:05 PM

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.