This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Reduce code size by running tail merging in MBP
ClosedPublic

Authored by haicheng on May 15 2016, 2:22 PM.

Details

Summary

The code layout that TailMerging (inside BranchFolding) works on is not the final layout optimized based on the branch probability. Generally, after BlockPlacement, many new merging opportunities emerge. My motivation example (also the test case) is like this in ARM assembly

     b       L5
L1:
     mov     w9, w8
     b       L5
L2:
     mov     w9, w8
     b       L5
L3:
     mov     w9, w8
     b       L5
L4:
     mov     w9, w8
L5:
      ldr     x8, [x21,#624]

L1-L5 can only be branched into. The example can be reduced to

     b       L5
L4:
     mov     w9, w8
L5:
   ldr     x8, [x21,#624]

The predecessors of L1-L4 now all branch into L4. Branch Folding should be able to simplify the code in the tail-merge phase, but it fails. In this example, the tiny MBBs (L1-L4) in the Branch Folding pass are at the places where they are fallthroughs of their individual predecessors. Merging L1-L4 in the Branch Folding requires inserting extra unconditional branches which makes Tail Merging give up. After MBP, L1-L4 are no long fallthroughs and can be easily merged as shown in the example.

This patch calls Tail Merging when it finds MBP changes the branches and calls MBP again if Tail Merging merges anything. Tail merging updates MachineLoopInfo and MachineBlockFreqInfo so that MBP can use them later. The table below shows the number of instructions removed and the impact to the performance in a AArch64 device (plus is improvement) when running SPEC2006.

perf (%)static instruction count
INT
astar-0.49-7
bzip20.40-110
gcc-0.11-13,006
gobmk1.48-1,716
h264ref0.47-684
hmmer-0.32-391
libquantum0.90-4
mcf-0.14-4
omnetpp-0.58-1,980
perlbench1.53-4,176
sjeng-0.77-338
xalancbmk-0.55-4,183
FLOAT
soplex-0.22-395
dealII0.34-186
milc-0.16-34
namd-0.18-104
povray2.07-1,785
sphinx3-0.11-112

This patch also depends on three other trivial patches: D20177 (make it possible to optimize the branch directions after tail merging), D20184 (make it possible to use updated MachineBlockFreqInfo in MBP), and D19955 (make it possible to know the branches are updated by MBP or not)

One test case (arm-and-tst-peephole.ll) is slightly rewritten by reversing one branch condition. The probability of both branch directions are 50/50 so I think the modification is okay.

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 57307.May 15 2016, 2:22 PM
haicheng retitled this revision from to [MBP] Reduce code size by running tail merging in MBP.
haicheng updated this object.
haicheng set the repository for this revision to rL LLVM.
haicheng updated this object.May 15 2016, 2:42 PM
haicheng added reviewers: gberry, mssimpso, mcrosier.
haicheng added subscribers: hfinkel, iteratee, deadalnix and 4 others.

I've noticed this bad pattern in my code, thanks for tackling this.

lib/CodeGen/MachineBlockPlacement.cpp
1350 ↗(On Diff #57307)

This is a bit confusing as this can return false in cases where basic blocks are scrambled but no branch was updated. Not that it is incorrect, but it goes against conventions. Also are we sure that this is the only cases that can crate opportunity for the optimization to take place ? Have you tried to just run the optimization every time ? I'd be interested to know if there is any changes.

1467 ↗(On Diff #57307)

Maybe you could use a unique pointer here.

1479 ↗(On Diff #57307)

Out of curiosity, do you have example of hardware doing this ?

1496 ↗(On Diff #57307)

Worth looping maybe ?

1518 ↗(On Diff #57307)

Remove once you use unique ptr.

majnemer added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1479 ↗(On Diff #57307)

WebAssembly IIRC.

haicheng marked an inline comment as done.May 16 2016, 12:55 PM
haicheng added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1479 ↗(On Diff #57307)

I know old AMD GPU had this requirement. Its IL has instructions like "if_xxx" and "endif".

sunfish added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1479 ↗(On Diff #57307)

WebAssembly has flexible control flow and is ok with tail merging and most other optimizations, so we're no longer using the requiresStructuredCFG() hook.

haicheng added inline comments.May 18 2016, 2:54 PM
lib/CodeGen/MachineBlockPlacement.cpp
1350 ↗(On Diff #57307)

Thank you very much, Amaury. I did some research about this. In short, running the optimization every time can capture more cases and I think I will do it in my next version. Below is the detail of my findings

  1. Running the optimization every time captures five more cases in spec2000/2006. The same reason behind these cases is that the CFG is changed after the last time we call BranchFolding. These new opportunities exist before we start MBP.
  2. You are right that it is possible to change the layout without changing any branch. I managed to create such a case, but there is no new tail merging opportunity in this case.
  3. On average across spec2006, 74% of MFs need to update at least one branch to change the layout. The rest either does not need to change the layout or can change the layout without modifying any branch.
  4. I think updateTerminator() can capture all the changes of fallthrough MBBs which is the only thing interesting to me. In D19955, if updateTerminator() finds its layout successor is inconsistent with the current branch, it always returns true. So, if running tail merging every time is too much, my current approach can still find all the cases caused by the MBP (I definitely need to make some changes to comply with the convention).

I will look into your looping suggestion as next.

gberry resigned from this revision.May 23 2016, 1:45 PM
gberry removed a reviewer: gberry.

Removing myself as others seem to be handling review

Post Escha's comment.

lib/CodeGen/MachineBlockPlacement.cpp
1479 ↗(On Diff #57307)

Ours has that requirement. We don’t generally allow any CFG modification after structurization, which takes place in the pre-ISel phase. Generic block placement is fine though.

—escha

haicheng added inline comments.May 24 2016, 11:46 AM
lib/CodeGen/MachineBlockPlacement.cpp
1496 ↗(On Diff #57307)

I did some research on spec2006 about looping. Nine benchmarks can further remove instructions. gcc, omnetpp, and perlbench can remove 1000+ instructions. 83.5% MFs across spec2006 only need one iteration, 15.3% MFs need two iterations. The largest iteration number is five which is needed by 0.03% MFs.

It seems promising to pursue, but I think I will do the looping in the next patch after this one stays in the tree for a while.

Officially adding Amaury as a review.

haicheng updated this revision to Diff 58498.EditedMay 25 2016, 1:50 PM
haicheng edited edge metadata.
haicheng marked 10 inline comments as done.

Address Amaury's comments. Rebase, use unique_ptr, always call tail merging.

mcrosier accepted this revision.Jun 3 2016, 1:23 PM
mcrosier edited edge metadata.

Looks to be in good shape.

test/CodeGen/AArch64/tailmerging_in_mbp.ll
1 ↗(On Diff #58498)

Please add a space before RUN.

This revision is now accepted and ready to land.Jun 3 2016, 1:23 PM
deadalnix added inline comments.Jun 5 2016, 1:24 PM
lib/CodeGen/MachineBlockPlacement.cpp
1360 ↗(On Diff #58498)

Alright, that is some very good findings. I don't mind if this is taken care of now or in subsequent diffs, but it seems like there is some more juice to squeeze on that one.

This revision was automatically updated to reflect the committed changes.