The code layout that TailMerging (inside BranchFolding) works on is not the final layout, which is optimized based on branch probabilities. Generally, after BlockPlacement, many new merging opportunities emerge. My motivating example (also the test case) looks like this in ARM assembly:
     b       L5
L1:
     mov     w9, w8
     b       L5
L2:
     mov     w9, w8
     b       L5
L3:
     mov     w9, w8
     b       L5
L4:
     mov     w9, w8
L5:
     ldr     x8, [x21,#624]
L1-L5 can only be branched into. The example can be reduced to
     b       L5
L4:
     mov     w9, w8
L5:
     ldr     x8, [x21,#624]
The predecessors of L1-L4 now all branch into L4. Branch Folding should be able to simplify the code in its tail-merge phase, but it fails. In this example, when the Branch Folding pass runs, the tiny MBBs (L1-L4) are placed where they are fallthroughs of their individual predecessors. Merging L1-L4 in Branch Folding would require inserting extra unconditional branches, which makes Tail Merging give up. After MachineBlockPlacement (MBP), L1-L4 are no longer fallthroughs and can easily be merged, as shown in the example.
This patch calls Tail Merging when it finds that MBP has changed the branches, and calls MBP again if Tail Merging merges anything. Tail Merging updates MachineLoopInfo and MachineBlockFreqInfo so that MBP can use them later. The table below shows the number of instructions removed and the impact on performance on an AArch64 device (positive is an improvement) when running SPEC2006.
| benchmark | perf (%) | static instruction count |
|---|---|---|
| INT | | |
| astar | -0.49 | -7 |
| bzip2 | 0.40 | -110 |
| gcc | -0.11 | -13,006 |
| gobmk | 1.48 | -1,716 |
| h264ref | 0.47 | -684 |
| hmmer | -0.32 | -391 |
| libquantum | 0.90 | -4 |
| mcf | -0.14 | -4 |
| omnetpp | -0.58 | -1,980 |
| perlbench | 1.53 | -4,176 |
| sjeng | -0.77 | -338 |
| xalancbmk | -0.55 | -4,183 |
| FLOAT | | |
| soplex | -0.22 | -395 |
| dealII | 0.34 | -186 |
| milc | -0.16 | -34 |
| namd | -0.18 | -104 |
| povray | 2.07 | -1,785 |
| sphinx3 | -0.11 | -112 |
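To make the flow described in the summary concrete (run tail merging only if placement changed branches, then rerun placement only if something was merged), here is a minimal, self-contained sketch on a toy block model. It is not the patch's code and uses no LLVM APIs: `ToyBlock`, `toyTailMerge`, `runPlacement`, `motivatingExample`, and the `placeBlocks` callback are all hypothetical names, and the "merge" only collapses whole blocks with identical instructions and the same target, which is a simplification of real tail merging.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a machine basic block (hypothetical, not an LLVM type):
// a unique label, its instructions, and the label control reaches next
// (by branch or by fallthrough).
struct ToyBlock {
  std::string Label;
  std::vector<std::string> Insts;
  std::string Target;
};

using BlockKey = std::pair<std::vector<std::string>, std::string>;

// Single-pass toy tail merge: blocks with identical instructions and the
// same target collapse into the last such block in layout order.
// Returns a map from each removed label to the surviving label.
std::map<std::string, std::string> toyTailMerge(std::vector<ToyBlock> &Blocks) {
  std::map<BlockKey, std::string> Survivor;
  for (const ToyBlock &B : Blocks)
    Survivor[BlockKey(B.Insts, B.Target)] = B.Label; // last duplicate wins

  std::map<std::string, std::string> Redirect;
  std::vector<ToyBlock> Kept;
  for (const ToyBlock &B : Blocks) {
    const std::string &Keep = Survivor[BlockKey(B.Insts, B.Target)];
    if (Keep == B.Label)
      Kept.push_back(B);
    else
      Redirect[B.Label] = Keep;
  }
  // Retarget surviving blocks that pointed at a removed block.
  for (ToyBlock &B : Kept) {
    auto It = Redirect.find(B.Target);
    if (It != Redirect.end())
      B.Target = It->second;
  }
  Blocks = std::move(Kept);
  return Redirect;
}

// Driver mirroring the described flow, assuming at most one rerun.
// `placeBlocks` is a hypothetical callback that lays out the blocks and
// reports whether it changed any branch. Returns how often it was called.
int runPlacement(std::vector<ToyBlock> &Blocks,
                 const std::function<bool(std::vector<ToyBlock> &)> &placeBlocks) {
  int PlacementRuns = 1;
  bool BranchesChanged = placeBlocks(Blocks);
  if (BranchesChanged && !toyTailMerge(Blocks).empty()) {
    placeBlocks(Blocks); // layout is stale after merging, so place again
    ++PlacementRuns;
  }
  return PlacementRuns;
}

// The motivating example: L1-L4 are identical and all reach L5.
std::vector<ToyBlock> motivatingExample() {
  std::vector<ToyBlock> Blocks;
  for (const char *Label : {"L1", "L2", "L3", "L4"})
    Blocks.push_back({Label, {"mov w9, w8"}, "L5"});
  Blocks.push_back({"L5", {"ldr x8, [x21,#624]"}, ""});
  return Blocks;
}
```

On the motivating example, a placement callback that reports a branch change leaves only L4 and L5, matching the reduced listing in the summary; a callback that reports no change skips the merge entirely, which is the case the gating in this patch is meant to keep cheap.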
This patch also depends on three other trivial patches: D20177 (makes it possible to optimize the branch directions after tail merging), D20184 (makes it possible to use the updated MachineBlockFreqInfo in MBP), and D19955 (makes it possible to know whether the branches were updated by MBP).
One test case (arm-and-tst-peephole.ll) is slightly rewritten by reversing one branch condition. The probabilities of the two branch directions are 50/50, so I think the modification is okay.
This is a bit confusing, as this can return false in cases where basic blocks are scrambled but no branch was updated. Not that it is incorrect, but it goes against conventions. Also, are we sure that these are the only cases that can create opportunities for the optimization to take place? Have you tried just running the optimization every time? I'd be interested to know if there are any changes.