A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of classical
(maximum) Traveling Salesmen Problem.
The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), two
chains, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,
X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
Perf impact:
The perf improvement over the existing MachineBlockPlacement is in the range
of 0.5%-1.5% for medium-sized binaries (e.g, clang/gcc), depending on the PGO
mode (tested with various options of LLVM_BUILD_INSTRUMENTED and AutoFDO,
CSSPGO).
For large-scale fully-optimized binaries running in production, we measure 0.2%-0.8%
speedup (with AutoFDO/CSSPGO but excluding post-link optimizers).
For smaller (not front-end bound) binaries, we do not expect regressions; for SPEC17:
- 508.namd_r: 8.96+/-3.89 (win)
- 602.gcc: -0.93+/-0.47 (regression)
- 623.xalancbmk: 2.12+/-0.84 (win)
- 625.x264: 1.41+/-0.7 (win)
- other binaries are flat.
Performance of the alg:
We haven't seen a change of the build time for large binaries.
Perhaps make the name clearer: createCFGChainExtTSP()?