A new basic block ordering improving existing MachineBlockPlacement.

The algorithm tries to find a layout of nodes (basic blocks) of a given CFG

optimizing jump locality and thus processor I-cache utilization. This is

achieved via increasing the number of fall-through jumps and co-locating

frequently executed nodes together. The name follows the underlying

optimization problem, Extended-TSP, which is a generalization of classical

(maximum) Traveling Salesmen Problem.

The algorithm is a greedy heuristic that works with chains (ordered lists)

of basic blocks. Initially all chains are isolated basic blocks. On every

iteration, we pick a pair of chains whose merging yields the biggest increase

in the ExtTSP value, which models how i-cache "friendly" a specific chain is.

A pair of chains giving the maximum gain is merged into a new chain. The

procedure stops when there is only one chain left, or when merging does not

increase ExtTSP. In the latter case, the remaining chains are sorted by

density in decreasing order.

An important aspect is the way two chains are merged. Unlike earlier

algorithms (e.g., based on the approach of Pettis-Hansen), two

chains, X and Y, are first split into three, X1, X2, and Y. Then we

consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,

X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.

This improves the quality of the final result (the search space is larger)

while keeping the implementation sufficiently fast.

**Perf impact:**

The perf improvement over the existing MachineBlockPlacement is in the range

of 0.5%-1.5% for medium-sized binaries (e.g, clang/gcc), depending on the PGO

mode (tested with various options of LLVM_BUILD_INSTRUMENTED and AutoFDO,

CSSPGO).

For large-scale fully-optimized binaries running in production, we measure 0.2%-0.8%

speedup (with AutoFDO/CSSPGO but excluding post-link optimizers).

For smaller (not front-end bound) binaries, we do not expect regressions; for SPEC17:

- 508.namd_r: 8.96+/-3.89 (win)
- 602.gcc: -0.93+/-0.47 (regression)
- 623.xalancbmk: 2.12+/-0.84 (win)
- 625.x264: 1.41+/-0.7 (win)
- other binaries are flat.

**Performance of the alg:**

We haven't seen a change of the build time for large binaries.