Propeller code layout optimizations

This is part of the Propeller framework to do post link code layout optimizations. Please see the RFC here: https://groups.google.com/forum/#!msg/llvm-dev/ef3mKzAdJ7U/1shV64BYBAAJ and the detailed RFC doc here: https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

This is the last patch in the series of patches for propeller and depends on the previous propeller patches (D68062, D68063, and D68065)

This patch allows Propeller to performs intra-function basic block reordering based on the extended TSP model (https://arxiv.org/abs/1809.04676) and function reordering based on the HFSort (https://dl.acm.org/citation.cfm?id=3049858). It also adds the capability of splitting functions.

Specifically, Propeller.cpp in D68062 passes on the generated CFG profiles to routines in PropellerBBReordering.cpp and PropellerFuncReordering.cpp to compute a global ordering of all basic blocks.

The high-level description of the BB reordering and Function reordering algorithms are as follows.

**ExtTSP Basic Block Reordering Algorithm**

The ExtTSP (extended TSP) metric provides a score for every ordering of basic blocks in a function, by combining the gains from fall-throughs and short jumps.

Given an ordering of the basic blocks, for a function f, the ExtTSP score is computed as follows:

Sum_{e in f} Freq(e) * weight(e),

where weight(e) is computed as follows:

- 1
*if Distance(Src[e] , Sink[e]) = 0 (i.e. fallthrough)* - 0.1 * (1 - Distance(Src[e] , Sink[e]) /1024)
*if Src[e] < Sink[e] and 0 < Distance(Src[e], Sink[e]) < 1024 (i.e. short forward jump)* - 0.1 * (1 - Distance(Src[e] , Sink[e]) /640)
*if Src[e] > Sink[e] and 0 < Distance(Src[e], Sink[e]) < 640 (i.e. short backward jump)* - 0
*otherwise*

In short, it computes a weighted sum of frequencies of all edges in the control flow graph. Each edge gets its weight depending on whether the given ordering makes the edge a fallthrough, a short forward jump, or a short backward jump.

Although this problem is NP-hard like the regular TSP, an iterative greedy basic-block-chaining algorithm is used to find a close to optimal solution. This algorithm is described as follows.

Starting with one basic block sequence (BB chain) for every basic block, the algorithm iteratively joins BB chains together in order to maximize the extended TSP score of all the chains.

Initially, it finds all mutually-forced edges in the profiled CFG. These are all the edges which are -- based on the profile -- the only (executed) outgoing edge from their source node and the only (executed) incoming edges to their sink nodes. Next, the source and sink of all mutually-forced edges are attached together as fallthrough edges.

Then, at every iteration, the algorithm tries to merge a pair of BB chains which leads to the highest gain in the ExtTSP score. The algorithm extends the search space by considering splitting short (less than 128 bytes in binary size) BB chains into two chains and then merging these two chains with the other chain in four ways. After every merge, the new merge gains are updated. The algorithm repeats joining BB chains until no additional can be achieved. At this step, it sorts all the existing chains in decreasing order of their execution density, i.e., the total profiled frequency of the chain divided by its binary size.

**HFSort Function Reordering Algorithm**

The HFSort function reordering algorithm computes an optimal ordering. It goes through all functions in decreasing order of their execution density and for each one, finds its most likely caller (the function which calls it the most) and places the caller’s cluster right before the callee’s cluster. After all functions are considered, the HFSort algorithm orders all function clusters in decreasing order of their total execution density.

**Function Splitting**

After BB reordering and function reordering, for every function, Propeller goes through the BB chains in the computed BB order for that function, and for each chain, places the basic blocks of that chain at the hot or cold part of the layout depending on whether the chain’s total frequency is zero or not.

The ExtTSP BB reordering algorithm orders the BB chains decreasing order of their execution density. As a result, cold basic blocks are naturally placed at the end of the computed BB layout of every function.