Page MenuHomePhabricator

Propeller code layout optimizations

Authored by rahmanl on Sep 25 2019, 11:48 PM.



Propeller code layout optimizations

This is part of the Propeller framework to do post link code layout optimizations. Please see the RFC here:!msg/llvm-dev/ef3mKzAdJ7U/1shV64BYBAAJ and the detailed RFC doc here:

This is the last patch in the series of patches for propeller and depends on the previous propeller patches (D68062, D68063, and D68065)

This patch allows Propeller to performs intra-function basic block reordering based on the extended TSP model ( and function reordering based on the HFSort ( It also adds the capability of splitting functions.

Specifically, Propeller.cpp in D68062 passes on the generated CFG profiles to routines in PropellerBBReordering.cpp and PropellerFuncReordering.cpp to compute a global ordering of all basic blocks.

The high-level description of the BB reordering and Function reordering algorithms are as follows.

ExtTSP Basic Block Reordering Algorithm

The ExtTSP (extended TSP) metric provides a score for every ordering of basic blocks in a function, by combining the gains from fall-throughs and short jumps.
Given an ordering of the basic blocks, for a function f, the ExtTSP score is computed as follows:

Sum_{e in f} Freq(e) * weight(e),
where weight(e) is computed as follows:

  • 1 if Distance(Src[e] , Sink[e]) = 0 (i.e. fallthrough)
  • 0.1 * (1 - Distance(Src[e] , Sink[e]) /1024) if Src[e] < Sink[e] and 0 < Distance(Src[e], Sink[e]) < 1024 (i.e. short forward jump)
  • 0.1 * (1 - Distance(Src[e] , Sink[e]) /640) if Src[e] > Sink[e] and 0 < Distance(Src[e], Sink[e]) < 640 (i.e. short backward jump)
  • 0 otherwise

In short, it computes a weighted sum of frequencies of all edges in the control flow graph. Each edge gets its weight depending on whether the given ordering makes the edge a fallthrough, a short forward jump, or a short backward jump.

Although this problem is NP-hard like the regular TSP, an iterative greedy basic-block-chaining algorithm is used to find a close to optimal solution. This algorithm is described as follows.

Starting with one basic block sequence (BB chain) for every basic block, the algorithm iteratively joins BB chains together in order to maximize the extended TSP score of all the chains.

Initially, it finds all mutually-forced edges in the profiled CFG. These are all the edges which are -- based on the profile -- the only (executed) outgoing edge from their source node and the only (executed) incoming edges to their sink nodes. Next, the source and sink of all mutually-forced edges are attached together as fallthrough edges.

Then, at every iteration, the algorithm tries to merge a pair of BB chains which leads to the highest gain in the ExtTSP score. The algorithm extends the search space by considering splitting short (less than 128 bytes in binary size) BB chains into two chains and then merging these two chains with the other chain in four ways. After every merge, the new merge gains are updated. The algorithm repeats joining BB chains until no additional can be achieved. At this step, it sorts all the existing chains in decreasing order of their execution density, i.e., the total profiled frequency of the chain divided by its binary size.

HFSort Function Reordering Algorithm

The HFSort function reordering algorithm computes an optimal ordering. It goes through all functions in decreasing order of their execution density and for each one, finds its most likely caller (the function which calls it the most) and places the caller’s cluster right before the callee’s cluster. After all functions are considered, the HFSort algorithm orders all function clusters in decreasing order of their total execution density.

Function Splitting

After BB reordering and function reordering, for every function, Propeller goes through the BB chains in the computed BB order for that function, and for each chain, places the basic blocks of that chain at the hot or cold part of the layout depending on whether the chain’s total frequency is zero or not.
The ExtTSP BB reordering algorithm orders the BB chains decreasing order of their execution density. As a result, cold basic blocks are naturally placed at the end of the computed BB layout of every function.

Diff Detail

Event Timeline

rahmanl created this revision.Sep 25 2019, 11:48 PM
Bigcheese added inline comments.Sep 26 2019, 5:35 PM
85 ↗(On Diff #221887)

Could you use unique_ptr instead? I don't see where these are deleted.

110 ↗(On Diff #221887)

Dead code.

116–117 ↗(On Diff #221887)

Dead code.

120 ↗(On Diff #221887)

This should just be a function call. Using operator overloading here gains very little, and costs losing a useful function name which had to be replaced with a comment.

ruiu added a comment.Sep 26 2019, 9:54 PM

Let me start with high-level comments:

  • Your code looks unnecessarily different from existing code from the coding style viewpoint. Please read other files and follow the same coding style.
  • Please run clang-format-diff to format this patch.
  • It lacks comments. In particular, it needs a file comment explaining the algorithm. Once you successfully explain the algorithm, each function/data should become more self-explanatory.
1 ↗(On Diff #221887)

Please add a file comment describing the algorithm. An example of it is Just like that file, please explain the algorithm.

25 ↗(On Diff #221887)

Define lld::toString(NodeChain &) instead of overloading operator<<.

1 ↗(On Diff #221887)

Add the standard file header comment.

12–18 ↗(On Diff #221887)

It is not permitted to import identifiers from std by the LLVM coding style.

20 ↗(On Diff #221887)

Instead of doing this, add #include "lld/Common/LLVM.h" just like other files do.

35 ↗(On Diff #221887)

We use // comments.

39 ↗(On Diff #221887)

Is there any reason you can't use std::vector? std::list is generally slower than std::vector for most operations, though inserting/removing a new element in a middle of a long list is faster.

42 ↗(On Diff #221887)

Please read the other files in the same directory and follow the same style, so that this code doesn't unecessarily look different. This should be uint32_t Size = 0;

109 ↗(On Diff #221887)

Why do you need a virtual dtor?

24 ↗(On Diff #221887)

Do you need to define a dtor?

25–27 ↗(On Diff #221887)

Please run clang-format-diff on this patch.

@Bigcheese Friendly ping from a poor soul coming from D46228:) (Since email ping+devmeeting ping does not work)

rahmanl updated this revision to Diff 222871.Oct 2 2019, 10:45 AM

This update addresses the previous comments including:

1- Adding standard header comments.
2- Adding more comments to the code.
3- Fixing variable and function names to lower case.
rahmanl marked 15 inline comments as done.Oct 2 2019, 10:47 AM
rahmanl updated this revision to Diff 222925.Oct 2 2019, 3:49 PM

This update adds one more test which exercises the different flavors of layout optimization in propeller by toggling reorder-bb, reorder-funcs, and split-funcs.

ruiu added inline comments.Oct 3 2019, 1:32 AM
9 ↗(On Diff #222925)


27 ↗(On Diff #222925)

nit: add two spaces so that the second line of a list item is properly indented.

80–81 ↗(On Diff #222925)

What function are you using from stdio.h?

120 ↗(On Diff #222925)

You can use the foreach-style for loop here: for (std::pair<int64_t, std::unique_ptr<NodeChain>> Elem : Chains)

192 ↗(On Diff #222925)

src is they are ...?

619–622 ↗(On Diff #222925)

It took for a while to find this line. This is the output of this ordering algorithm, right? Is there any other output from this algorithm?

44–48 ↗(On Diff #222925)

Because they are always initialized by the constructor, they shouldn't have default values.

54–59 ↗(On Diff #222925)

In C++, it is generally preferred to initialize members from the beginning using an initializer list instead of updating members after they are constructed. So,

NodeChain(const ELFCFGNode *Node) : DelegateNode(Node), Size(Node->ShSize), Freq(Node->Freq) {
65–67 ↗(On Diff #222925)

Do you really have to define these functions, given that you can just access Nodes to do the same thing?

2–9 ↗(On Diff #222925)

Formatting doesn't seem correct.

rahmanl updated this revision to Diff 223990.Oct 8 2019, 11:05 PM

This update addresses previous comments from @ruiu.
Most importantly, it removes the propeller-align-basic-block functionality since it is not fit with the current code. I will add this functionality in a later patch.

rahmanl marked 6 inline comments as done.Oct 8 2019, 11:08 PM
rahmanl updated this revision to Diff 239456.Jan 21 2020, 4:41 PM

We have made major changes to the Propeller since the last update. The changes are documented here:

The major change for this patch is the inter-procedural reordering algorithm, along with other design changes for the reordering workflow.

rahmanl updated this revision to Diff 240054.Jan 23 2020, 4:49 PM

This update simplifies the update of chain and offset mapping after merging a new chain.
Affected file: PropellerNodeChainBuilder.cpp

rahmanl updated this revision to Diff 240255.Jan 24 2020, 11:14 AM

This update moves all the code layout optimizations under lld/ELF/Propeller/CodeLayout.

rahmanl updated this revision to Diff 241860.Jan 31 2020, 7:24 PM
rahmanl added a reviewer: MaskRay.
rahmanl removed a subscriber: MaskRay.

This update mainly changes the floating point variables in the code layout parameters to integers. 20% speedup is achieved by this.

This update also includes the removal of some functionalities and unnecessary code.

rahmanl updated this revision to Diff 242666.Feb 5 2020, 9:38 AM

This update changes class variable names to camelCase.

rahmanl updated this revision to Diff 242933.Feb 6 2020, 9:56 AM

This update changes the offset field to signed integer.

rahmanl abandoned this revision.Jun 11 2020, 2:26 PM

No longer pursued in LLD with BB-clusters ( D76954) as alternative route.