
ext-tsp basic block layout

Authored by spupyrev on Nov 8 2021, 10:48 AM.



A new basic block ordering that improves on the existing MachineBlockPlacement.

The algorithm tries to find a layout of the nodes (basic blocks) of a given CFG
that optimizes jump locality and thus processor I-cache utilization. This is
achieved by increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of the classical
(maximum) Traveling Salesman Problem.

The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.

An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., those based on the approach of Pettis-Hansen), when merging
two chains, X and Y, the chain X is first split into two sub-chains, X1 and
X2. We then consider all possible ways of gluing the three pieces (e.g., X1YX2,
X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest
score. This improves the quality of the final result (the search space is
larger) while keeping the implementation sufficiently fast.

Perf impact:
The perf improvement over the existing MachineBlockPlacement is in the range
of 0.5%-1.5% for medium-sized binaries (e.g., clang/gcc), depending on the PGO
mode (tested with various options of LLVM_BUILD_INSTRUMENTED and AutoFDO).
For large-scale fully optimized binaries running in production, we measure a
0.2%-0.8% speedup (with AutoFDO/CSSPGO but excluding post-link optimizers).
For smaller (not front-end bound) binaries, we do not expect regressions; for SPEC17:

  • 508.namd_r: 8.96+/-3.89 (win)
  • 602.gcc: -0.93+/-0.47 (regression)
  • 623.xalancbmk: 2.12+/-0.84 (win)
  • 625.x264: 1.41+/-0.7 (win)
  • other binaries are flat.

Performance of the algorithm:
We have not observed a build-time change for large binaries.


Event Timeline


Would it help to add remarks in a later diff to get some insights into the algorithm?

davidxl added inline comments.Nov 20 2021, 2:32 PM

Ok, we can keep ext-tsp as a cleanup pass.


This might be a bug in the existing layout. By design, with profile data, in order to take b2 as the layout successor, the incoming edge weight needs to be > 2*10 = 20.


Add a CFG with ASCII art.


Explain a little the expected output.


Add brief explanation of expected output.

hoy added inline comments.Nov 22 2021, 9:14 AM

The object is created but not used anywhere. Is that expected?


nit: please use enum class and add comment for each enum value.


This is an assert-only loop. Put it under #ifndef NDEBUG ?

rahmanl added inline comments.Nov 22 2021, 2:36 PM

I think we would never merge 0-score chains. If there is any additional benefit (code size) it can be incorporated into the score. But I am OK with this.


You can use something like the following function (sketch; the missing `= `, `return`s, and braces filled in):

auto ComputeMergeGainWithOffsetAndTypes = [&](int Offset,
                                              std::vector<MergeTypeTy> &merge_types) {
  if (Offset == 0 || Offset == PredChain.size())
    return;
  auto BB = ChainPred->blocks()[Offset - 1];
  // Skip the splitting if it breaks FT successors
  if (BB->ForcedSucc != nullptr)
    return;
  for (auto &MergeType : merge_types) {
    Gain.updateIfLessThan(
        computeMergeGain(ChainPred, ChainSucc, Jumps, Offset, MergeType));
  }
};

Great. Please add a comment to explain why this splitting is not exercised.


This is missing the ascii cfg.


Please use the same names as in the CFG so this can be easily mapped. And please do this across all tests.


I am not an expert on this, but I feel like this is not needed. Does it make a difference?

spupyrev updated this revision to Diff 389226.Nov 23 2021, 9:02 AM
spupyrev marked 13 inline comments as done.

addressing comments:

  • adjusted tests;
  • enum class;
  • minor refactorings as suggested.

What object do you refer to?


This test uses two different modes of ext-tsp, with and without "chain splitting" -- something that Rahman suggested. The "old" layout in MachineBlockPlacement isn't applied here.


The purpose of the test is to verify that chain splitting code is executed and that it is beneficial for the layout, that is, the optimized score is increased. I added the explanations.

Since the graph is large and randomly generated, the ASCII representation is not very readable. In fact, it does not provide any insight, so let's skip it?

davidxl added inline comments.Nov 23 2021, 9:53 AM

The paper documents 6 types of branches and their weights K_{s,t}. Should these be parameterized here?


The TSP score does not seem to be continuous here. When Dist == 0, it should have the same score as the fall-through case, but this produces 0.1 * Count.


It is confusing to have both this method and the calcExtTSPScore method.


From the initialization, Order[Idx] == Idx, so is this map needed?


typo 'Increasing'


why is the count not used?


This function computes the 'fallthrough'-maximizing TSP score, not the ExtTSP score from the original paper. Where are the h function and the K coefficients?


Can the reserve be combined with the vector declaration, or the intention is to avoid initialization?


ok. I was expecting to look into the test case and reason that the improved score corresponds to an improved layout. Looks like this is just a test to show an increased score. Can this be done instead with any other test (the score should increase monotonically)?

spupyrev updated this revision to Diff 389262.Nov 23 2021, 10:55 AM
spupyrev marked 5 inline comments as done.

David's comments (except for the large test)


Agreed, that's something my colleagues and I have been thinking about lately.

The current function is tuned for the maximum performance of the resulting binary. It is not guaranteed/enforced that the function is continuous. That said, I'd be super happy to see and review improvements of the objective in the future. There is likely room for optimization here.


This function is also called when Order != Identity.


Oops. Refactored the function to compute the correct result.

spupyrev updated this revision to Diff 389299.Nov 23 2021, 1:14 PM

changed "large" test into a human-readable one, added ascii for the CFG

davidxl added inline comments.Nov 23 2021, 2:10 PM

I guessed that with ForwardWeight == 0.1, a bias could be introduced against laying out the larger successor as the fall-through block, but that is not the case. Using a different forward weight seems to produce the same decision for a simple diamond CFG: in order for one successor to be the fall-through, its branch probability needs to be > 50%.

I wonder what role this weight plays in practice (performance)?


Probably provide another wrapper to compute the mapping and pass it in. The map can use the vector type, so for the case when Order == Identity, Order can be used for both.

spupyrev added inline comments.Nov 23 2021, 2:27 PM

I am not sure I understand the question, could you clarify?

The weights affect the resulting layout, which in turn may impact perf. However, changing the weights from 0.1 to, say, 0.2 likely won't make a difference, except in some corner cases.
IIRC, in my tests three years ago it was important to keep ForwardWeight < FallthroughWeight = 1.0. Intuitively this makes sense, though the exact values may need to be re-tuned. (This is a separate and time-consuming project.)

davidxl added inline comments.Nov 23 2021, 3:23 PM

Your answer touched on what I was asking. To clarify the question: is there performance data showing why 0.1 (instead of other values) is used? The ExtTSP score can be considered a locality score, and ForwardWeight is an attempt to model the cost of a taken branch instruction. I wonder if there is a better way to model this in the objective function. No need to address it in this patch, though.
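For concreteness, the shape of the per-jump score under discussion can be sketched like this. The constants follow the Ext-TSP paper (fall-through weight 1.0, forward/backward taken weights 0.1, distance caps 1024 and 640 bytes); the exact values in the patch are tunable parameters, and the discontinuity at Dist == 0 raised earlier in the review is visible in the first branch:

```cpp
#include <cassert>
#include <cmath>

// Sketch of a per-jump ExtTSP-style contribution. SrcEnd/DstBegin are
// byte addresses in the proposed layout; Count is the profiled
// execution count of the jump. Constants are the paper's values and
// are assumptions here, not necessarily what the patch ships.
double jumpScore(double Count, long SrcEnd, long DstBegin) {
  const double FallthroughWeight = 1.0;
  const double ForwardWeight = 0.1;
  const double BackwardWeight = 0.1;
  const long ForwardDistance = 1024;
  const long BackwardDistance = 640;

  long Dist = DstBegin - SrcEnd;
  if (Dist == 0)
    return FallthroughWeight * Count; // direct fall-through
  if (Dist > 0 && Dist < ForwardDistance) // short forward taken branch
    return ForwardWeight * Count *
           (1.0 - static_cast<double>(Dist) / ForwardDistance);
  if (Dist < 0 && -Dist < BackwardDistance) // short backward taken branch
    return BackwardWeight * Count *
           (1.0 - static_cast<double>(-Dist) / BackwardDistance);
  return 0.0; // too far away: no locality benefit
}
```

A fall-through jump gets the full count, a forward taken branch gets at most a tenth of it (decaying linearly with distance), and anything beyond the caps contributes nothing.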

davidxl added inline comments.Nov 24 2021, 12:22 PM

Is it better to use a vector type, since the index space is dense?


vector type?


Why not compute the actual byte size of the block?


Is a multi-graph assumed? Why?


Can the following (computing NewBlockOrder) be folded into applyExtTspLayout? The NewOrder is not used anywhere else, so it does not need to be exposed.


Is NewIndex the same as BlockIndex computed in applyExtTsp? Why recomputing?

spupyrev updated this revision to Diff 391069.Dec 1 2021, 9:56 AM
spupyrev marked 4 inline comments as done.
  • changed DenseMap => std::vector
  • got rid of multi-edges

That would be great! Do you mind pointing to the right way of implementing this? I was not able to find anything suitable in MachineBasicBlock.


I am not fully understanding the suggestion.

The goal here is to make applyExtTspLayout work with any type of node, not just MachineBasicBlock (potentially we want to apply it to function sorting). I tried creating a generic implementation with templates (and thus "hiding" NewOrder), but it turned out to be much messier than what it is now.


No, this is a new index after reordering.

davidxl added inline comments.Dec 1 2021, 3:30 PM

This is target dependent unfortunately -- an abstract interface can be added in MCCodeEmitter. The implementation should be similar to encodeInstruction.

This is beyond the scope of the patch, so perhaps add a TODO in the comment.

spupyrev updated this revision to Diff 391380.Dec 2 2021, 10:23 AM

adding a note on computing block sizes exactly

davidxl accepted this revision.Dec 2 2021, 10:31 AM

LGTM, please also wait for the other reviewers' LGTMs.

This revision is now accepted and ready to land.Dec 2 2021, 10:31 AM
spupyrev updated this revision to Diff 391395.Dec 2 2021, 10:56 AM

adding 'mtriple' back to the test options, as MachineBlockPlacement isn't deterministic across platforms (so tests fail on Windows)

LGTM as well.

hoy accepted this revision.Dec 3 2021, 1:52 PM

lgtm except for some nits. Thanks.


nvm, the object pointed by FunctionChain is used in FunctionChain->merge.


nit: the distance in use isn't really in bytes? It is the number of MIR instructions?


nit: we usually use uint64_t for id type explicitly.


nit: no need to use #ifndef NDEBUG

spupyrev updated this revision to Diff 391747.Dec 3 2021, 2:56 PM

changed type of Chain.Id to uint64


As discussed, we use block sizes in bytes, approximating them as num_mir_instr * 4.

Equivalently, we could use the number of instructions in both places (by scaling the constants down by 4), but I don't really see a difference.
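As a trivial illustration of the approximation (the helper name is invented for this sketch):

```cpp
#include <cassert>

// Illustrative sketch of the size approximation described above: a
// block's byte size is estimated as its number of MIR instructions
// times 4, an assumed average encoded-instruction width. Exact sizes
// would need target-specific encoding, e.g., via an MCCodeEmitter-based
// interface as suggested earlier in the review.
long approxBlockSizeBytes(long NumMirInstrs) { return NumMirInstrs * 4; }
```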

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Dec 6 2021, 4:02 PM

Looks like this broke CodeGen/X86/code_placement_ext_tsp.ll on Arm Macs:

Please take a look, and revert for now if it takes a while to fix.

thakis added a comment.Dec 6 2021, 4:06 PM

Also here which should've sent email long ago. Reverting...

spupyrev reopened this revision.Dec 6 2021, 4:27 PM
This revision is now accepted and ready to land.Dec 6 2021, 4:27 PM
spupyrev updated this revision to Diff 392227.Dec 6 2021, 4:37 PM

making the tests deterministic

This revision was automatically updated to reflect the committed changes.
MaskRay added inline comments.

This should be Jump->Target. See
