This is an archive of the discontinued LLVM Phabricator instance.

ext-tsp basic block layout
ClosedPublic

Authored by spupyrev on Nov 8 2021, 10:48 AM.

Details

Summary

A new basic block ordering algorithm improving the existing MachineBlockPlacement.

The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of the classical
(maximum) Traveling Salesman Problem.

The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
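
For illustration, here is a minimal, self-contained C++ sketch of the greedy loop described above. The Chain struct and the gain computation are simplified placeholders, not the classes or the ExtTSP scoring introduced by this patch.

#include <cstdint>
#include <vector>

// Toy stand-in for the patch's chain abstraction: an ordered list of blocks.
struct Chain {
  std::vector<uint64_t> Blocks;
};

// Placeholder for the increase of the ExtTSP value obtained by appending
// Succ to Pred; the real implementation evaluates several merge orders.
static double computeMergeGain(const Chain &Pred, const Chain &Succ) {
  (void)Pred;
  (void)Succ;
  return 0.0;
}

static void greedyMerge(std::vector<Chain> &Chains) {
  while (Chains.size() > 1) {
    double BestGain = 0.0;
    size_t BestPred = 0, BestSucc = 0;
    bool Found = false;
    // Pick the pair of chains whose merging yields the biggest ExtTSP gain.
    for (size_t I = 0; I < Chains.size(); ++I)
      for (size_t J = 0; J < Chains.size(); ++J)
        if (I != J) {
          double Gain = computeMergeGain(Chains[I], Chains[J]);
          if (Gain > BestGain) {
            BestGain = Gain;
            BestPred = I;
            BestSucc = J;
            Found = true;
          }
        }
    // Stop when no merge increases the ExtTSP value; the remaining chains
    // are then sorted by density in decreasing order.
    if (!Found)
      break;
    Chain &Pred = Chains[BestPred];
    const Chain &Succ = Chains[BestSucc];
    Pred.Blocks.insert(Pred.Blocks.end(), Succ.Blocks.begin(),
                       Succ.Blocks.end());
    Chains.erase(Chains.begin() + BestSucc);
  }
}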

An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), chain X is first
split into two sub-chains, X1 and X2, so that three chains, X1, X2, and Y, are
being merged. Then we consider all possible ways of gluing the three chains
(e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing
the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
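
A sketch of the gluing step, under the same illustrative types as above (the enum values mirror the six orders listed; the names are hypothetical, not the patch's API):

#include <cstddef>
#include <cstdint>
#include <vector>

using Blocks = std::vector<uint64_t>;

enum class MergeOrder { X1_Y_X2, X1_X2_Y, X2_X1_Y, X2_Y_X1, Y_X1_X2, Y_X2_X1 };

static Blocks concat(const Blocks &A, const Blocks &B, const Blocks &C) {
  Blocks R(A);
  R.insert(R.end(), B.begin(), B.end());
  R.insert(R.end(), C.begin(), C.end());
  return R;
}

// Split X at SplitIndex into X1 and X2, then glue X1, X2, and Y in the
// requested order; the algorithm scores every candidate and keeps the best.
static Blocks glueChains(const Blocks &X, const Blocks &Y, size_t SplitIndex,
                         MergeOrder Order) {
  Blocks X1(X.begin(), X.begin() + SplitIndex);
  Blocks X2(X.begin() + SplitIndex, X.end());
  switch (Order) {
  case MergeOrder::X1_Y_X2: return concat(X1, Y, X2);
  case MergeOrder::X1_X2_Y: return concat(X1, X2, Y);
  case MergeOrder::X2_X1_Y: return concat(X2, X1, Y);
  case MergeOrder::X2_Y_X1: return concat(X2, Y, X1);
  case MergeOrder::Y_X1_X2: return concat(Y, X1, X2);
  case MergeOrder::Y_X2_X1: return concat(Y, X2, X1);
  }
  return {};
}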

Perf impact:
The perf improvement over the existing MachineBlockPlacement is in the range
of 0.5%-1.5% for medium-sized binaries (e.g., clang/gcc), depending on the PGO
mode (tested with various options of LLVM_BUILD_INSTRUMENTED as well as
AutoFDO and CSSPGO).
For large-scale fully-optimized binaries running in production, we measure 0.2%-0.8%
speedup (with AutoFDO/CSSPGO but excluding post-link optimizers).
For smaller (not front-end bound) binaries, we do not expect regressions; for SPEC17:

  • 508.namd_r: 8.96+/-3.89 (win)
  • 602.gcc: -0.93+/-0.47 (regression)
  • 623.xalancbmk: 2.12+/-0.84 (win)
  • 625.x264: 1.41+/-0.7 (win)
  • other binaries are flat.

Performance of the algorithm:
We haven't seen a change in build time for large binaries.

Diff Detail

Event Timeline

spupyrev created this revision.Nov 8 2021, 10:48 AM
spupyrev requested review of this revision.Nov 8 2021, 10:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2021, 10:48 AM
spupyrev edited the summary of this revision. (Show Details)Nov 8 2021, 11:03 AM
wenlei added a subscriber: davidxl.Nov 8 2021, 11:19 AM

cc @davidxl We're bringing the ext-tsp layout algorithm to LLVM. The algorithm has been used for years in BOLT and has proven to be effective, and it now shows good results with compiler PGO too.

Thanks for working on this. I added myself as a reviewer too.

rahmanl added a subscriber: rahmanl.

Thanks for sending this to LLVM. I implemented your algorithm in Propeller and would be interested in reviewing as well.

Thanks, I am looking forward to your reviews and suggestions for improvements!
One goal of the diff is to unify and get rid of multiple implementations of the technique. We internally have two implementations (one in BOLT), which isn't ideal.

Can you try the new layout with CSPGO? I think the improvements should be higher there since it has more precise profiles.

Please also report memory overhead. My concern is that the Chain objects take up non-linear space.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3466

This will give you the number of instructions, which is not exactly the binary size of the block.

llvm/lib/Transforms/Utils/CodeLayout.cpp
52–55

Please consider adding parameters for these. I think it will be useful for users to tune based on their architecture.

94

Please define a struct for the Jump (Src, Sink, Weight).

153

The abstraction for block-edges is very loosely defined. I think defining a struct will help.

229

Do we really need to store a boolean here? It adds extra burden to make sure we keep it consistent after merging.

235

How do we ensure the obsolete chains are removed from this structure? Would it make life easier if we defined it as a map from Chain pointers (or Chain ids)?

241

Maybe call this ChainEdge so it's not confused with a basic block edge/jump.

475

I think fallthrough may be a misnomer here. Normally, a block's fallthrough successor is its fallthrough block in the original layout regardless of whether these blocks have other outgoing edges.
I suggest renaming the concept. My suggestion is "mutually forced" or "mutually dominated" edges.

493

Please explain why you choose such a block.

592

This computes the total score for the jumps in the jump list. Please change the comment accordingly.

635

I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc, and this is independent of the size of ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit/overhead in a better way.

676–679

This seems to be an assertion that is unrelated to the purpose of this code block. Could you please remove or move it to a relevant position?

684

Why not do X2_Y_X1 here?

696

Does this actually merge the chains together?

710

Is this the new score or the change in the score? ChainSucc's intra-chain jumps are not included in the Jumps vector.

712

This is inconsistent with the function comment. CurGain is not updated here, but rather in the caller of this function.

My suggestion is to add the following function to MergeGainTy and use it in the callsite. This way you can remove the first parameter from computeMergeGain and also computeMergeGain can return what it computes.

void updateIfLessThan(const MergeGainTy &OtherGain) {
  if (*this < OtherGain)
    *this = OtherGain;
}
721

Rename this to XSplitOffset.

767

It seems that the obsolete chains are not deallocated from the AllChains vector. Does this raise any memory concerns?

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2

The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.

davidxl added inline comments.Nov 9 2021, 11:01 AM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
575

Perhaps make the name clearer: createCFGChainExtTSP()?

3410

Why do the extTSP layout after the existing layout? It seems like a waste of compile time.

Or is the intention to keep the taildup functionality there? However, the tail dup decisions depend on layout decisions, so it is probably better integrated with extTSP. Also, taildup's profile update may not be ideal, affecting later layout score computation.

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2

The extTSP algorithm should be able to handle the common loop rotation case. Can you add a test case for it?

14

We can reason that the weight of the incoming edge to the merged block needs to be more than double that of the side branch of the triangular-shaped CFG to make a non-topological order profitable. In this case it is 100 > 2*40. Can you add a negative case that keeps the topological order?

29

Fix typo

spupyrev updated this revision to Diff 387646.Nov 16 2021, 8:09 AM

Addressing Rahman's and David's comments:

  • applied (most of the) suggested renamings;
  • added comments and clarifications;
  • introduced Jump class instead of pair<pair<Block, Block>, ExecutionCount>;
  • added new tests;
  • introduced params for alg options.
spupyrev marked 18 inline comments as done.Nov 16 2021, 8:21 AM

Re memory utilization:

I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.

To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.

Re perf tests on IR/CSIR:

Here are my measurements for clang-10 binary with different PGO modes. Of course we should keep in mind that the numbers won't fully generalize to other binaries/benchmarks and only provide one data point.

Comparing clang-10 using 200 iterations on input1.cpp

[test] -mllvm -enable-ext-tsp-block-placement=1
[control] -mllvm -enable-ext-tsp-block-placement=0
-DLLVM_BUILD_INSTRUMENTED=FE
  task-clock      :         5942 (+- 0.19%) vs         6027 (+- 0.19%) [diff = -1.4170%]
  binary_size     :         3316628472      vs         3325095000      [diff = -0.2546%]
-DLLVM_BUILD_INSTRUMENTED=IR
  task-clock      :         5793 (+- 0.20%) vs         5876 (+- 0.20%) [diff = -1.4050%]
  binary_size     :         3305880848      vs         3325234728      [diff = -0.5820%]
-DLLVM_BUILD_INSTRUMENTED=CSIR
  task-clock      :         5705 (+- 0.26%) vs         5757 (+- 0.26%) [diff = -0.8985%]
  binary_size     :         3312445120      vs         3329071760      [diff = -0.4994%]
CSSPGO
  task-clock      :         6110 (+- 0.37%) vs         6226 (+- 0.37%) [diff = -1.8617%]
  binary_size     :         3995536616      vs         3980411480      [diff = +0.3799%]
AutoFDO
  task-clock      :         5711 (+- 0.19%) vs         5797 (+- 0.19%) [diff = -1.4882%]
  binary_size     :         3707959712      vs         3694537688      [diff = +0.3632%]
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410

I don't have a strong opinion here but here is my intuition. Ext-tsp is designed to "improve" the existing layout with the help of profile data. When there are ties or when the profile data is absent, the algorithm falls back to the original ordering of basic blocks. Thus, we want the original order to be reasonable. So we first apply the existing machine block placement, which applies several sane tricks like loop rotation, and then improve it with ext-tsp.

Keeping the existing tail duplication also sounds reasonable to me, though this optimization could be applied separately. Btw, we are working on improving and replacing the existing taildup heuristic; I'll share diffs/prototypes if/when it produces good results.

llvm/lib/Transforms/Utils/CodeLayout.cpp
235

I think this was done for optimizing the performance. The tradeoff here is finding an adjacent edge (via getEdge(Chain *Other)) vs iterating over and appending the edges. (The former is done only once, when the two chains are merged). Empirically, std::vector+linear search was faster than std::unordered_map on my benchmarks by 10%-15%. Probably an alternative map implementation may work better; not sure if that's worth it though.
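
For reference, a minimal sketch of the vector-plus-linear-search lookup described here; Chain and ChainEdge are simplified stand-ins, not the patch's actual classes:

#include <utility>
#include <vector>

struct ChainEdge;

struct Chain {
  // Adjacent chains and the connecting edges, stored as a flat vector.
  std::vector<std::pair<Chain *, ChainEdge *>> Edges;

  // Linear search over the adjacency list; it runs once per chain pair at
  // merge time, which empirically beat std::unordered_map on small instances.
  ChainEdge *getEdge(Chain *Other) const {
    for (const auto &It : Edges)
      if (It.first == Other)
        return It.second;
    return nullptr;
  }
};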

635

I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and is responsible for just a few lines of code. Not sure if the separation would noticeably simplify reviewing.

I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they have been tuned a few years ago on a couple of workloads running on a particular hardware. I would really appreciate the community help.

For now, I'd however keep the params and optimizations as is, so that the existing customers do not see sudden regressions.

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2

Added some (possibly naive) test. Let me know if you have something more complex in mind.

2

I added some toy test, let me know if you have something more complex in mind.

davidxl added inline comments.Nov 18 2021, 12:35 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410

When you say extTSP relies on the existing layout being reasonable, do you have performance data to back it up? If the profile is not available, the blocks are likely cold.

Also, extTSP should be able to do 'loop rotation' by splitting during chain merge; are there cases where this is not handled?

spupyrev updated this revision to Diff 388301.Nov 18 2021, 1:09 PM

changes:

  • a better way of estimating block sizes in the binary;
  • a small correction to how branches are reconstructed after reordering.
spupyrev added inline comments.Nov 18 2021, 1:28 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410

Running ext-tsp as a postprocessing is a relatively minor "optimization" (that's why I said "I don't have a strong opinion here..."). If you think we should run it instead of the existing algorithm, I am happy to do so.

It is mostly to avoid some corner cases, e.g.: A user supplies an incomplete (or even empty) sampling-based profile, which only covers a part of hot functions. In that case ext-tsp doesn't modify the layout but the existing MachineBlockPlacement can apply some tricks. For instr-PGO, that's not an issue.
Another scenario is when there are "ties", e.g., block orders with equal objective. For example, a function with 3 blocks and 2 equal-weight jumps, A->B, A->C. In that case, ext-tsp cannot decide between [A, B, C] and [A, C, B], and we probably want to use some other heuristics.

I am not aware of "simple" cases when ext-tsp produces a sub-optimal result; it will rotate loops, if that's dictated by profile counts. However, it is a heuristic for an NP-hard problem, so we should not expect it to work optimally in 100% of cases.

Please see my inline comments. Since we keep all the Chains in the vector (they're not destructed), space usage is currently super-linear. We have a choice of deallocating the space (using shrink_to_fit, resize, or swap), and that's a time-space tradeoff. I am not sure which one is better; since this is already super-linear time, it doesn't hurt to make those calls and keep the space linear. Also, you can use llc on a large function to compare memory usage with and without -enable-ext-tsp-block-placement. It might very well be the case that the memory consumption remains the same. In that case, maybe you don't want to free up the storage either.

Thanks for sharing the detailed results.

llvm/lib/Transforms/Utils/CodeLayout.cpp
128

This can be removed if we initialize with zero.

138

Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.

263–264

Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).
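
For reference, a small sketch of the options mentioned here (the element type is arbitrary); shrink_to_fit is only a non-binding request, while swapping with an empty temporary is guaranteed to drop the allocation:

#include <vector>

void releaseStorage(std::vector<int> &V) {
  V.clear();
  V.shrink_to_fit();          // may release the capacity (non-binding)
  std::vector<int>().swap(V); // guaranteed: V now owns an empty buffer
}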

305

Same comment. Storage won't be released.

635

Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero.

One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like PredChain->blocks[Offset]->ForcedSucc != nullptr) and calls computeMergeGain for each merge type. I think we can make this function more readable and concise as well.

660

Please rename this to GetBestMergeGain for consistency with the function comment and to further separate it from computeMergeGain.

684

I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?

839

Change comment to clearly mention this is for putting the entry block at the beginning.

841–844

We can simply use return C1->isEntry()?

848–853

How about shortening this to return (D1 != D2) ? (D1 > D2) : (C1->id() < C2->id())?

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2

Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).

foo
  |   \
  |     \10
  |       \
  |         v
  |17       foo.bb1
  |          / 
  |        /  10
  v      v 
foo.bb2
spupyrev updated this revision to Diff 388575.Nov 19 2021, 11:22 AM
spupyrev marked 12 inline comments as done.

addressing comments

llvm/lib/Transforms/Utils/CodeLayout.cpp
138

I feel like a negative score better reflects the intention here: it indicates that the merge is disadvantageous for the quality or simply "invalid". A score of "0" means that the merge is neutral for perf, so it is up to the algorithm to decide whether such a merge needs to be done.
At some point we were experimenting with merging 0-score chains (e.g., two chains connected by a cold jump). It might be beneficial for certain cases, e.g., for code size. I guess we can play with this more in the future.

263–264

I don't think that makes any measurable difference, as we're working with relatively small instances. But explicitly freeing up memory won't hurt, I guess.

635

Added an option to disable this extra optimization.

Not sure I understand the proposal here.

684

Yes, it is not giving much benefit. I checked a binary with ~100K instances: adding this type of split makes a difference for 4 CFGs but is responsible for ~25% of splits. Probably keeping the split isn't worth it.

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2

Added this test (func4)

spupyrev added inline comments.Nov 19 2021, 3:36 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410

I just tried to measure the impact of changing the algorithm from being a post-processing step to replacing the existing approach (buildCFGChains). Here are the results:

benchmark1:   task-clock [delta: 24.27 ± 18.78, delta(%): 0.4010 ± 0.3103, p-value: 0.103827]
benchmark2:   task-clock [delta: 54.77 ± 20.91, delta(%): 0.9070 ± 0.3463, p-value: 0.000044]
benchmark3:   task-clock [delta: 2.43 ± 20.20, delta(%): 0.0279 ± 0.2314, p-value: 0.680156]

So for two benchmarks the difference is not statistically significant; for one, there is a ~0.9% regression. I feel like we should keep the implementation as is.

Would it help to add remarks in a later diff to get some insights into the algorithm?
https://llvm.org/docs/Remarks.html

davidxl added inline comments.Nov 20 2021, 2:32 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410

Ok, we can keep ext-tsp as a cleanup pass.

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
304

This might be a bug in the existing layout. By design, with profile data, in order to take b2 as the layout successor, the incoming edge weight needs to be > 2*10 = 20.

llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗(On Diff #388575)

Add a cfg with ascii art.

11 ↗(On Diff #388575)

Explain a little the expected output.

29 ↗(On Diff #388575)

Add brief explanation of expected output.

hoy added inline comments.Nov 22 2021, 9:14 AM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573

The object is created but not used anywhere. Is that expected?

llvm/lib/Transforms/Utils/CodeLayout.cpp
111

nit: please use enum class and add comment for each enum value.

906

This is an assert-only loop. Put it under #ifndef NDEBUG ?

rahmanl added inline comments.Nov 22 2021, 2:36 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
138

I think we would never merge 0-score chains. If there is any additional benefit (code size) it can be incorporated into the score. But I am OK with this.

635

You can use something like the following function:

auto ComputeMergeGainWithOffsetAndTypes = [&](size_t Offset,
                                              const std::vector<MergeTypeTy> &MergeTypes) {
  if (Offset == 0 || Offset == ChainPred->blocks().size())
    return;
  auto BB = ChainPred->blocks()[Offset - 1];
  // Skip the splitting if it breaks FT successors
  if (BB->ForcedSucc != nullptr)
    return;
  for (const auto &MergeType : MergeTypes)
    Gain.updateIfLessThan(
        computeMergeGain(ChainPred, ChainSucc, Jumps, Offset, MergeType));
};
684

Great. Please add a comment to explain why this splitting is not exercised.

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
5

This is missing the ascii cfg.

306

Please use the same names as in the CFG so this can be easily mapped. And please do this across all tests.

llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
1 ↗(On Diff #388575)

I am not an expert on this, but I feel like this is not needed. Does it make a difference?

spupyrev updated this revision to Diff 389226.Nov 23 2021, 9:02 AM
spupyrev marked 13 inline comments as done.

addressing comments:

  • adjusted tests;
  • enum class;
  • minor refactorings as suggested.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573

What object do you refer to?

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
304

This test uses two different modes of ext-tsp, with and without "chain splitting" -- something that Rahman suggested. The "old" layout in MachineBlockPlacement isn't applied here.

llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗(On Diff #388575)

The purpose of the test is to verify that the chain-splitting code is executed and that it is beneficial for the layout, that is, that the optimized score is increased. I added the explanations.

Since the graph is large and randomly generated, the ascii representation is not very readable. In fact, it is not providing any insights, so let's skip it?

davidxl added inline comments.Nov 23 2021, 9:53 AM
llvm/lib/Transforms/Utils/CodeLayout.cpp
86

The paper documents 6 types of branches and their weight K_{s,t}. Should these be parameterized here?

98

The TSP score does not seem to be continuous here. When Dist == 0, it should have the same score as the fall-through case, but this produces 0.1 * Count.
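
For reference, a minimal sketch of the score shape being discussed, using the FallthroughWeight = 1.0 and ForwardWeight = 0.1 values mentioned later in this thread; ForwardDistance is a hypothetical distance limit, not the patch's actual constant. The discontinuity is visible at Dist == 0: a non-fallthrough forward jump scores 0.1 * Count, while a fallthrough scores 1.0 * Count.

// Illustrative only; the parameter values follow the discussion, not the code.
double jumpScore(bool IsFallthrough, double Dist, double Count) {
  const double FallthroughWeight = 1.0;
  const double ForwardWeight = 0.1;
  const double ForwardDistance = 1024.0; // hypothetical threshold in bytes
  if (IsFallthrough)
    return FallthroughWeight * Count;
  if (Dist <= ForwardDistance)
    return ForwardWeight * Count * (1.0 - Dist / ForwardDistance);
  return 0.0;
}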

640

It is confusing to have both this method and calcExtTSPScore methods.

927

From the initialization, Order[Idx] == Idx, so is this map needed?

936

typo 'Increasing'

938

why is the count not used?

940

This function computes the 'fallthrough'-maximizing TSP score, not the extTSP in the original paper. Where are the h function and the K coefficients?

948

Can the reserve be combined with the vector declaration, or is the intention to avoid initialization?

llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗(On Diff #388575)

Ok. I was expecting to look into the test case and reason that the improved score corresponds to an improved layout. Looks like this is just a test to show an increased score. Can this be done instead with any other test (the score should increase monotonically)?

spupyrev updated this revision to Diff 389262.Nov 23 2021, 10:55 AM
spupyrev marked 5 inline comments as done.

David's comments (except for the large test)

llvm/lib/Transforms/Utils/CodeLayout.cpp
98

Agreed, that's something my colleagues and I have been thinking about lately.

The current function is tuned for the maximum performance of the resulting binary. It is not guaranteed/enforced that the function is continuous. That said, I'd be super happy to see and review improvements of the objective in the future. There is likely room for optimization here.

927

This function is also called when Order != Identity

940

Oops. Refactored the function to compute the correct result.

spupyrev updated this revision to Diff 389299.Nov 23 2021, 1:14 PM

changed "large" test into a human-readable one, added ascii for the CFG

davidxl added inline comments.Nov 23 2021, 2:10 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
98

I guessed that with ForwardWeight == 0.1, a bias could be introduced against laying out the larger successor as the fall-through block, but it is not the case. Using a different forward weight seems to produce the same decision for a simple diamond CFG case -- in order for one successor to be a fallthrough, the branch probability needs to be > 50%.

I wonder what role this weight plays in practice (performance)?

927

Probably provide another wrapper to compute the mapping and pass it in. The map can use the vector type, so for the case when Order == Identity, Order can be used for both.

spupyrev added inline comments.Nov 23 2021, 2:27 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
98

I am not sure I understand the question, could you clarify?

The weights affect the resulting layout, which in turn may impact perf. However, changing the weights from 0.1 to, say, 0.2 likely won't make a difference, except for some corner cases.
IIRC, in my tests 3 years ago it was important to keep ForwardWeight < FallthroughWeight = 1.0. Intuitively this makes sense, though the exact values may need to be re-tuned. (This is a separate and time-consuming project.)

davidxl added inline comments.Nov 23 2021, 3:23 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
98

Your answer touched upon what I was asking. To clarify the question: is there performance data to show why 0.1 (instead of other values) is used? The ExtTSP score can be considered a locality score, and the ForwardWeight is an attempt to model the cost of a taken branch instruction. I wonder if there is a better way to model this in the objective function. No need to address it in this patch though.

davidxl added inline comments.Nov 24 2021, 12:22 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3459

Is it better to use a vector type, as the index space is dense?

3461

vector type?

3474

Why not compute the actual byte size of the block?

3480

Is a multi-graph assumed? Why?

3494

Can the following (computing NewBlockOrder) be folded into applyExtTspLayout? NewOrder is not used anywhere else, so it does not need to be exposed.

3531

Is NewIndex the same as BlockIndex computed in applyExtTsp? Why recomputing?

spupyrev updated this revision to Diff 391069.Dec 1 2021, 9:56 AM
spupyrev marked 4 inline comments as done.
  • changed DenseMap => std::vector
  • got rid of multi-edges
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3474

That would be great! Do you mind pointing to the right way of implementing this? I was not able to find anything meaningful in MachineBasicBlock.

3494

I am not fully understanding the suggestion.

The goal here is to make applyExtTspLayout work with any type of nodes, not just MachineBasicBlock (potentially we want to apply it for function sorting). I tried creating a generic implementation with templates (and thus "hiding" NewOrder), but it turned out to be much messier than what it is now.

3531

No, this is a new index after reordering.

davidxl added inline comments.Dec 1 2021, 3:30 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3474

This is target dependent unfortunately -- an abstract interface can be added in MCCodeEmitter. The implementation should be similar to encodeInstruction.

This is beyond the scope of the patch, so perhaps add a TODO in the comment.

spupyrev updated this revision to Diff 391380.Dec 2 2021, 10:23 AM

adding a note on computing block sizes exactly

davidxl accepted this revision.Dec 2 2021, 10:31 AM

LGTM, please also wait for the other reviewers' LGTMs.

This revision is now accepted and ready to land.Dec 2 2021, 10:31 AM
spupyrev updated this revision to Diff 391395.Dec 2 2021, 10:56 AM

adding 'mtriple' back to test options, as MachineBlockPlacement isn't deterministic between different platforms (so tests fail on Windows)

LGTM as well.

hoy accepted this revision.Dec 3 2021, 1:52 PM

lgtm except for some nits. Thanks.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573

Never mind, the object pointed to by FunctionChain is used in FunctionChain->merge.

llvm/lib/Transforms/Utils/CodeLayout.cpp
61

nit: the distance in use isn't really in bytes? Is it the number of MIR instructions?

274

nit: we usually use uint64_t for id type explicitly.

898

nit: no need to use #ifndef NDEBUG

spupyrev updated this revision to Diff 391747.Dec 3 2021, 2:56 PM

changed type of Chain.Id to uint64

llvm/lib/Transforms/Utils/CodeLayout.cpp
61

As discussed in https://reviews.llvm.org/D113424/new/#inline-1097352, we use block sizes in bytes, approximated as num_mir_instr * 4.

Equivalently, we could use the number of instructions in both places (by scaling the constants down by 4), but I don't really see a difference.
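
For clarity, a small sketch of the approximation under discussion; the helper name is hypothetical, and the exact byte size would require target-specific encoding information:

#include "llvm/CodeGen/MachineBasicBlock.h"
#include <cstdint>

// Estimate the byte size of a block as num_mir_instr * 4, as described above.
static uint64_t approximateBlockSize(const llvm::MachineBasicBlock &MBB) {
  return static_cast<uint64_t>(MBB.size()) * 4;
}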

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Dec 6 2021, 4:02 PM

Looks like this broke CodeGen/X86/code_placement_ext_tsp.ll on arm macs:

http://45.33.8.238/macm1/23133/step_11.txt

Please take a look, and revert for now if it takes a while to fix.

thakis added a comment.Dec 6 2021, 4:06 PM

Also here: https://lab.llvm.org/buildbot/#/builders/186/builds/2803 (which should've sent an email long ago). Reverting...

spupyrev reopened this revision.Dec 6 2021, 4:27 PM
This revision is now accepted and ready to land.Dec 6 2021, 4:27 PM
spupyrev updated this revision to Diff 392227.Dec 6 2021, 4:37 PM

making the tests deterministic

This revision was automatically updated to reflect the committed changes.
MaskRay added inline comments.
llvm/lib/Transforms/Utils/CodeLayout.cpp
717

This should be Jump->Target. See https://github.com/llvm/llvm-project/pull/66592

Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2023, 1:10 AM
Herald added a subscriber: wlei. · View Herald Transcript
kosarev added a subscriber: kosarev.Oct 9 2023, 8:10 AM
kosarev added inline comments.
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

This seems to fail on expensive checks, see https://github.com/llvm/llvm-project/issues/68594.

Do {Begin,End}{2,3} really form valid ranges when defaulted to BlockIter() in the constructor?

spupyrev added inline comments.Oct 10 2023, 3:04 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

Can you teach me to reproduce the failure? I'm building with "-DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON" and running "ninja check-all", but I still don't see it.

kosarev added inline comments.Oct 11 2023, 12:39 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

It fails on libc++'s consistency checks, which get enabled automatically on builds with LLVM's expensive checks turned on, -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON.

spupyrev added inline comments.Oct 11 2023, 3:56 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

hmm. I still cannot reproduce this. Here is my command:

cmake \
    -G Ninja \
    -DCMAKE_ASM_COMPILER=$MYCLANG \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_CXX_COMPILER=$MYCLANG++ \
    -DCMAKE_C_COMPILER=$MYCLANG \
    -DLLVM_TARGETS_TO_BUILD="X86;AArch64" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld;bolt" \
    -DCMAKE_EXE_LINKER_FLAGS="-L /usr/lib64" \
    -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON \
    ../llvm

     ninja check-all

Is something else missing?

kosarev added inline comments.Oct 12 2023, 7:30 AM
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

Looks OK to me. I'd make sure the code is rebuilt from scratch and _GLIBCXX_DEBUG is defined, then look into why libc++ doesn't do the checks. I'm attaching a backtrace of one of the failures, in case it may be of any help.

Regardless, I think the nature of the problem is very clear: when the Begin1/2 and End1/2 iterators are singular, they should not be used to form any ranges. Once we have a fix, I'd be happy to test it for you.

[ RUN      ] CodeLayout.ThreeFunctions
/usr/include/c++/11/debug/vector:617:
In function:
    std::debug::vector<_Tp, _Allocator>::iterator std::debug::vector<_Tp, 
    _Allocator>::insert(std::debug::vector<_Tp, _Allocator>::const_iterator, 
    _InputIterator, _InputIterator) [with _InputIterator = 
    gnu_debug::_Safe_iterator<gnu_cxx::normal_iterator<{anonymous}::NodeT* 
    const*, std::vector<{anonymous}::NodeT*, 
    std::allocator<{anonymous}::NodeT*> > >, std::
    debug::vector<{anonymous}::NodeT*>, std::random_access_iterator_tag>; 
    <template-parameter-2-2> = void; _Tp = {anonymous}::NodeT*; _Allocator = 
    std::allocator<{anonymous}::NodeT*>; std::debug::vector<_Tp, 
    _Allocator>::iterator = std::
    debug::vector<{anonymous}::NodeT*>::iterator; std::debug::vector<_Tp, 
    _Allocator>::const_iterator = std::
    debug::vector<{anonymous}::NodeT*>::const_iterator]

Error: function requires a valid iterator range [first, last).

Objects involved in the operation:
    iterator "first" @ 0x7ffe7dbd1c40 {
      state = singular;
    }
    iterator "last" @ 0x7ffe7dbd1c70 {
      state = singular;
    }
 #0 0x00007f5c10bc3b02 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:22
 #1 0x00007f5c10bc3f1e PrintStackTraceSignalHandler(void*) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:798:1
 #2 0x00007f5c10bc1363 llvm::sys::RunSignalHandlers() /home/kosarev/labs/llvm-project/llvm/lib/Support/Signals.cpp:105:20
 #3 0x00007f5c10bc339a SignalHandler(int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #4 0x00007f5c10447520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f5c1049b9fc __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #6 0x00007f5c1049b9fc __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #7 0x00007f5c1049b9fc pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #8 0x00007f5c10447476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007f5c1042d7f3 abort ./stdlib/./stdlib/abort.c:81:7
#10 0x00007f5c106d21fb std::__throw_bad_exception() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa51fb)
#11 0x00007f5c12dd8740 __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT**, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag> std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >::insert<__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, void>(__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>) /usr/include/c++/11/debug/vector:617:4
#12 0x00007f5c12dcee70 (anonymous namespace)::MergedChain::getNodes() const /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:504:18
#13 0x00007f5c12dd5141 (anonymous namespace)::CDSortImpl::mergeChains((anonymous namespace)::ChainT*, (anonymous namespace)::ChainT*, unsigned long, (anonymous namespace)::MergeTypeT) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1288:16
#14 0x00007f5c12dd4211 (anonymous namespace)::CDSortImpl::mergeChainPairs() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1144:50
#15 0x00007f5c12dd302c (anonymous namespace)::CDSortImpl::run() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1008:5
#16 0x00007f5c12dd615c llvm::codelayout::computeCacheDirectedLayout(llvm::codelayout::CDSortConfig const&, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1435:3
#17 0x00007f5c12dd6333 llvm::codelayout::computeCacheDirectedLayout(llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1453:48
#18 0x00005638fc0abb97 (anonymous namespace)::CodeLayout_ThreeFunctions_Test::TestBody() /home/kosarev/labs/llvm-project/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp:18:78
#19 0x00007f5c13e2ef1a void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2612:28
#20 0x00007f5c13e1f20f void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#21 0x00007f5c13df2bca testing::Test::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2694:30
#22 0x00007f5c13df350b testing::TestInfo::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2839:3
#23 0x00007f5c13df3eea testing::TestSuite::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:3017:35
#24 0x00007f5c13e01306 testing::internal::UnitTestImpl::RunAllTests() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5921:41
#25 0x00007f5c13e315fd bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2614:1
#26 0x00007f5c13e20b7a bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#27 0x00007f5c13dffc25 testing::UnitTest::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5487:14
#28 0x00007f5c13efde7f RUN_ALL_TESTS() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/include/gtest/gtest.h:2317:80
#29 0x00007f5c13efdd61 main /home/kosarev/labs/llvm-project/third-party/unittest/UnitTestMain/TestMain.cpp:55:24
#30 0x00007f5c1042ed90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#31 0x00007f5c1042ee40 call_init ./csu/../csu/libc-start.c:128:20
#32 0x00007f5c1042ee40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#33 0x00005638fc0423e5 _start (/home/kosarev/labs/llvm-project/build/debug+expensive_checks/unittests/Transforms/Utils/./UtilsTests+0x1f3e5)
spupyrev added inline comments.Oct 12 2023, 11:18 AM
llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406

Thanks for such a detailed analysis and for teaching me about singular iterators.
https://github.com/llvm/llvm-project/pull/68917