This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/include/llvm/Transforms/Utils/CodeLayout.h
llvm/lib/CodeGen/MachineBlockPlacement.cpp
llvm/lib/Transforms/Utils/CMakeLists.txt
llvm/lib/Transforms/Utils/CodeLayout.cpp
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll

Differential D113424

ext-tsp basic block layout
ClosedPublic

Authored by spupyrev on Nov 8 2021, 10:48 AM.

Details

Reviewers

hoy
wenlei
davidxl
rahmanl

Commits

rGf573f6866e18: ext-tsp basic block layout
rGc68f71eb37c2: ext-tsp basic block layout

Summary

A new basic block ordering improving existing MachineBlockPlacement.

The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of the classical
(maximum) Traveling Salesman Problem.

The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
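The greedy loop described above can be sketched as follows. This is a simplified illustration, not the code from CodeLayout.cpp: the `Jump` and `Chain` types and the function names are hypothetical, the objective counts only fall-through jumps instead of the full ExtTSP value, chains are only concatenated (no splitting), and the final density-based sort and the special handling of the entry block are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not the classes from CodeLayout.cpp.
struct Jump {
  int Src;
  int Dst;
  uint64_t Count;
};
using Chain = std::vector<int>;

// Simplified objective: total count of jumps that become fall-throughs,
// i.e. the target block immediately follows the source block in the chain.
uint64_t fallthroughScore(const Chain &C, const std::vector<Jump> &Jumps) {
  uint64_t Score = 0;
  for (std::size_t I = 0; I + 1 < C.size(); ++I)
    for (const Jump &J : Jumps)
      if (J.Src == C[I] && J.Dst == C[I + 1])
        Score += J.Count;
  return Score;
}

// Greedy merging: start with one chain per block and repeatedly concatenate
// the pair of chains with the largest positive gain.
std::vector<Chain> greedyLayout(int NumBlocks, const std::vector<Jump> &Jumps) {
  std::vector<Chain> Chains;
  for (int B = 0; B < NumBlocks; ++B)
    Chains.push_back({B});

  while (Chains.size() > 1) {
    uint64_t BestGain = 0;
    std::size_t BestX = 0, BestY = 0;
    for (std::size_t X = 0; X < Chains.size(); ++X) {
      for (std::size_t Y = 0; Y < Chains.size(); ++Y) {
        if (X == Y)
          continue;
        Chain Merged = Chains[X];
        Merged.insert(Merged.end(), Chains[Y].begin(), Chains[Y].end());
        // Concatenation can only add fall-throughs, so the gain is >= 0.
        uint64_t Gain = fallthroughScore(Merged, Jumps) -
                        fallthroughScore(Chains[X], Jumps) -
                        fallthroughScore(Chains[Y], Jumps);
        if (Gain > BestGain) {
          BestGain = Gain;
          BestX = X;
          BestY = Y;
        }
      }
    }
    if (BestGain == 0)
      break; // No merge improves the score.
    Chains[BestX].insert(Chains[BestX].end(), Chains[BestY].begin(),
                         Chains[BestY].end());
    Chains.erase(Chains.begin() + BestY);
  }
  return Chains;
}
```

For a diamond with jumps 0->1 (10), 0->2 (90), 1->3 (10), 2->3 (90), this sketch builds the hot chain [0, 2, 3] and leaves block 1 as a separate chain.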

An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), two
chains, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,
X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
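A minimal sketch of the split-and-glue step, under the same simplifying assumptions as before (hypothetical `Jump`/`Chain` types and a fall-through-only score standing in for ExtTSP). It tries every split offset of X and the six orderings listed in the summary; the discussion below indicates that the actual implementation does not evaluate all of them.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not the classes from CodeLayout.cpp.
struct Jump {
  int Src;
  int Dst;
  uint64_t Count;
};
using Chain = std::vector<int>;

// Simplified stand-in for the ExtTSP score: counts fall-through jumps only.
uint64_t fallthroughScore(const Chain &C, const std::vector<Jump> &Jumps) {
  uint64_t Score = 0;
  for (std::size_t I = 0; I + 1 < C.size(); ++I)
    for (const Jump &J : Jumps)
      if (J.Src == C[I] && J.Dst == C[I + 1])
        Score += J.Count;
  return Score;
}

// Concatenates three chains in the given order.
Chain glue(const Chain &A, const Chain &B, const Chain &C) {
  Chain R(A);
  R.insert(R.end(), B.begin(), B.end());
  R.insert(R.end(), C.begin(), C.end());
  return R;
}

// Returns the best-scoring way of merging X and Y, where X may be split
// into X1 and X2 at any interior offset.
Chain bestSplitMerge(const Chain &X, const Chain &Y,
                     const std::vector<Jump> &Jumps) {
  Chain Best = glue(X, Y, {});
  uint64_t BestScore = fallthroughScore(Best, Jumps);
  auto Consider = [&](const Chain &Candidate) {
    uint64_t Score = fallthroughScore(Candidate, Jumps);
    if (Score > BestScore) {
      BestScore = Score;
      Best = Candidate;
    }
  };
  Consider(glue(Y, X, {}));
  for (std::size_t Offset = 1; Offset < X.size(); ++Offset) {
    Chain X1(X.begin(), X.begin() + Offset);
    Chain X2(X.begin() + Offset, X.end());
    const std::array<Chain, 6> Candidates = {
        glue(X1, Y, X2), glue(X1, X2, Y), glue(X2, X1, Y),
        glue(X2, Y, X1), glue(Y, X1, X2), glue(Y, X2, X1)};
    for (const Chain &Candidate : Candidates)
      Consider(Candidate);
  }
  return Best;
}
```

With X = [0, 1, 2], Y = [3] and hot jumps 0->3 and 3->1, plain concatenation cannot make both hot jumps fall through, while the X1 Y X2 ordering [0, 3, 1, 2] can.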

Perf impact:
The perf improvement over the existing MachineBlockPlacement is in the range
of 0.5%-1.5% for medium-sized binaries (e.g., clang/gcc), depending on the PGO
mode (tested with various options of LLVM_BUILD_INSTRUMENTED and AutoFDO,
CSSPGO).
For large-scale fully-optimized binaries running in production, we measure 0.2%-0.8%
speedup (with AutoFDO/CSSPGO but excluding post-link optimizers).
For smaller (not front-end bound) binaries, we do not expect regressions; for SPEC17:

508.namd_r: 8.96+/-3.89 (win)
602.gcc: -0.93+/-0.47 (regression)
623.xalancbmk: 2.12+/-0.84 (win)
625.x264: 1.41+/-0.7 (win)
other binaries are flat.

Performance of the algorithm:
We haven't seen a change in build time for large binaries.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spupyrev created this revision. Nov 8 2021, 10:48 AM

Herald added subscribers: pengfei, hiraditya, mgorny. Nov 8 2021, 10:48 AM

spupyrev requested review of this revision. Nov 8 2021, 10:48 AM

Herald added a project: Restricted Project. Nov 8 2021, 10:48 AM

Herald added a subscriber: llvm-commits.

minor

spupyrev edited the summary of this revision. Nov 8 2021, 11:03 AM

cc @davidxl We're bringing the ext-tsp layout algorithm to LLVM. The algorithm has been used for years in BOLT and has proven to be effective, and it now shows good results with compiler PGO too.

thanks for working on this. I added myself as a reviewer too.

Harbormaster completed remote builds in B133071: Diff 385567. Nov 8 2021, 12:08 PM

Thanks for sending this to LLVM. I implemented your algorithm in Propeller and would be interested in reviewing as well.

Thanks, I am looking forward to your reviews and suggestions for improvements!
One goal of the diff is to unify and get rid of multiple implementations of the technique. We internally have two implementations (one in BOLT), which isn't ideal.

Can you try the new layout with CSPGO? I think the improvements should be higher there since it has more precise profiles.

Please also report memory overhead. My concern is that the Chain objects take up non-linear space.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3466	This will give you the number of instructions which is not exactly the binary size of the block.
llvm/lib/Transforms/Utils/CodeLayout.cpp
52–55	Please consider adding parameters for these. I think it will be useful for users to tune based on their architecture.
94	Please define a struct for the Jump (Src, Sink, Weight).
153	The abstraction for block-edges is very loosely defined. I think defining a struct will help.
229	Do we really need to store a boolean here? It adds extra burden to make sure we keep it consistent after merging.
235	How do we ensure the obsolete chains are removed from this structure? Would it make life easier if we define it as a map from the Chain pointers (or Chain ids) ?
241	Maybe call this ChainEdge so it's not confused with a basic block edge/jump.
475	I think fallthrough may be a misnomer here. Normally, a block's fallthrough successor is its fallthrough block in the original layout regardless of whether these blocks have other outgoing edges. I suggest renaming the concept. My suggestion is "mutually forced" or "mutually dominated" edges.
493	Please explain why you choose such block.
592	This computes the total score for the jumps in the jump list. Please change the comment accordingly.
635	I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc and this is independent from the size of the ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit /overhead in a better way.
676–679	This seems to be an assertion that is unrelated to the purpose of this code block. Could you please remove or move it to a relevant position?
684	Why not do X2_Y_X1 here?
696	Does this actually merge the chains together?
710	Is this the new score or the change in the score? `ChainSucc`'s intra-chain jumps are not included in the `Jumps` vector.
712	This is inconsistent with the function comment. CurGain is not updated here, but rather in the caller of this function. My suggestion is to add the following function to MergeGainTy and use it in the callsite. This way you can remove the first parameter from `computeMergeGain` and also `computeMergeGain` can return what it computes. void updateIfLessThan(const MergeGainTy &OtherGain) { if (*this < OtherGain) *this = OtherGain; }
721	Rename this to XSplitOffset.
767	It seems that the obsolete chains are not deallocated from the `AllChains` vector. Does this raise any memory concerns?
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2	The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
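The `updateIfLessThan` helper suggested in the comment on line 712 could look roughly like this. The class below is a hypothetical stand-in for `MergeGainTy`: the member names, the enum values, and the plain score comparison are assumptions for illustration, and the default score of -1 follows the "negative means invalid" convention discussed later in this thread.

```cpp
#include <cstddef>

// Assumed merge types, for illustration only.
enum class MergeTypeTy { X_Y, X1_Y_X2, Y_X2_X1, X2_X1_Y };

// Hypothetical stand-in for the MergeGainTy class discussed in the review.
class MergeGainTy {
public:
  MergeGainTy() = default;
  MergeGainTy(double Score, std::size_t MergeOffset, MergeTypeTy MergeType)
      : Score(Score), MergeOffset(MergeOffset), MergeType(MergeType) {}

  double score() const { return Score; }
  std::size_t mergeOffset() const { return MergeOffset; }
  MergeTypeTy mergeType() const { return MergeType; }

  // Gains are ordered by score only (an assumption of this sketch).
  bool operator<(const MergeGainTy &Other) const {
    return Score < Other.Score;
  }

  // Keeps the better of the current gain and OtherGain.
  void updateIfLessThan(const MergeGainTy &OtherGain) {
    if (*this < OtherGain)
      *this = OtherGain;
  }

private:
  double Score = -1.0; // Negative score: no valid merge found yet.
  std::size_t MergeOffset = 0;
  MergeTypeTy MergeType = MergeTypeTy::X_Y;
};
```

A caller would start from a default-constructed gain and fold every candidate into it, so that `computeMergeGain` can simply return the gain it computes.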

davidxl added inline comments. Nov 9 2021, 11:01 AM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
575	Perhaps make the name clearer: createCFGChainExtTSP()?
3410	Why do the extTSP layout after the existing layout? It seems like a waste of compile time. Or is the intention to keep the taildup functionality there? However, the taildup decisions depend on layout decisions, so it is probably better integrated with extTSP. Also, taildup's profile update may not be ideal, which would affect the later layout score computation.
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2	The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
14	We can reason that, for a non-topological order to be profitable, the weight of the edge to the merge block must be more than double the weight of the side branch of the triangular CFG. In this case it is 100 > 2 * 40. Can you add a negative case that keeps the topological order?
29	Fix typo
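The "double the weight" reasoning for the triangular CFG can be checked with a small sketch. It assumes blocks A = 0, B = 1 (side branch) and C = 2 (merge block), and it counts only fall-through weight, ignoring the smaller contribution that non-fall-through jumps receive in the ExtTSP score. The types and the function name are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical jump record: source block, target block, execution count.
struct Jump {
  int Src;
  int Dst;
  uint64_t Count;
};

// Total count of jumps whose target directly follows their source in Order.
uint64_t fallthroughCount(const std::vector<int> &Order,
                          const std::vector<Jump> &Jumps) {
  uint64_t Total = 0;
  for (std::size_t I = 0; I + 1 < Order.size(); ++I)
    for (const Jump &J : Jumps)
      if (J.Src == Order[I] && J.Dst == Order[I + 1])
        Total += J.Count;
  return Total;
}
```

With A->C = 100 and A->B = B->C = 40, the topological order [A, B, C] scores 40 + 40 = 80 while [A, C, B] scores 100, so the non-topological order wins. Lowering A->C to 70 gives 70 < 80, which is the kind of negative case that keeps the topological order.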

Addressing Rahman's and David's comments:

applied (most of) the suggested renamings;
added comments and clarifications;
introduced Jump class instead of pair<pair<Block, Block>, ExecutionCount>;
added new tests;
introduced params for alg options.

Re memory utilization:

I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.

To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.

Re perf tests on IR/CSIR:

Here are my measurements for clang-10 binary with different PGO modes. Of course we should keep in mind that the numbers won't fully generalize to other binaries/benchmarks and only provide one data point.

Comparing clang-10 using 200 iterations on input1.cpp

[test] -mllvm -enable-ext-tsp-block-placement=1
[control] -mllvm -enable-ext-tsp-block-placement=0
-DLLVM_BUILD_INSTRUMENTED=FE
  task-clock      :         5942 (+- 0.19%) vs         6027 (+- 0.19%) [diff = -1.4170%]
  binary_size     :         3316628472      vs         3325095000      [diff = -0.2546%]
-DLLVM_BUILD_INSTRUMENTED=IR
  task-clock      :         5793 (+- 0.20%) vs         5876 (+- 0.20%) [diff = -1.4050%]
  binary_size     :         3305880848      vs         3325234728      [diff = -0.5820%]
-DLLVM_BUILD_INSTRUMENTED=CSIR
  task-clock      :         5705 (+- 0.26%) vs         5757 (+- 0.26%) [diff = -0.8985%]
  binary_size     :         3312445120      vs         3329071760      [diff = -0.4994%]
CSSPGO
  task-clock      :         6110 (+- 0.37%) vs         6226 (+- 0.37%) [diff = -1.8617%]
  binary_size     :         3995536616      vs         3980411480      [diff = +0.3799%]
AutoFDO
  task-clock      :         5711 (+- 0.19%) vs         5797 (+- 0.19%) [diff = -1.4882%]
  binary_size     :         3707959712      vs         3694537688      [diff = +0.3632%]

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410	I don't have a strong opinion here but here is my intuition. Ext-tsp is designed to "improve" the existing layout with the help of profile data. When there are ties or when the profile data is absent, the algorithm falls back to the original ordering of basic blocks. Thus, we want the original order to be reasonable. So we first apply the existing machine block placement, which applies several sane tricks like loop rotation, and then improve it with ext-tsp. Keeping the existing tail duplication also sounds reasonable to me, though this optimization could be applied separately. Btw, we are working on improving and replacing the existing taildup heuristic; I'll share diffs/prototypes if/when it produces good results.
llvm/lib/Transforms/Utils/CodeLayout.cpp
235	I think this was done for optimizing the performance. The tradeoff here is finding an adjacent edge (via getEdge(Chain *Other)) vs iterating over and appending the edges. (The former is done only once, when the two chains are merged). Empirically, std::vector+linear search was faster than std::unordered_map on my benchmarks by 10%-15%. Probably an alternative map implementation may work better; not sure if that's worth it though.
635	I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. Not sure if the separation would noticeably simplify reviewing. I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they were tuned a few years ago on a couple of workloads running on particular hardware. I would really appreciate the community's help. For now, however, I'd keep the params and optimizations as is, so that the existing customers do not see sudden regressions.
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2	Added some (possibly naive) test. Let me know if you have something more complex in mind.
2	I added some toy test, let me know if you have something more complex in mind.

Harbormaster completed remote builds in B134531: Diff 387646. Nov 16 2021, 8:48 AM

davidxl added inline comments. Nov 18 2021, 12:35 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410	When you say extTSP relies on existing layout to be reasonable, do you have performance data to back it up? If the profile is not available, it is likely they are cold. Also extTSP should be able to do 'loop rotation' by splitting during chain merge, are there cases it is not handled?

changes:

a better way of estimating block sizes in the binary;
a small correction to how branches are reconstructed after reordering.

spupyrev added inline comments. Nov 18 2021, 1:28 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410	Running ext-tsp as a postprocessing is a relatively minor "optimization" (that's why I said "I don't have strong opinion here..."). If you think we should run it instead of the existing algorithm, i am happy to do so. It is mostly to avoid some corner cases, e.g.: A user supplies an incomplete (or even empty) sampling-based profile, which only covers a part of hot functions. In that case ext-tsp doesn't modify the layout but the existing MachineBlockPlacement can apply some tricks. For instr-PGO, that's not an issue. Another scenario is when there are "ties", e.g., block orders with equal objective. For example, a function with 3 blocks and 2 equal-weight jumps, A->B, A->C. In that case, ext-tsp cannot decide between [A, B, C] and [A, C, B], and we probably want to use some other heuristics. I am not aware of "simple" cases when ext-tsp produces sub-optimal result; it will rotate loops, if that's dictated by profile counts. However, it is a heuristic for an np-hard problem so we should not expect it to work optimally in 100% of cases.

Harbormaster completed remote builds in B134972: Diff 388301. Nov 18 2021, 1:55 PM

In D113424#3135015, @spupyrev wrote:

Re memory utilization:

I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.

To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.

Please see my inline comments. Since we keep all the Chains in the vector (they're not destructed), space usage is currently super-linear. We have a choice of deallocating the space (using shrink_to_fit, resize, or swap) and that's a time-space tradeoff. I am not sure which one is better, since this is already super-linear time, it doesn't hurt to make those calls and keep the space linear. Also, you can use llc on a large function to compare memory usage with and without -enable-ext-tsp-block-placement. It might very well be the case that the memory consumption remains the same. In that case, maybe you don't want to free up the storage either.
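The time-space tradeoff mentioned here comes from the fact that `std::vector::clear()` destroys the elements but keeps the allocated capacity. A minimal sketch of the swap idiom for actually releasing the storage (the helper name is hypothetical):

```cpp
#include <vector>

// Releases the heap storage of a vector by swapping it with an empty
// temporary; the old buffer is freed when the temporary is destroyed.
template <typename T> void releaseStorage(std::vector<T> &V) {
  std::vector<T>().swap(V);
}
```

Calling `clear()` followed by `shrink_to_fit()` is an alternative, although `shrink_to_fit()` is only a non-binding request to the implementation.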

Re perf tests on IR/CSIR:

Comparing clang-10 using 200 iterations on input1.cpp

[test] -mllvm -enable-ext-tsp-block-placement=1
[control] -mllvm -enable-ext-tsp-block-placement=0
-DLLVM_BUILD_INSTRUMENTED=FE
  task-clock      :         5942 (+- 0.19%) vs         6027 (+- 0.19%) [diff = -1.4170%]
  binary_size     :         3316628472      vs         3325095000      [diff = -0.2546%]
-DLLVM_BUILD_INSTRUMENTED=IR
  task-clock      :         5793 (+- 0.20%) vs         5876 (+- 0.20%) [diff = -1.4050%]
  binary_size     :         3305880848      vs         3325234728      [diff = -0.5820%]
-DLLVM_BUILD_INSTRUMENTED=CSIR
  task-clock      :         5705 (+- 0.26%) vs         5757 (+- 0.26%) [diff = -0.8985%]
  binary_size     :         3312445120      vs         3329071760      [diff = -0.4994%]
CSSPGO
  task-clock      :         6110 (+- 0.37%) vs         6226 (+- 0.37%) [diff = -1.8617%]
  binary_size     :         3995536616      vs         3980411480      [diff = +0.3799%]
AutoFDO
  task-clock      :         5711 (+- 0.19%) vs         5797 (+- 0.19%) [diff = -1.4882%]
  binary_size     :         3707959712      vs         3694537688      [diff = +0.3632%]

Thanks for sharing the detailed results.

llvm/lib/Transforms/Utils/CodeLayout.cpp
128	This can be removed if we initialize with zero.
138	Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.
263–264	Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).
305	Same comment. Storage won't be released.
446
635	Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero. One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like `PredChain->blocks[Offset]->ForcedSucc != nullptr`) and calls `computeMergeGain` for each merge type. I think we can make this function more readable and concise as well.
660	Please rename this to GetBestMergeGain for consistency with the function comment and to further separate it from computeMergeGain.
684	I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?
839	Change comment to clearly mention this is for putting the entry block at the beginning.
841–844	We can simply use `return C1->isEntry()`?
848–853	How about shortening this to `return (D1 != D2) ? (D1 > D2) : (C1->id() < C2->id())`?
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2	Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).
    foo
    | \
    |  \10
    |   \
    |    v
    |17  foo.bb1
    |   /
    |  / 10
    v v
    foo.bb2
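The comparator simplifications suggested for lines 841–844 and 848–853 could be combined as in the sketch below. `ChainInfo` and `sortChains` are hypothetical names used for illustration; the ordering (entry chain first, then decreasing density, then increasing id) is taken from the comments above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical summary of a chain; not the Chain class from CodeLayout.cpp.
struct ChainInfo {
  uint64_t Id;
  bool IsEntry;
  double Density;
};

// Orders chains so that the entry chain comes first, followed by the
// remaining chains in decreasing density; ties are broken by chain id.
void sortChains(std::vector<ChainInfo> &Chains) {
  std::stable_sort(Chains.begin(), Chains.end(),
                   [](const ChainInfo &C1, const ChainInfo &C2) {
                     // Put the chain with the entry block at the beginning.
                     if (C1.IsEntry != C2.IsEntry)
                       return C1.IsEntry;
                     return (C1.Density != C2.Density)
                                ? (C1.Density > C2.Density)
                                : (C1.Id < C2.Id);
                   });
}
```

Breaking ties by id keeps the result deterministic even when several chains have the same density.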

addressing comments

llvm/lib/Transforms/Utils/CodeLayout.cpp
138	I feel like a negative score reflects the intention here a bit better: it indicates that the merge is disadvantageous for the quality or simply "invalid". A score of "0" means that the merge is neutral for perf, so it is up to the algorithm to decide whether such a merge needs to be done. At some point we were experimenting with merging of 0-score chains (e.g., two chains connected by a cold jump). It might be beneficial for certain cases, e.g., for code size. I guess we can play with this more in the future.
263–264	I don't think that makes any measurable difference, as we're working with relatively small instances. But explicitly freeing up memory won't hurt, i guess.
635	Added an option to disable this extra optimization. Not sure I understand the proposal here.
684	Yes, it is not giving much benefit. I checked a binary with ~100K instances: Adding this type of split makes a difference for 4 CFGs but it is responsible for ~25% of splits. Keeping the split probably isn't worth it.
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
2	Added this test (func4)

Harbormaster completed remote builds in B135169: Diff 388575. Nov 19 2021, 12:49 PM

spupyrev added inline comments. Nov 19 2021, 3:36 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410	I just tried to measure the impact of changing the algorithm from being a post-processing step to replacing the existing approach (buildCFGChains). Here are the results:
    benchmark1: task-clock [delta: 24.27 ± 18.78, delta(%): 0.4010 ± 0.3103, p-value: 0.103827]
    benchmark2: task-clock [delta: 54.77 ± 20.91, delta(%): 0.9070 ± 0.3463, p-value: 0.000044]
    benchmark3: task-clock [delta: 2.43 ± 20.20, delta(%): 0.0279 ± 0.2314, p-value: 0.680156]
So for two benchmarks, the difference is not statistically significant; for one, there is a ~0.9% regression. I feel like we should keep the implementation as is.

Would it help to add remarks in a later diff to get some insights into the algorithm?
https://llvm.org/docs/Remarks.html

davidxl added inline comments. Nov 20 2021, 2:32 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3410	Ok, we can keep ext-tsp as a cleanup pass.
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
304	This might be a bug in the existing layout. By design, with profile data, in order to take b2 as the layout successor, the incoming edge weight needs to be > 2*10 = 20.
llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗	(On Diff #388575)	Add a cfg with ascii art.
11 ↗	(On Diff #388575)	Explain a little the expected output.
29 ↗	(On Diff #388575)	Add brief explanation of expected output.

hoy added inline comments. Nov 22 2021, 9:14 AM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573	The object is created but not used anywhere. Is that expected?
llvm/lib/Transforms/Utils/CodeLayout.cpp
111	nit: please use enum class and add comment for each enum value.
906	This is an assert-only loop. Put it under `#ifndef NDEBUG` ?

rahmanl added inline comments. Nov 22 2021, 2:36 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
138	I think we would never merge 0-score chains. If there is any additional benefit (code size) it can be incorporated into the score. But I am OK with this.
635	You can use something like the following function:
    auto ComputeMergeGainWithOffsetAndTypes =
        [&](size_t Offset, const std::vector<MergeTypeTy> &merge_types) {
          if (Offset == 0 || Offset == ChainPred->blocks().size())
            return;
          auto BB = ChainPred->blocks()[Offset - 1];
          // Skip the splitting if it breaks FT successors
          if (BB->ForcedSucc != nullptr)
            return;
          for (auto &MergeType : merge_types)
            Gain.updateIfLessThan(computeMergeGain(ChainPred, ChainSucc, Jumps, Offset, MergeType));
        };
684	Great. Please add a comment to explain why this splitting is not exercised.
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
5	This is missing the ascii cfg.
306	Please use the same names as in the CFG so this can be easily mapped. And please do this across all tests.
llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
1 ↗	(On Diff #388575)	I am not an expert on this, but I feel like this is not needed. Does it make a difference?

addressing comments:

adjusted tests;
enum class;
minor refactorings as suggested.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573	What object do you refer to?
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
304	This test uses two different modes of ext-tsp, with and without "chain splitting" -- something that Rahman suggested. The "old" layout in MachineBlockPlacement isn't applied here.
llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗	(On Diff #388575)	The purpose of the test is to verify that chain splitting code is executed and that it is beneficial for the layout, that is, the optimized score is increased. I added the explanations. Since the graph is large and randomly generated, the ascii representation is not very readable. In fact, it is not providing any insights, so let's skip it?

davidxl added inline comments. Nov 23 2021, 9:53 AM

llvm/lib/Transforms/Utils/CodeLayout.cpp
86	The paper documents 6 types of branches and their weight K_{s,t}. Should these be parameterized here?
98	The tsp score does not seem to be continuous here. When Dist == 0, it should have the same score as the fall through case, but this produces 0.1 * Count.
640	It is confusing to have both this method and calcExtTSPScore methods.
927	From the initialization, Order[Idx] == Idx, so is this map needed?
936	typo 'Increasing'
938	why is the count not used?
940	This function computes the 'fallthrough' maximizing TSP score, not the extTSP in the original paper. Where is the h function and the K coefficients?
948	Can the reserve be combined with the vector declaration, or the intention is to avoid initialization?
llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll
9 ↗	(On Diff #388575)	ok. I was expecting to look into the test case and reason that improved score corresponds to improved layout. Looks like this is just a test to show increased score. Can this be done instead with any other tests (the score should be increased monotonically) ?

David's comments (except for the large test)

llvm/lib/Transforms/Utils/CodeLayout.cpp
98	Agreed, that's something my colleagues and I have been thinking about lately. The current function is tuned for the maximum performance of the resulting binary. It is not guaranteed/enforced that the function is continuous. That said, I'd be super happy to see and review improvements of the objective in the future. There is likely room for optimizations here.
927	This function is also called when Order != Identity
940	oops. Refactored function to compute correct result

changed "large" test into a human-readable one, added ascii for the CFG

davidxl added inline comments. Nov 23 2021, 2:10 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
98	I guessed that with ForwardWeight == 0.1, a bias can be introduced against laying out the larger successor as the fall through block, but it is not the case. Using a different forward weight seems to produce the same decision for a simple diamond cfg case -- in order for one successor to be a fallthrough, the branch probability needs to be > 50%. I wonder what role this weight plays in practice (performance)?
927	Probably provide another wrapper to compute the mapping and pass it in. The map can use the vector type, so for the case when Order == Identity, Order can be used for both.

spupyrev added inline comments. Nov 23 2021, 2:27 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
98	I am not sure I understand the question, could you clarify? The weights affect the resulting layout, which in turn may impact perf. However, changing the weights from 0.1 to say 0.2 likely won't make a difference, except for some corner cases. IIRC, in my tests 3 years ago it was important to keep `ForwardWeight<FallthroughWeight=1.0`. Intuitively this makes sense, though the exact values may need to be re-tuned. (This is a separate and time-consuming project)

davidxl added inline comments. Nov 23 2021, 3:23 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
98	Your answer touched upon what I was asking. To clarify the question: are there performance data to show why 0.1 (instead of other values) is used. ExtTSP score can be considered as a locality score, and the forwardWeight is an attempt to model the cost of taken branch instruction. I wonder if there is a better way to model this in the objective function. No need to address it in this patch though.

Harbormaster completed remote builds in B135707: Diff 389299. Nov 23 2021, 4:12 PM

davidxl added inline comments. Nov 24 2021, 12:22 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3459	Is it better to use a vector type, as the index space is dense?
3461	vector type?
3474	Why not compute the actual byte size of the block?
3480	Is a multi-graph assumed? Why?
3494	Can the following (computing NewBlockOrder) be folded into applyExtTspLayout? The NewOrder is not used anywhere else, so no need to be exposed.
3531	Is NewIndex the same as BlockIndex computed in applyExtTsp? Why recompute it?

changed DenseMap => std::vector
got rid of multi-edges

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3474	That would be great! Do you mind pointing to the right way of implementing this? I was not able to find anything meaningful in MachineBasicBlock.
3494	I do not fully understand the suggestion. The goal here is to make applyExtTspLayout work with any type of nodes, not just MachineBasicBlock (potentially we want to apply it for function sorting). I tried creating a generic implementation with templates (and thus "hiding" NewOrder) but it turned out to be much messier than what it is now.
3531	No, this is a new index after reordering.

Harbormaster completed remote builds in B136953: Diff 391069. Dec 1 2021, 11:25 AM

davidxl added inline comments. Dec 1 2021, 3:30 PM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3474	This is target dependent unfortunately -- an abstract interface can be added in MCCodeEmitter. The implementation should be similar to encodeInstruction. This is beyond the scope of the patch, so perhaps add a TODO in the comment.

adding a note on computing block sizes exactly

LGTM, please also wait for other reviewers lgtm.

This revision is now accepted and ready to land. Dec 2 2021, 10:31 AM

adding 'mtriple' back to test options, as MachineBlockPlacement isn't deterministic between different platforms (so tests fail on Windows)

LGTM as well.

Harbormaster completed remote builds in B137186: Diff 391395. Dec 2 2021, 11:36 AM

lgtm except for some nits. Thanks.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3573	nvm, the object pointed by `FunctionChain` is used in `FunctionChain->merge`.
llvm/lib/Transforms/Utils/CodeLayout.cpp
61	nit: the distance in use isn't really in bytes? Is it the number of MIR instructions?
274	nit: we usually use `uint64_t` for id type explicitly.
898	nit: no need to use #ifndef NDEBUG

changed type of Chain.Id to uint64

llvm/lib/Transforms/Utils/CodeLayout.cpp
61	As discussed in https://reviews.llvm.org/D113424/new/#inline-1097352, we use block sizes in bytes, approximating it by num_mir_instr * 4. Equivalently, we could similarly use the number of instructions in both places (by scaling the constants down by 4) but I don't really see a difference

Harbormaster completed remote builds in B137441: Diff 391747. Dec 3 2021, 3:43 PM

Closed by commit rGc68f71eb37c2: ext-tsp basic block layout (authored by spupyrev). Dec 6 2021, 8:58 AM

This revision was automatically updated to reflect the committed changes.

spupyrev added a commit: rGc68f71eb37c2: ext-tsp basic block layout.

thakis mentioned this in D115139: [Coroutines] Handle CallBrInst in SalvageDebugInfo. Dec 6 2021, 3:56 PM

Looks like this broke CodeGen/X86/code_placement_ext_tsp.ll on arm macs:

http://45.33.8.238/macm1/23133/step_11.txt

Please take a look, and revert for now if it takes a while to fix.

Also here https://lab.llvm.org/buildbot/#/builders/186/builds/2803 which should've sent email long ago. Reverting...

thakis added a reverting change: rG3678326d2839: Revert "ext-tsp basic block layout". Dec 6 2021, 4:10 PM

spupyrev reopened this revision. Dec 6 2021, 4:27 PM

This revision is now accepted and ready to land. Dec 6 2021, 4:27 PM

making the tests deterministic

Harbormaster completed remote builds in B137783: Diff 392227. Dec 6 2021, 5:20 PM

Closed by commit rGf573f6866e18: ext-tsp basic block layout (authored by spupyrev). Dec 7 2021, 7:31 AM

This revision was automatically updated to reflect the committed changes.

spupyrev added a commit: rGf573f6866e18: ext-tsp basic block layout.

spupyrev mentioned this in D115255: fixing a broken ext-tsp test. Dec 7 2021, 8:26 AM

MaskRay added a subscriber: MaskRay. Sep 17 2023, 1:10 AM

MaskRay added inline comments.

llvm/lib/Transforms/Utils/CodeLayout.cpp
717	This should be `Jump->Target`. See https://github.com/llvm/llvm-project/pull/66592

Herald added a project: Restricted Project. Sep 17 2023, 1:10 AM

Herald added a subscriber: wlei.

kosarev added a subscriber: kosarev. Oct 9 2023, 8:10 AM

kosarev added inline comments.

llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406	This seems to fail on expensive checks, see https://github.com/llvm/llvm-project/issues/68594. Do `{Begin,End}{2,3}` really form valid ranges when default to `BlockIter()` in the constructor?

spupyrev added inline comments. · Oct 10 2023, 3:04 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406	Can you show me how to reproduce the failure? I'm building with "-DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON" and running "ninja check-all", but I still don't see it.

kosarev added inline comments. · Oct 11 2023, 12:39 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406	It fails on libstdc++'s debug-mode consistency checks, which get enabled automatically on builds with LLVM's expensive checks turned on, `-DLLVM_ENABLE_EXPENSIVE_CHECKS=ON`.

spupyrev added inline comments. · Oct 11 2023, 3:56 PM

llvm/lib/Transforms/Utils/CodeLayout.cpp

405–406

hmm. I still cannot reproduce this. Here is my command:

cmake \
    -G Ninja \
    -DCMAKE_ASM_COMPILER=$MYCLANG \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_CXX_COMPILER=$MYCLANG++ \
    -DCMAKE_C_COMPILER=$MYCLANG \
    -DLLVM_TARGETS_TO_BUILD="X86;AArch64" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld;bolt" \
    -DCMAKE_EXE_LINKER_FLAGS="-L /usr/lib64" \
    -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON \
    ../llvm

     ninja check-all

Is something else missing?

kosarev added inline comments. · Oct 12 2023, 7:30 AM

llvm/lib/Transforms/Utils/CodeLayout.cpp

405–406

Looks OK to me. I'd make sure the code is rebuilt from scratch and _GLIBCXX_DEBUG is defined, then look into why libstdc++ doesn't do the checks. I'm attaching a backtrace of one of the failures, in case it is of any help.

Regardless, I think the nature of the problem is very clear: when the Begin1/2 and End1/2 iterators are singular, they should not be used to form any ranges. Once we have a fix, I'd be happy to test it for you.

[ RUN      ] CodeLayout.ThreeFunctions
/usr/include/c++/11/debug/vector:617:
In function:
    std::__debug::vector<_Tp, _Allocator>::iterator std::__debug::vector<_Tp,
    _Allocator>::insert(std::__debug::vector<_Tp, _Allocator>::const_iterator,
    _InputIterator, _InputIterator) [with _InputIterator =
    __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<{anonymous}::NodeT*
    const*, std::vector<{anonymous}::NodeT*,
    std::allocator<{anonymous}::NodeT*> > >, std::
    __debug::vector<{anonymous}::NodeT*>, std::random_access_iterator_tag>;
    <template-parameter-2-2> = void; _Tp = {anonymous}::NodeT*; _Allocator =
    std::allocator<{anonymous}::NodeT*>; std::__debug::vector<_Tp,
    _Allocator>::iterator = std::
    __debug::vector<{anonymous}::NodeT*>::iterator; std::__debug::vector<_Tp,
    _Allocator>::const_iterator = std::
    __debug::vector<{anonymous}::NodeT*>::const_iterator]

Error: function requires a valid iterator range [first, last).

Objects involved in the operation:
    iterator "first" @ 0x7ffe7dbd1c40 {
      state = singular;
    }
    iterator "last" @ 0x7ffe7dbd1c70 {
      state = singular;
    }
 #0 0x00007f5c10bc3b02 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:22
 #1 0x00007f5c10bc3f1e PrintStackTraceSignalHandler(void*) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:798:1
 #2 0x00007f5c10bc1363 llvm::sys::RunSignalHandlers() /home/kosarev/labs/llvm-project/llvm/lib/Support/Signals.cpp:105:20
 #3 0x00007f5c10bc339a SignalHandler(int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #4 0x00007f5c10447520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f5c1049b9fc __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #6 0x00007f5c1049b9fc __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #7 0x00007f5c1049b9fc pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #8 0x00007f5c10447476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007f5c1042d7f3 abort ./stdlib/./stdlib/abort.c:81:7
#10 0x00007f5c106d21fb std::__throw_bad_exception() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa51fb)
#11 0x00007f5c12dd8740 __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT**, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag> std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >::insert<__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, void>(__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>) /usr/include/c++/11/debug/vector:617:4
#12 0x00007f5c12dcee70 (anonymous namespace)::MergedChain::getNodes() const /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:504:18
#13 0x00007f5c12dd5141 (anonymous namespace)::CDSortImpl::mergeChains((anonymous namespace)::ChainT*, (anonymous namespace)::ChainT*, unsigned long, (anonymous namespace)::MergeTypeT) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1288:16
#14 0x00007f5c12dd4211 (anonymous namespace)::CDSortImpl::mergeChainPairs() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1144:50
#15 0x00007f5c12dd302c (anonymous namespace)::CDSortImpl::run() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1008:5
#16 0x00007f5c12dd615c llvm::codelayout::computeCacheDirectedLayout(llvm::codelayout::CDSortConfig const&, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1435:3
#17 0x00007f5c12dd6333 llvm::codelayout::computeCacheDirectedLayout(llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1453:48
#18 0x00005638fc0abb97 (anonymous namespace)::CodeLayout_ThreeFunctions_Test::TestBody() /home/kosarev/labs/llvm-project/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp:18:78
#19 0x00007f5c13e2ef1a void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2612:28
#20 0x00007f5c13e1f20f void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#21 0x00007f5c13df2bca testing::Test::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2694:30
#22 0x00007f5c13df350b testing::TestInfo::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2839:3
#23 0x00007f5c13df3eea testing::TestSuite::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:3017:35
#24 0x00007f5c13e01306 testing::internal::UnitTestImpl::RunAllTests() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5921:41
#25 0x00007f5c13e315fd bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2614:1
#26 0x00007f5c13e20b7a bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#27 0x00007f5c13dffc25 testing::UnitTest::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5487:14
#28 0x00007f5c13efde7f RUN_ALL_TESTS() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/include/gtest/gtest.h:2317:80
#29 0x00007f5c13efdd61 main /home/kosarev/labs/llvm-project/third-party/unittest/UnitTestMain/TestMain.cpp:55:24
#30 0x00007f5c1042ed90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#31 0x00007f5c1042ee40 call_init ./csu/../csu/libc-start.c:128:20
#32 0x00007f5c1042ee40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#33 0x00005638fc0423e5 _start (/home/kosarev/labs/llvm-project/build/debug+expensive_checks/unittests/Transforms/Utils/./UtilsTests+0x1f3e5)

spupyrev added inline comments. · Oct 12 2023, 11:18 AM

llvm/lib/Transforms/Utils/CodeLayout.cpp
405–406	Thanks for such a detailed analysis and for teaching me about singular iterators. https://github.com/llvm/llvm-project/pull/68917
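
The failure mode discussed above can be illustrated with a small standalone sketch. It is not the code from CodeLayout.cpp and not the fix in the linked pull request; `Slice` and `concatSlices` are hypothetical names introduced here. The idea is to represent each optional piece of a merged chain as a pointer plus a length, so an unused piece is an explicit empty slice rather than a pair of default-constructed (singular) iterators, and to skip empty pieces before forming any range.

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-in for one piece of a merged chain: a pointer plus a
// length. A default-constructed Slice is an explicit "empty" piece.
template <typename T> struct Slice {
  const T *Data = nullptr;
  std::size_t Size = 0;
};

// Concatenate up to three pieces. Unused pieces default to empty slices and
// are skipped, so no range is ever formed from an unset piece.
template <typename T>
std::vector<T> concatSlices(const Slice<T> &S1, const Slice<T> &S2 = {},
                            const Slice<T> &S3 = {}) {
  std::vector<T> Result;
  Result.reserve(S1.Size + S2.Size + S3.Size);
  for (const Slice<T> *S : {&S1, &S2, &S3}) {
    if (S->Size == 0)
      continue; // Nothing to append for an empty or unset piece.
    Result.insert(Result.end(), S->Data, S->Data + S->Size);
  }
  return Result;
}
```

With this shape, a two-piece merge simply omits the third argument, and a debug-mode standard library has no singular iterators to complain about.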

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Utils/CodeLayout.h	59 lines
llvm/lib/CodeGen/MachineBlockPlacement.cpp	154 lines
llvm/lib/Transforms/Utils/CMakeLists.txt	1 line
llvm/lib/Transforms/Utils/CodeLayout.cpp	905 lines
llvm/test/CodeGen/X86/code_placement_ext_tsp.ll	222 lines

Diff 385565

llvm/include/llvm/Transforms/Utils/CodeLayout.h

This file was added.

				//===- CodeLayout.h - Code layout/placement algorithms ---------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// Declares methods and data structures for code layout algorithms.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TRANSFORMS_UTILS_CODELAYOUT_H
				#define LLVM_TRANSFORMS_UTILS_CODELAYOUT_H

				#include "llvm/ADT/DenseMap.h"

				#include <vector>

				namespace llvm {

				class MachineBasicBlock;

				/// Find a layout of nodes (basic blocks) of a given CFG optimizing jump
				/// locality and thus processor I-cache utilization. This is achieved via
				/// increasing the number of fall-through jumps and co-locating frequently
				/// executed nodes together.
				/// The nodes are assumed to be indexed by integers from [0, |V|) so that the
				/// current order is the identity permutation.
				/// \p NodeSizes: The sizes of the nodes (in bytes).
				/// \p NodeCounts: The execution counts of the nodes in the profile.
				/// \p EdgeCounts: The execution counts of every edge (jump) in the profile. The
				/// map also defines the edges in CFG and should include 0-count edges.
				/// \returns The best block order found.
				std::vector<uint64_t> applyExtTspLayout(
				const DenseMap<uint64_t, uint64_t> &NodeSizes,
				const DenseMap<uint64_t, uint64_t> &NodeCounts,
				const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

				/// Estimate the "quality" of a given node order in CFG. The higher the score,
				/// the better the order is. The score is designed to reflect the locality of
				/// the given order, which is anti-correlated with the number of I-cache misses
				/// in a typical execution of the function.
				uint64_t calcExtTspScore(
				const std::vector<uint64_t> &Order,
				const DenseMap<uint64_t, uint64_t> &NodeSizes,
				const DenseMap<uint64_t, uint64_t> &NodeCounts,
				const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

				/// Estimate the "quality" of the current node order in CFG.
				uint64_t calcExtTspScore(
				const DenseMap<uint64_t, uint64_t> &NodeSizes,
				const DenseMap<uint64_t, uint64_t> &NodeCounts,
				const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

				} // end namespace llvm

				#endif // LLVM_TRANSFORMS_UTILS_CODELAYOUT_H
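
To make the notion of a layout "score" in the header comments concrete, here is a simplified standalone sketch. It is not the implementation of `calcExtTspScore`: the function name `sketchExtTspScore`, the use of `std::vector`/`std::map` instead of `DenseMap`, and the weights and distance windows are all assumptions chosen for illustration. It rewards fall-through edges fully and gives short forward and backward jumps a smaller, distance-discounted reward.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified locality score; not the LLVM implementation.
// All four constants below are illustrative assumptions.
double sketchExtTspScore(
    const std::vector<uint64_t> &Order,     // node ids in layout order
    const std::vector<uint64_t> &NodeSizes, // size in bytes, indexed by id
    const std::map<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts) {
  const double FallthroughWeight = 1.0;
  const double JumpWeight = 0.1;
  const double ForwardWindow = 1024.0; // bytes
  const double BackwardWindow = 640.0; // bytes

  // Compute the start address of every node under the given order.
  std::vector<uint64_t> Addr(NodeSizes.size(), 0);
  uint64_t Cur = 0;
  for (uint64_t Node : Order) {
    Addr[Node] = Cur;
    Cur += NodeSizes[Node];
  }

  double Score = 0.0;
  for (const auto &[Edge, Count] : EdgeCounts) {
    uint64_t SrcEnd = Addr[Edge.first] + NodeSizes[Edge.first];
    uint64_t Dst = Addr[Edge.second];
    if (Dst == SrcEnd) {
      // The target starts right where the source ends: a fall-through.
      Score += FallthroughWeight * Count;
    } else if (Dst > SrcEnd) {
      // Forward jump: reward decays linearly with the distance.
      double Dist = static_cast<double>(Dst - SrcEnd);
      if (Dist < ForwardWindow)
        Score += JumpWeight * (1.0 - Dist / ForwardWindow) * Count;
    } else {
      // Backward jump: same idea with a shorter window.
      double Dist = static_cast<double>(SrcEnd - Dst);
      if (Dist < BackwardWindow)
        Score += JumpWeight * (1.0 - Dist / BackwardWindow) * Count;
    }
  }
  return Score;
}
```

Under these assumptions, an order that turns the hottest edge into a fall-through scores higher than one that places a colder successor next, which is the property the real score is described as capturing. The sketch treats any target that starts exactly at the source's end address as a fall-through, so it assumes nodes have non-zero size.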

llvm/lib/CodeGen/MachineBlockPlacement.cpp

Show First 20 Lines • Show All 55 Lines • ▼ Show 20 Lines
#include "llvm/Support/BlockFrequency.h"		#include "llvm/Support/BlockFrequency.h"
#include "llvm/Support/BranchProbability.h"		#include "llvm/Support/BranchProbability.h"
#include "llvm/Support/CodeGen.h"		#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Compiler.h"		#include "llvm/Support/Compiler.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"
		#include "llvm/Transforms/Utils/CodeLayout.h"
#include <algorithm>		#include <algorithm>
#include <cassert>		#include <cassert>
#include <cstdint>		#include <cstdint>
#include <iterator>		#include <iterator>
#include <memory>		#include <memory>
#include <string>		#include <string>
#include <tuple>		#include <tuple>
#include <utility>		#include <utility>
#include <vector>		#include <vector>

		#include <sstream>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "block-placement"		#define DEBUG_TYPE "block-placement"

STATISTIC(NumCondBranches, "Number of conditional branches");		STATISTIC(NumCondBranches, "Number of conditional branches");
STATISTIC(NumUncondBranches, "Number of unconditional branches");		STATISTIC(NumUncondBranches, "Number of unconditional branches");
STATISTIC(CondBranchTakenFreq,		STATISTIC(CondBranchTakenFreq,
"Potential frequency of taking conditional branches");		"Potential frequency of taking conditional branches");
▲ Show 20 Lines • Show All 106 Lines • ▼ Show 20 Lines
// Heuristic for triangle chains.		// Heuristic for triangle chains.
static cl::opt<unsigned> TriangleChainCount(		static cl::opt<unsigned> TriangleChainCount(
"triangle-chain-count",		"triangle-chain-count",
cl::desc("Number of triangle-shaped-CFG's that need to be in a row for the "		cl::desc("Number of triangle-shaped-CFG's that need to be in a row for the "
"triangle tail duplication heuristic to kick in. 0 to disable."),		"triangle tail duplication heuristic to kick in. 0 to disable."),
cl::init(2),		cl::init(2),
cl::Hidden);		cl::Hidden);

		static cl::opt<bool> EnableExtTspBlockPlacement(
		"enable-ext-tsp-block-placement", cl::Hidden, cl::init(false),
		cl::desc("Enable machine block placement based on the ext-tsp model, "
		"optimizing I-cache utilization."));

namespace llvm {		namespace llvm {
extern cl::opt<unsigned> StaticLikelyProb;		extern cl::opt<unsigned> StaticLikelyProb;
extern cl::opt<unsigned> ProfileLikelyProb;		extern cl::opt<unsigned> ProfileLikelyProb;

// Internal option used to control BFI display only after MBP pass.		// Internal option used to control BFI display only after MBP pass.
// Defined in CodeGen/MachineBlockFrequencyInfo.cpp:		// Defined in CodeGen/MachineBlockFrequencyInfo.cpp:
// -view-block-layout-with-bfi=		// -view-block-layout-with-bfi=
extern cl::opt<GVDAGType> ViewBlockLayoutWithBFI;		extern cl::opt<GVDAGType> ViewBlockLayoutWithBFI;
▲ Show 20 Lines • Show All 348 Lines • ▼ Show 20 Lines	#endif
bool canTailDuplicateUnplacedPreds(		bool canTailDuplicateUnplacedPreds(
const MachineBasicBlock *BB, MachineBasicBlock *Succ,		const MachineBasicBlock *BB, MachineBasicBlock *Succ,
const BlockChain &Chain, const BlockFilterSet *BlockFilter);		const BlockChain &Chain, const BlockFilterSet *BlockFilter);

/// Find chains of triangles to tail-duplicate where a global analysis works,		/// Find chains of triangles to tail-duplicate where a global analysis works,
/// but a local analysis would not find them.		/// but a local analysis would not find them.
void precomputeTriangleChains();		void precomputeTriangleChains();

		/// Apply a post-processing step optimizing block placement.
		void applyExtTsp();

		/// Modify the existing block placement in the function and adjust all jumps.
		void assignBlockOrder(const std::vector<const MachineBasicBlock *> &NewOrder);

		/// Create a single CFG chain from the current block order.
		void createCFGChain();
		davidxl: Perhaps make the name clearer: createCFGChainExtTSP()?

public:		public:
static char ID; // Pass identification, replacement for typeid		static char ID; // Pass identification, replacement for typeid

MachineBlockPlacement() : MachineFunctionPass(ID) {		MachineBlockPlacement() : MachineFunctionPass(ID) {
initializeMachineBlockPlacementPass(*PassRegistry::getPassRegistry());		initializeMachineBlockPlacementPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &F) override;		bool runOnMachineFunction(MachineFunction &F) override;
▲ Show 20 Lines • Show All 2,814 Lines • ▼ Show 20 Lines	if (BF.OptimizeFunction(MF, TII, MF.getSubtarget().getRegisterInfo(), MLI,
// Must redo the post-dominator tree if blocks were changed.		// Must redo the post-dominator tree if blocks were changed.
if (MPDT)		if (MPDT)
MPDT->runOnMachineFunction(MF);		MPDT->runOnMachineFunction(MF);
ChainAllocator.DestroyAll();		ChainAllocator.DestroyAll();
buildCFGChains();		buildCFGChains();
}		}
}		}

		// Apply a post-processing optimizing block placement.
		if (MF.size() >= 3 && EnableExtTspBlockPlacement) {
		// Find a new placement and modify the layout of the blocks in the function.
		applyExtTsp();
		davidxl: Why do the extTSP layout after the existing layout? It seems like a waste of compile time. Or is the intention to keep the taildup functionality there? However, the tail dup decisions depend on layout decisions, so it is probably better integrated with extTSP. Also, taildup's profile update may not be ideal, affecting later layout score computation.
		spupyrev: I don't have a strong opinion here, but here is my intuition. Ext-tsp is designed to "improve" the existing layout with the help of profile data. When there are ties or when the profile data is absent, the algorithm falls back to the original ordering of basic blocks. Thus, we want the original order to be reasonable. So we first apply the existing machine block placement, which applies several sane tricks like loop rotation, and then improve it with ext-tsp. Keeping the existing tail duplication also sounds reasonable to me, though this optimization could be applied separately. Btw, we are working on improving and replacing the existing taildup heuristic; I'll share diffs/prototypes if/when it produces good results.
		davidxl: When you say extTSP relies on the existing layout to be reasonable, do you have performance data to back it up? If the profile is not available, it is likely they are cold. Also, extTSP should be able to do 'loop rotation' by splitting during chain merge; are there cases it does not handle?
		spupyrev: Running ext-tsp as a post-processing step is a relatively minor "optimization" (that's why I said "I don't have a strong opinion here..."). If you think we should run it instead of the existing algorithm, I am happy to do so. It is mostly to avoid some corner cases, e.g.: A user supplies an incomplete (or even empty) sampling-based profile, which only covers a part of the hot functions. In that case ext-tsp doesn't modify the layout, but the existing MachineBlockPlacement can apply some tricks. For instr-PGO, that's not an issue. Another scenario is when there are "ties", e.g., block orders with equal objective. For example, a function with 3 blocks and 2 equal-weight jumps, A->B, A->C. In that case, ext-tsp cannot decide between [A, B, C] and [A, C, B], and we probably want to use some other heuristics. I am not aware of "simple" cases when ext-tsp produces a sub-optimal result; it will rotate loops if that's dictated by profile counts. However, it is a heuristic for an NP-hard problem, so we should not expect it to work optimally in 100% of cases.
		spupyrev: I just tried to measure the impact of changing the algorithm from being a post-processing step to replacing the existing approach (buildCFGChains). Here are the results:
		benchmark1: task-clock [delta: 24.27 ± 18.78, delta(%): 0.4010 ± 0.3103, p-value: 0.103827]
		benchmark2: task-clock [delta: 54.77 ± 20.91, delta(%): 0.9070 ± 0.3463, p-value: 0.000044]
		benchmark3: task-clock [delta: 2.43 ± 20.20, delta(%): 0.0279 ± 0.2314, p-value: 0.680156]
		So for two benchmarks the difference is not statistically significant; for one, there is a ~0.9% regression. I feel like we should keep the implementation as is.
		davidxl: Ok, we can keep ext-tsp as a cleanup pass.

		// Re-create CFG chain so that we can optimizeBranches and alignBlocks.
		createCFGChain();
		}

optimizeBranches();		optimizeBranches();
alignBlocks();		alignBlocks();

BlockToChain.clear();		BlockToChain.clear();
ComputedEdges.clear();		ComputedEdges.clear();
ChainAllocator.DestroyAll();		ChainAllocator.DestroyAll();

if (AlignAllBlock)		if (AlignAllBlock)
Show All 10 Lines	else if (AlignAllNonFallThruBlocks) {
}		}
}		}
if (ViewBlockLayoutWithBFI != GVDT_None &&		if (ViewBlockLayoutWithBFI != GVDT_None &&
(ViewBlockFreqFuncName.empty() ||		(ViewBlockFreqFuncName.empty() ||
F->getFunction().getName().equals(ViewBlockFreqFuncName))) {		F->getFunction().getName().equals(ViewBlockFreqFuncName))) {
MBFI->view("MBP." + MF.getName(), false);		MBFI->view("MBP." + MF.getName(), false);
}		}


// We always return true as we have no way to track whether the final order		// We always return true as we have no way to track whether the final order
// differs from the original order.		// differs from the original order.
return true;		return true;
}		}

		void MachineBlockPlacement::applyExtTsp() {
		// Prepare data; blocks are indexed by their index in the current ordering.
		DenseMap<const MachineBasicBlock *, uint64_t> BlockIndex;
		BlockIndex.reserve(F->size());
		std::vector<const MachineBasicBlock *> CurrentBlockOrder;
		CurrentBlockOrder.reserve(F->size());
		size_t NumBlocks = 0;
		for (const MachineBasicBlock &MBB : *F) {
		BlockIndex[&MBB] = NumBlocks++;
		CurrentBlockOrder.push_back(&MBB);
		}

		DenseMap<uint64_t, uint64_t> BlockSizes;
		davidxl: Is it better to use a vector type, as the index space is dense?
		BlockSizes.reserve(F->size());
		DenseMap<uint64_t, uint64_t> BlockCounts;
		davidxl: vector type?
		BlockCounts.reserve(F->size());
		DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> JumpCounts;
		for (MachineBasicBlock &MBB : *F) {
		BlockFrequency BlockFreq = MBFI->getBlockFreq(&MBB);
		BlockSizes[BlockIndex[&MBB]] = MBB.size();
		rahmanl: This will give you the number of instructions, which is not exactly the binary size of the block.
		BlockCounts[BlockIndex[&MBB]] = BlockFreq.getFrequency();
		for (MachineBasicBlock *Succ : MBB.successors()) {
		auto EP = MBPI->getEdgeProbability(&MBB, Succ);
		BlockFrequency EdgeFreq = BlockFreq * EP;
		auto Edge = std::make_pair(BlockIndex[&MBB], BlockIndex[Succ]);
		JumpCounts[Edge] += EdgeFreq.getFrequency();
		}
		}
		davidxl: Why not compute the actual byte size of the block?
		spupyrev: That would be great! Do you mind pointing to the right way of implementing this? I was not able to find anything meaningful in MachineBasicBlock.
		davidxl: This is target dependent, unfortunately -- an abstract interface can be added in MCCodeEmitter. The implementation should be similar to encodeInstruction. This is beyond the scope of the patch, so perhaps add a TODO in the comment.

		LLVM_DEBUG(dbgs() << "Applying ext-tsp layout for |V| = " << F->size()
		<< " with profile = " << F->getFunction().hasProfileData()
		<< " (" << F->getName().str() << ")"
		<< "\n");
		LLVM_DEBUG(dbgs() << " original layout score: "
		davidxl: Is a multi-graph assumed? Why?
		<< calcExtTspScore(BlockSizes, BlockCounts, JumpCounts)
		<< "\n");

		// Run the layout algorithm.
		auto NewOrder = applyExtTspLayout(BlockSizes, BlockCounts, JumpCounts);
		std::vector<const MachineBasicBlock *> NewBlockOrder;
		NewBlockOrder.reserve(F->size());
		for (uint64_t Node : NewOrder) {
		NewBlockOrder.push_back(CurrentBlockOrder[Node]);
		}
		LLVM_DEBUG(
		dbgs() << " optimized layout score: "
		<< calcExtTspScore(NewOrder, BlockSizes, BlockCounts, JumpCounts)
		<< "\n");
		davidxl: Can the following (computing NewBlockOrder) be folded into applyExtTspLayout? The NewOrder is not used anywhere else, so there is no need to expose it.
		spupyrev: I do not fully understand the suggestion. The goal here is to make applyExtTspLayout work with any type of nodes, not just MachineBasicBlock (potentially we want to apply it to function sorting). I tried creating a generic implementation with templates (and thus "hiding" NewOrder), but it turned out to be much messier than what it is now.

		// Assign new block order.
		assignBlockOrder(NewBlockOrder);
		}

		void MachineBlockPlacement::assignBlockOrder(
		const std::vector<const MachineBasicBlock *> &NewBlockOrder) {
		assert(F->size() == NewBlockOrder.size() && "Incorrect size of block order");
		F->RenumberBlocks();

		bool HasChanges = false;
		for (size_t I = 0; I < NewBlockOrder.size(); I++) {
		if (NewBlockOrder[I] != F->getBlockNumbered(I)) {
		HasChanges = true;
		break;
		}
		}
		// Stop early if the new block order is identical to the existing one.
		if (!HasChanges)
		return;

		SmallVector<MachineBasicBlock *, 4> PrevFallThroughs(F->getNumBlockIDs());
		for (auto &MBB : *F) {
		PrevFallThroughs[MBB.getNumber()] = MBB.getFallThrough();
		}

		// Sort basic blocks in the function according to the computed order.
		DenseMap<const MachineBasicBlock *, size_t> NewIndex;
		for (const MachineBasicBlock *MBB : NewBlockOrder) {
		NewIndex[MBB] = NewIndex.size();
		}
		F->sort([&](MachineBasicBlock &L, MachineBasicBlock &R) {
		return NewIndex[&L] < NewIndex[&R];
		});

		// Update basic block branches by inserting explicit fallthrough branches
		// when required and re-optimize branches when possible.
		davidxl: Is NewIndex the same as BlockIndex computed in applyExtTsp? Why recompute it?
		spupyrev: No, this is a new index after reordering.
		const TargetInstrInfo *TII = F->getSubtarget().getInstrInfo();
		SmallVector<MachineOperand, 4> Cond;
		for (auto &MBB : *F) {
		MachineFunction::iterator NextMBB = std::next(MBB.getIterator());
		MachineFunction::iterator EndIt = MBB.getParent()->end();
		auto *FTMBB = PrevFallThroughs[MBB.getNumber()];
		// If this block had a fallthrough before we need an explicit unconditional
		// branch to that block if the fallthrough block is not adjacent to the
		// block in the new order.
		if (FTMBB && NextMBB != EndIt && &*NextMBB != FTMBB)
		TII->insertUnconditionalBranch(MBB, FTMBB, MBB.findBranchDebugLoc());

		// It might be possible to optimize branches by flipping the condition.
		Cond.clear();
		MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
		if (TII->analyzeBranch(MBB, TBB, FBB, Cond))
		continue;
		MBB.updateTerminator(FTMBB);
		}

		#ifndef NDEBUG
		// Make sure we correctly constructed all branches.
		F->verify(this, "After optimized block reordering");
		#endif
		}

		void MachineBlockPlacement::createCFGChain() {
		BlockToChain.clear();
		ComputedEdges.clear();
		ChainAllocator.DestroyAll();

		MachineBasicBlock *HeadBB = &F->front();
		BlockChain *FunctionChain =
		new (ChainAllocator.Allocate()) BlockChain(BlockToChain, HeadBB);

		for (MachineBasicBlock &MBB : *F) {
		if (HeadBB == &MBB)
		continue; // Ignore head of the chain
		FunctionChain->merge(&MBB, nullptr);
		}
		}

		hoy: The object is created but not used anywhere. Is that expected?
		spupyrev: What object do you refer to?
		hoy: nvm, the object pointed to by `FunctionChain` is used in `FunctionChain->merge`.
namespace {		namespace {

/// A pass to compute block placement statistics.		/// A pass to compute block placement statistics.
///		///
/// A separate pass to compute interesting statistics for evaluating block		/// A separate pass to compute interesting statistics for evaluating block
/// placement. This is separate from the actual placement pass so that they can		/// placement. This is separate from the actual placement pass so that they can
/// be computed in the absence of any placement transformations or when using		/// be computed in the absence of any placement transformations or when using
/// alternative placement strategies.		/// alternative placement strategies.
▲ Show 20 Lines • Show All 65 Lines • Show Last 20 Lines
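
The fall-through handling in assignBlockOrder above can be sketched without any LLVM types. This is an illustration under assumptions, not the patch's code: `blocksNeedingExplicitJump` is a hypothetical name, blocks are plain integer ids, and `-1` marks a block with no fall-through. It mirrors the condition used in the diff: a block needs an explicit unconditional branch if it previously fell through to some block, it is not the last block in the new order, and its new successor in the layout is a different block.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: PrevFallThrough[B] is the block that B fell through
// to in the old layout, or -1 if it had none. Returns the blocks that need
// an explicit jump once the blocks are laid out in NewOrder.
std::vector<std::size_t>
blocksNeedingExplicitJump(const std::vector<int> &PrevFallThrough,
                          const std::vector<std::size_t> &NewOrder) {
  std::vector<std::size_t> Result;
  for (std::size_t I = 0; I < NewOrder.size(); ++I) {
    std::size_t Block = NewOrder[I];
    int FT = PrevFallThrough[Block];
    if (FT < 0)
      continue; // The block never fell through; nothing to fix.
    // Mirror the diff's condition: the last block in the order is skipped.
    if (I + 1 == NewOrder.size())
      continue;
    if (NewOrder[I + 1] != static_cast<std::size_t>(FT))
      Result.push_back(Block);
  }
  return Result;
}
```

For an unchanged order the result is empty, which corresponds to the early exit in the patch when the new block order is identical to the existing one.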

llvm/lib/Transforms/Utils/CMakeLists.txt

add_llvm_component_library(LLVMTransformUtils
  AddDiscriminators.cpp
  AMDGPUEmitPrintf.cpp
  ASanStackFrameLayout.cpp
  AssumeBundleBuilder.cpp
  BasicBlockUtils.cpp
  BreakCriticalEdges.cpp
  BuildLibCalls.cpp
  BypassSlowDivision.cpp
  CallPromotionUtils.cpp
  CallGraphUpdater.cpp
  CanonicalizeAliases.cpp
  CanonicalizeFreezeInLoops.cpp
  CloneFunction.cpp
  CloneModule.cpp
  CodeExtractor.cpp
+ CodeLayout.cpp
  CodeMoverUtils.cpp
  CtorUtils.cpp
  Debugify.cpp
  DemoteRegToStack.cpp
  EntryExitInstrumenter.cpp
  EscapeEnumerator.cpp
  Evaluator.cpp
  FixIrreducible.cpp

llvm/lib/Transforms/Utils/CodeLayout.cpp

This file was added.

//===- CodeLayout.cpp - Implementation of code layout algorithms ----------===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// ExtTSP - layout of basic blocks with i-cache optimization.

// The algorithm tries to find a layout of nodes (basic blocks) of a given CFG

// optimizing jump locality and thus processor I-cache utilization. This is

// achieved via increasing the number of fall-through jumps and co-locating

// frequently executed nodes together. The name follows the underlying

// optimization problem, Extended-TSP, which is a generalization of classical

// (maximum) Traveling Salesman Problem.

// The algorithm is a greedy heuristic that works with chains (ordered lists)

// of basic blocks. Initially all chains are isolated basic blocks. On every

// iteration, we pick a pair of chains whose merging yields the biggest increase

// in the ExtTSP score, which models how i-cache "friendly" a specific chain is.

// A pair of chains giving the maximum gain is merged into a new chain. The

// procedure stops when there is only one chain left, or when merging does not

// increase ExtTSP. In the latter case, the remaining chains are sorted by

// density in decreasing order.

// An important aspect is the way two chains are merged. Unlike earlier

// algorithms (e.g., based on the approach of Pettis-Hansen), two

// chains, X and Y, are first split into three, X1, X2, and Y. Then we

// consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,

// X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.

// This improves the quality of the final result (the search space is larger)

// while keeping the implementation sufficiently fast.

// Reference:

// * A. Newell and S. Pupyrev, Improved Basic Block Reordering,

// IEEE Transactions on Computers, 2020

//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Utils/CodeLayout.h"

#include "llvm/Support/Debug.h"

using namespace llvm;

#define DEBUG_TYPE "code-layout"

namespace {

// Algorithm-specific constants. The values are tuned for the best performance

// of large-scale front-end bound binaries.

const double ForwardWeight = 0.1;

const double BackwardWeight = 0.1;

const size_t ForwardDistance = 1024;

const size_t BackwardDistance = 640;

// The maximum size of a chain for splitting. Larger values of the threshold

rahmanlUnsubmitted

Done

Please consider adding parameters for these. I think it will be useful for users to tune based on their architecture.
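
One way to act on this suggestion, as a minimal sketch: the names `ExtTSPParams` and `scoreWithParams` are hypothetical and not part of the patch, and the defaults simply mirror the hard-coded constants. In-tree, such knobs are commonly exposed through `cl::opt` command-line options; the plain struct here only keeps the sketch self-contained.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bundle of the tunable constants; the defaults mirror the
// hard-coded values in the patch.
struct ExtTSPParams {
  double ForwardWeight = 0.1;
  double BackwardWeight = 0.1;
  size_t ForwardDistance = 1024;
  size_t BackwardDistance = 640;
};

// Same logic as extTSPScore, but reading the constants from Params.
double scoreWithParams(const ExtTSPParams &Params, uint64_t SrcAddr,
                       uint64_t SrcSize, uint64_t DstAddr, uint64_t Count) {
  const uint64_t SrcEnd = SrcAddr + SrcSize;
  // Fallthrough: the destination starts right where the source ends.
  if (SrcEnd == DstAddr)
    return static_cast<double>(Count);
  // Forward jump: the score decays linearly with the distance.
  if (SrcEnd < DstAddr) {
    const uint64_t Dist = DstAddr - SrcEnd;
    if (Dist > Params.ForwardDistance)
      return 0.0;
    double Prob = 1.0 - static_cast<double>(Dist) / Params.ForwardDistance;
    return Params.ForwardWeight * Prob * Count;
  }
  // Backward jump: same shape, with its own weight and distance limit.
  const uint64_t Dist = SrcEnd - DstAddr;
  if (Dist > Params.BackwardDistance)
    return 0.0;
  double Prob = 1.0 - static_cast<double>(Dist) / Params.BackwardDistance;
  return Params.BackwardWeight * Prob * Count;
}
```

With a default-constructed `ExtTSPParams` the function is intended to behave like the fixed-constant version; a target with different front-end characteristics could pass its own values.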

// may yield better quality at the cost of worse run-time.

const size_t ChainSplitThreshold = 128;

// Epsilon for comparison of doubles.

constexpr double EPS = 1e-8;

class Block;

hoyUnsubmitted

Not Done

nit: the distance in use isn't really in bytes? Is it the number of MIR instructions?

spupyrevAuthorUnsubmitted

Done

As discussed in https://reviews.llvm.org/D113424/new/#inline-1097352, we use block sizes in bytes, approximating it by num_mir_instr * 4.

Equivalently, we could similarly use the number of instructions in both places (by scaling the constants down by 4) but I don't really see a difference
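
The equivalence mentioned here can be checked with a small sketch. `BytesPerInstr`, `approxBlockSize`, `probInBytes`, and `probInInstrs` are illustrative names, not identifiers from the patch; the factor 4 is the approximation stated in the comment, and 1024 is the `ForwardDistance` constant.

```cpp
#include <cstdint>

// Illustrative constant: each MIR instruction is approximated as 4 bytes.
constexpr uint64_t BytesPerInstr = 4;

// Approximate size of a block in bytes from its instruction count.
constexpr uint64_t approxBlockSize(uint64_t NumInstrs) {
  return NumInstrs * BytesPerInstr;
}

// Distance term of the score computed with byte distances and the
// original ForwardDistance of 1024.
double probInBytes(uint64_t DistInstrs) {
  return 1.0 - static_cast<double>(approxBlockSize(DistInstrs)) / 1024.0;
}

// The same term computed with instruction distances and the constant
// scaled down by 4 (1024 / 4 = 256).
double probInInstrs(uint64_t DistInstrs) {
  return 1.0 - static_cast<double>(DistInstrs) / 256.0;
}
```

Both functions return the same value for every distance, which is why scaling the constants and using instruction counts would not change the layout decisions.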

class Chain;

class Edge;

// Calculate the Ext-TSP score of a jump; a higher score means the jump is

// expected to be more i-cache friendly for the given placement of its blocks.

double extTSPScore(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr,

uint64_t Count) {

// Fallthrough

if (SrcAddr + SrcSize == DstAddr) {

// Assume that FallthroughWeight = 1.0 after normalization

return static_cast<double>(Count);

}

// Forward

if (SrcAddr + SrcSize < DstAddr) {

const auto Dist = DstAddr - (SrcAddr + SrcSize);

if (Dist <= ForwardDistance) {

double Prob = 1.0 - static_cast<double>(Dist) / ForwardDistance;

return ForwardWeight * Prob * Count;

}

return 0;

}

// Backward

const auto Dist = SrcAddr + SrcSize - DstAddr;

if (Dist <= BackwardDistance) {

double Prob = 1.0 - static_cast<double>(Dist) / BackwardDistance;

davidxlUnsubmitted

Not Done

The paper documents 6 types of branches and their weight K_{s,t}. Should these be parametized here?

return BackwardWeight * Prob * Count;

}

return 0;

}

using BlockPair = std::pair<Block *, Block *>;

using JumpList = std::vector<std::pair<BlockPair, uint64_t>>;

using BlockIter = std::vector<Block *>::const_iterator;

rahmanlUnsubmitted

Done

Please define a struct for the Jump (Src, Sink, Weight).
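
A sketch of what such a struct could look like. The names `Jump`, `Source`, `Sink`, `Weight`, and `totalWeight` are hypothetical, and `Block` is reduced to a stand-in with a single field so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for the Block class of the patch.
struct Block {
  size_t Index;
};

// Hypothetical replacement for std::pair<BlockPair, uint64_t>.
struct Jump {
  Block *Source;   // Block the jump originates from.
  Block *Sink;     // Block the jump lands in.
  uint64_t Weight; // Execution count of the jump.

  Jump(Block *Source, Block *Sink, uint64_t Weight)
      : Source(Source), Sink(Sink), Weight(Weight) {
    assert(Source && Sink && "a jump needs both endpoints");
  }
};

using JumpList = std::vector<Jump>;

// Sum the weights of all jumps in a list.
uint64_t totalWeight(const JumpList &Jumps) {
  uint64_t Total = 0;
  for (const Jump &J : Jumps)
    Total += J.Weight;
  return Total;
}
```

Named fields replace the `It.first.first` / `It.second` access pattern, which makes call sites easier to read.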

enum MergeTypeTy {

X_Y = 0,

X1_Y_X2 = 1,

davidxlUnsubmitted

Not Done

The tsp score does not seem to be contiguous here. When Dist == 0, it should have the same score as the fall through case, but this produces 0.1* Count.

spupyrevAuthorUnsubmitted

Done

Agree, that's something me and my colleagues have been thinking about lately.

The current function is tuned for the maximum performance of the resulting binary. It is not guaranteed/enforced that the function is contiguous. That said, i'd be super happy to see and review improvements of the objective in the future. There is likely a room for optimizations here
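
To make the gap under discussion concrete, `extTSPScore` can be restated as a piecewise formula. The symbols are introduced only for this restatement: $c$ is the jump count, $e = \text{SrcAddr} + \text{SrcSize}$ is the end of the source block, and $t = \text{DstAddr}$ is the start of the destination block.

$$
\text{score}(e, t, c) = c \cdot
\begin{cases}
1 & \text{if } t = e \quad \text{(fallthrough)} \\
0.1 \left(1 - \dfrac{t - e}{1024}\right) & \text{if } 0 < t - e \le 1024 \quad \text{(forward)} \\
0.1 \left(1 - \dfrac{e - t}{640}\right) & \text{if } 0 < e - t \le 640 \quad \text{(backward)} \\
0 & \text{otherwise}
\end{cases}
$$

As the distance shrinks toward zero, the forward and backward cases approach $0.1\,c$, whereas the fallthrough case yields $c$, so the function jumps by $0.9\,c$ at distance zero.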

davidxlUnsubmitted

Not Done

I guessed that with ForwardWeight == 0.1, a bias can be introduced against laying out a larger successor as the fallthrough block, but that is not the case. Using a different forward weight seems to produce the same decision for a simple diamond CFG case: in order for one successor to be the fallthrough, its branch probability needs to be > 50%.

I wonder what role this weight plays in practice (performance)?

spupyrevAuthorUnsubmitted

Done

I am not sure I understand the question, could you clarify?

The weights affect the resulting layout, which in turn may impact perf. However, changing the weights from 0.1 to say 0.2 likely won't make a difference, except for some corner cases.
IIRC, in my tests 3 years ago it was important to keep ForwardWeight<FallthroughWeight=1.0. Intuitively this makes sense, though the exact values may need to be re-tuned. (This is a separate and time-consuming project)

davidxlUnsubmitted

Not Done

Your answer touched upon what I was asking. To clarify the question: is there performance data showing why 0.1 (instead of other values) is used? The ExtTSP score can be considered a locality score, and ForwardWeight is an attempt to model the cost of a taken branch instruction. I wonder if there is a better way to model this in the objective function. No need to address it in this patch though.

Y_X2_X1 = 2,

X2_X1_Y = 3,

};

class MergeGainTy {

public:

explicit MergeGainTy() {}

explicit MergeGainTy(double Score, size_t MergeOffset, MergeTypeTy MergeType)

: Score(Score), MergeOffset(MergeOffset), MergeType(MergeType) {}

double score() const { return Score; }

size_t mergeOffset() const { return MergeOffset; }

hoyUnsubmitted

Done

nit: please use enum class and add comment for each enum value.
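
A sketch of the requested change. The per-value comments are inferred from the enumerator names and from the patch summary, where X1 and X2 are the two pieces obtained by splitting chain X; the helper `mergeOrder` and the use of plain `int` blocks are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scoped variant of MergeTypeTy. X and Y are the chains being merged;
// X1 and X2 are the prefix and suffix of X after splitting it at an offset.
enum class MergeTypeTy : int {
  X_Y = 0,     // X followed by Y, without splitting X.
  X1_Y_X2 = 1, // Prefix of X, then Y, then suffix of X.
  Y_X2_X1 = 2, // Y, then suffix of X, then prefix of X.
  X2_X1_Y = 3, // Suffix of X, then prefix of X, then Y.
};

// Build the merged block order for a merge type. Offset is the split
// point inside X; it is ignored for X_Y.
std::vector<int> mergeOrder(const std::vector<int> &X,
                            const std::vector<int> &Y, size_t Offset,
                            MergeTypeTy Type) {
  assert(Offset <= X.size() && "split offset out of range");
  const auto Split = X.begin() + static_cast<std::ptrdiff_t>(Offset);
  const std::vector<int> X1(X.begin(), Split);
  const std::vector<int> X2(Split, X.end());
  std::vector<int> Result;
  Result.reserve(X.size() + Y.size());
  auto Append = [&Result](const std::vector<int> &Part) {
    Result.insert(Result.end(), Part.begin(), Part.end());
  };
  switch (Type) {
  case MergeTypeTy::X_Y:
    Append(X);
    Append(Y);
    break;
  case MergeTypeTy::X1_Y_X2:
    Append(X1);
    Append(Y);
    Append(X2);
    break;
  case MergeTypeTy::Y_X2_X1:
    Append(Y);
    Append(X2);
    Append(X1);
    break;
  case MergeTypeTy::X2_X1_Y:
    Append(X2);
    Append(X1);
    Append(Y);
    break;
  }
  return Result;
}
```

With a scoped enum, the enumerators no longer leak into the enclosing namespace and cannot be converted to integers implicitly.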

MergeTypeTy mergeType() const { return MergeType; }

// Returns 'true' iff Other is preferred over this.

bool operator<(const MergeGainTy &Other) const {

return (Other.Score > EPS && Other.Score > Score + EPS);

}

private:

double Score{-1.0};

size_t MergeOffset{0};

MergeTypeTy MergeType{MergeTypeTy::X_Y};

};

/// A node in CFG corresponding to a basic block.

/// The class wraps several mutable fields utilized in the ExtTSP algorithm.

class Block {

rahmanlUnsubmitted

Done

This can be removed if we initialize with zero.

public:

Block(const Block &) = delete;

Block(Block &&) = default;

Block &operator=(const Block &) = delete;

Block &operator=(Block &&) = default;

// The original index of the node in CFG.

size_t Index{0};

// The index of the block in the current chain.

size_t CurIndex{0};

rahmanlUnsubmitted

Done

Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.

spupyrevAuthorUnsubmitted

Done

I feel that a negative score better reflects the intention here: it indicates that the merge is disadvantageous for the quality or simply "invalid". A score of "0" means that the merge is neutral for perf, so it is up to the algorithm to decide whether such a merge needs to be done.
At some point we were experimenting with merging 0-score chains (e.g., two chains connected by a cold jump). It might be beneficial in certain cases, e.g., for code size. I guess we can play with this more in the future.

rahmanlUnsubmitted

Done

I think we would never merge 0-score chains. If there is any additional benefit (code size) it can be incorporated into the score. But I am OK with this.
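
The behaviour discussed in this thread can be illustrated with a reduced version of the comparison. `Gain` and `pickBetter` are hypothetical stand-ins that keep only the score; the `operator<` body is the one from `MergeGainTy`.

```cpp
constexpr double EPS = 1e-8;

// Stripped-down stand-in for MergeGainTy that keeps only the score.
struct Gain {
  double Score = -1.0; // Negative by default: no valid merge found yet.

  // True iff Other is preferred over *this: Other must be strictly
  // positive and strictly better, both up to EPS.
  bool operator<(const Gain &Other) const {
    return Other.Score > EPS && Other.Score > Score + EPS;
  }
};

// Keep the current best gain unless the candidate is preferred.
Gain pickBetter(const Gain &Best, const Gain &Candidate) {
  return Best < Candidate ? Candidate : Best;
}
```

Because of the `Other.Score > EPS` condition, a zero-score candidate never replaces the current best, not even the default "invalid" one. In this comparison the initial value of -1.0 therefore documents intent rather than changing which candidate wins.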

// Size of the block in the binary.

uint64_t Size{0};

// Execution count of the block in the binary.

uint64_t ExecutionCount{0};

// Current chain of the basic block.

Chain *CurChain{nullptr};

// An offset of the block in the current chain.

mutable uint64_t EstimatedAddr{0};

// Fallthrough successor of the node in CFG.

Block *FallthroughSucc{nullptr};

// Fallthrough predecessor of the node in CFG.

Block *FallthroughPred{nullptr};

// Outgoing jumps from the block.

std::vector<std::pair<Block *, uint64_t>> OutJumps;

// Incoming jumps to the block.

rahmanlUnsubmitted

Done

The abstraction for block-edges is very loosely defined. I think defining a struct will help.

std::vector<std::pair<Block *, uint64_t>> InJumps;

public:

explicit Block(size_t Index, uint64_t Size_, uint64_t EC)

: Index(Index), Size(Size_), ExecutionCount(EC) {}

};

/// A chain (ordered sequence) of CFG nodes.

class Chain {

public:

Chain(const Chain &) = delete;

Chain(Chain &&) = default;

Chain &operator=(const Chain &) = delete;

Chain &operator=(Chain &&) = default;

explicit Chain(size_t Id, Block *Block)

: Id(Id), IsEntry(Block->Index == 0), Score(0), Blocks(1, Block) {}

size_t id() const { return Id; }

bool isEntryPoint() const { return IsEntry; }

double score() const { return Score; }

void setScore(double NewScore) { Score = NewScore; }

const std::vector<Block *> &blocks() const { return Blocks; }

const std::vector<std::pair<Chain *, Edge *>> &edges() const { return Edges; }

Edge *getEdge(Chain *Other) const {

for (auto It : Edges) {

if (It.first == Other)

return It.second;

}

return nullptr;

}

void removeEdge(Chain *Other) {

auto It = Edges.begin();

while (It != Edges.end()) {

if (It->first == Other) {

Edges.erase(It);

return;

}

It++;

}

}

void addEdge(Chain *Other, Edge *Edge) {

Edges.push_back(std::make_pair(Other, Edge));

}

void merge(Chain *Other, const std::vector<Block *> &MergedBlocks) {

Blocks = MergedBlocks;

IsEntry |= Other->IsEntry;

// Update the block's chains

for (size_t Idx = 0; Idx < Blocks.size(); Idx++) {

Blocks[Idx]->CurChain = this;

Blocks[Idx]->CurIndex = Idx;

}

}

void mergeEdges(Chain *Other);

void clear() {

Blocks.clear();

Edges.clear();

}

private:

// Unique chain identifier.

size_t Id;

// Whether the chain starts with the entry basic block.

bool IsEntry;

// Cached ext-tsp score for the chain.

rahmanlUnsubmitted

Done

Do we really need to store a boolean here? It adds extra burden to make sure we keep it consistent after merging.

double Score;

// Blocks of the chain.

std::vector<Block *> Blocks;

// Adjacent chains and corresponding edges (lists of jumps).

std::vector<std::pair<Chain *, Edge *>> Edges;

};

rahmanlUnsubmitted

Done

How do we ensure the obsolete chains are removed from this structure? Would it make life easier if we define it as a map from the Chain pointers (or Chain ids) ?

spupyrevAuthorUnsubmitted

Done

I think this was done for optimizing the performance. The tradeoff here is finding an adjacent edge (via getEdge(Chain *Other)) vs iterating over and appending the edges. (The former is done only once, when the two chains are merged). Empirically, std::vector+linear search was faster than std::unordered_map on my benchmarks by 10%-15%. Probably an alternative map implementation may work better; not sure if that's worth it though.

/// An edge in CFG representing jumps between two chains.

/// When blocks are merged into chains, the edges are combined too so that

/// there is always at most one edge between a pair of chains.

class Edge {

public:

rahmanlUnsubmitted

Done

Maybe call this ChainEdge so it's not confused with a basic block edge/jump.

Edge(const Edge &) = delete;

Edge(Edge &&) = default;

Edge &operator=(const Edge &) = delete;

Edge &operator=(Edge &&) = default;

explicit Edge(Block *SrcBlock, Block *DstBlock, uint64_t EC)

: SrcChain(SrcBlock->CurChain), DstChain(DstBlock->CurChain),

Jumps(1, std::make_pair(std::make_pair(SrcBlock, DstBlock), EC)) {}

const JumpList &jumps() const { return Jumps; }

void changeEndpoint(Chain *From, Chain *To) {

if (From == SrcChain)

SrcChain = To;

if (From == DstChain)

DstChain = To;

}

void appendJump(Block *SrcBlock, Block *DstBlock, uint64_t EC) {

Jumps.push_back(std::make_pair(std::make_pair(SrcBlock, DstBlock), EC));

}

void moveJumps(Edge *Other) {

rahmanlUnsubmitted

Done

Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).

spupyrevAuthorUnsubmitted

Done

I don't think that makes any measurable difference, as we're working with relatively small instances. But explicitly freeing up memory won't hurt, i guess.
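
For reference, a minimal sketch of the swap idiom mentioned in this thread. `clearAndRelease` is a hypothetical helper name, not something defined in the patch.

```cpp
#include <vector>

// Clear a vector and also release its heap storage. Swapping with a
// temporary empty vector hands the old buffer to the temporary, which
// frees it when the temporary is destroyed at the end of the statement.
template <typename T> void clearAndRelease(std::vector<T> &V) {
  std::vector<T>().swap(V);
}
```

Calling `clear()` followed by `shrink_to_fit()` is an alternative, but `shrink_to_fit()` is only a non-binding request to reduce the capacity.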

Jumps.insert(Jumps.end(), Other->Jumps.begin(), Other->Jumps.end());

Other->Jumps.clear();

}

bool hasCachedMergeGain(Chain *Src, Chain *Dst) const {

return Src == SrcChain ? CacheValidForward : CacheValidBackward;

}

MergeGainTy getCachedMergeGain(Chain *Src, Chain *Dst) const {

return Src == SrcChain ? CachedGainForward : CachedGainBackward;

hoyUnsubmitted

Not Done

nit: we usually use uint64_t for id type explicitly.

}

void setCachedMergeGain(Chain *Src, Chain *Dst, MergeGainTy MergeGain) {

if (Src == SrcChain) {

CachedGainForward = MergeGain;

CacheValidForward = true;

} else {

CachedGainBackward = MergeGain;

CacheValidBackward = true;

}

}

void invalidateCache() {

CacheValidForward = false;

CacheValidBackward = false;

}

private:

// Source chain.

Chain *SrcChain{nullptr};

// Destination chain.

Chain *DstChain{nullptr};

// Original jumps in the binary with corresponding execution counts.

JumpList Jumps;

// Cached ext-tsp value for merging the pair of chains.

// Since the gain of merging (Src, Dst) and (Dst, Src) might be different,

// we store both values here.

MergeGainTy CachedGainForward;

MergeGainTy CachedGainBackward;

// Whether the cached values are valid; if not, they must be recomputed.

bool CacheValidForward{false};

rahmanlUnsubmitted

Done

Same comment. Storage won't be released.

bool CacheValidBackward{false};

};

void Chain::mergeEdges(Chain *Other) {

assert(this != Other && "cannot merge a chain with itself");

// Update edges adjacent to chain Other

for (auto EdgeIt : Other->Edges) {

const auto DstChain = EdgeIt.first;

const auto DstEdge = EdgeIt.second;

const auto TargetChain = DstChain == Other ? this : DstChain;

auto CurEdge = getEdge(TargetChain);

if (CurEdge == nullptr) {

DstEdge->changeEndpoint(Other, this);

this->addEdge(TargetChain, DstEdge);

if (DstChain != this && DstChain != Other) {

DstChain->addEdge(this, DstEdge);

}

} else {

CurEdge->moveJumps(DstEdge);

}

// Cleanup leftover edge

if (DstChain != Other) {

DstChain->removeEdge(Other);

}

}

}

/// A wrapper around three chains of basic blocks; it is used to avoid extra

/// instantiation of the vectors.

class MergedChain {

public:

MergedChain(BlockIter Begin1, BlockIter End1, BlockIter Begin2 = BlockIter(),

BlockIter End2 = BlockIter(), BlockIter Begin3 = BlockIter(),

BlockIter End3 = BlockIter())

: Begin1(Begin1), End1(End1), Begin2(Begin2), End2(End2), Begin3(Begin3),

End3(End3) {}

template <typename F> void forEach(const F &Func) const {

for (auto It = Begin1; It != End1; It++)

Func(*It);

for (auto It = Begin2; It != End2; It++)

Func(*It);

for (auto It = Begin3; It != End3; It++)

Func(*It);

}

std::vector<Block *> getBlocks() const {

std::vector<Block *> Result;

Result.reserve(std::distance(Begin1, End1) + std::distance(Begin2, End2) +

std::distance(Begin3, End3));

Result.insert(Result.end(), Begin1, End1);

Result.insert(Result.end(), Begin2, End2);

Result.insert(Result.end(), Begin3, End3);

return Result;

}

const Block *getFirstBlock() const { return *Begin1; }

private:

BlockIter Begin1;

BlockIter End1;

BlockIter Begin2;

BlockIter End2;

BlockIter Begin3;

BlockIter End3;

};

/// The implementation of the ExtTSP algorithm.

class ExtTSPImpl {

using NodeOrder = std::vector<uint64_t>;

using NodeSizeMap = DenseMap<uint64_t, uint64_t>;

using NodeCountMap = DenseMap<uint64_t, uint64_t>;

using EdgeT = std::pair<uint64_t, uint64_t>;

using EdgeCountMap = DenseMap<EdgeT, uint64_t>;

public:

ExtTSPImpl(size_t NumNodes, const NodeSizeMap &NodeSizes,

const NodeCountMap &NodeCounts, const EdgeCountMap &EdgeCounts)

: NumNodes(NumNodes) {

initialize(NodeSizes, NodeCounts, EdgeCounts);

}

/// Run the algorithm and return an ordering of basic blocks.

void run(std::vector<uint64_t> &Result) {

// Pass 1: Merge blocks with their fallthrough successors

mergeFallthroughs();

// Pass 2: Merge pairs of chains while improving the ExtTSP objective

mergeChainPairs();

// Pass 3: Merge cold blocks to reduce code size

mergeColdChains();

// Collect blocks from all chains

concatChains(Result);

}

private:

/// Initialize algorithm's data structures.

void initialize(const NodeSizeMap &NodeSizes, const NodeCountMap &NodeCounts,

kosarevUnsubmitted

Not Done

This seems to fail on expensive checks, see https://github.com/llvm/llvm-project/issues/68594.

Do {Begin,End}{2,3} really form valid ranges when default to BlockIter() in the constructor?

spupyrevAuthorUnsubmitted

Done

Can you teach me to reproduce the failure? I'm building with "-DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON" and run "ninja check-all" but still don't see it

kosarevUnsubmitted

Not Done

It fails on libc++'s consistency checks, which get enabled automatically on builds with LLVM's expensive checks turned on, -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON.

spupyrevAuthorUnsubmitted

Done

hmm. I still cannot reproduce this. Here is my command:

cmake \
    -G Ninja \
    -DCMAKE_ASM_COMPILER=$MYCLANG \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_CXX_COMPILER=$MYCLANG++ \
    -DCMAKE_C_COMPILER=$MYCLANG \
    -DLLVM_TARGETS_TO_BUILD="X86;AArch64" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld;bolt" \
    -DCMAKE_EXE_LINKER_FLAGS="-L /usr/lib64" \
    -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON \
    ../llvm

     ninja check-all

Is something else missing?

kosarevUnsubmitted

Not Done

[ RUN      ] CodeLayout.ThreeFunctions
/usr/include/c++/11/debug/vector:617:
In function:
    std::debug::vector<_Tp, _Allocator>::iterator std::debug::vector<_Tp, 
    _Allocator>::insert(std::debug::vector<_Tp, _Allocator>::const_iterator, 
    _InputIterator, _InputIterator) [with _InputIterator = 
    gnu_debug::_Safe_iterator<gnu_cxx::normal_iterator<{anonymous}::NodeT* 
    const*, std::vector<{anonymous}::NodeT*, 
    std::allocator<{anonymous}::NodeT*> > >, std::
    debug::vector<{anonymous}::NodeT*>, std::random_access_iterator_tag>; 
    <template-parameter-2-2> = void; _Tp = {anonymous}::NodeT*; _Allocator = 
    std::allocator<{anonymous}::NodeT*>; std::debug::vector<_Tp, 
    _Allocator>::iterator = std::
    debug::vector<{anonymous}::NodeT*>::iterator; std::debug::vector<_Tp, 
    _Allocator>::const_iterator = std::
    debug::vector<{anonymous}::NodeT*>::const_iterator]

Error: function requires a valid iterator range [first, last).

Objects involved in the operation:
    iterator "first" @ 0x7ffe7dbd1c40 {
      state = singular;
    }
    iterator "last" @ 0x7ffe7dbd1c70 {
      state = singular;
    }
 #0 0x00007f5c10bc3b02 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:22
 #1 0x00007f5c10bc3f1e PrintStackTraceSignalHandler(void*) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:798:1
 #2 0x00007f5c10bc1363 llvm::sys::RunSignalHandlers() /home/kosarev/labs/llvm-project/llvm/lib/Support/Signals.cpp:105:20
 #3 0x00007f5c10bc339a SignalHandler(int) /home/kosarev/labs/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #4 0x00007f5c10447520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f5c1049b9fc __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #6 0x00007f5c1049b9fc __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #7 0x00007f5c1049b9fc pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #8 0x00007f5c10447476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007f5c1042d7f3 abort ./stdlib/./stdlib/abort.c:81:7
#10 0x00007f5c106d21fb std::__throw_bad_exception() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa51fb)
#11 0x00007f5c12dd8740 __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT**, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag> std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >::insert<__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, void>(__gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>, __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<(anonymous namespace)::NodeT* const*, std::__cxx1998::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> > >, std::__debug::vector<(anonymous namespace)::NodeT*, std::allocator<(anonymous namespace)::NodeT*> >, std::random_access_iterator_tag>) /usr/include/c++/11/debug/vector:617:4
#12 0x00007f5c12dcee70 (anonymous namespace)::MergedChain::getNodes() const /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:504:18
#13 0x00007f5c12dd5141 (anonymous namespace)::CDSortImpl::mergeChains((anonymous namespace)::ChainT*, (anonymous namespace)::ChainT*, unsigned long, (anonymous namespace)::MergeTypeT) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1288:16
#14 0x00007f5c12dd4211 (anonymous namespace)::CDSortImpl::mergeChainPairs() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1144:50
#15 0x00007f5c12dd302c (anonymous namespace)::CDSortImpl::run() /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1008:5
#16 0x00007f5c12dd615c llvm::codelayout::computeCacheDirectedLayout(llvm::codelayout::CDSortConfig const&, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1435:3
#17 0x00007f5c12dd6333 llvm::codelayout::computeCacheDirectedLayout(llvm::ArrayRef<unsigned long>, llvm::ArrayRef<unsigned long>, llvm::ArrayRef<llvm::codelayout::EdgeCount>, llvm::ArrayRef<unsigned long>) /home/kosarev/labs/llvm-project/llvm/lib/Transforms/Utils/CodeLayout.cpp:1453:48
#18 0x00005638fc0abb97 (anonymous namespace)::CodeLayout_ThreeFunctions_Test::TestBody() /home/kosarev/labs/llvm-project/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp:18:78
#19 0x00007f5c13e2ef1a void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2612:28
#20 0x00007f5c13e1f20f void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#21 0x00007f5c13df2bca testing::Test::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2694:30
#22 0x00007f5c13df350b testing::TestInfo::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2839:3
#23 0x00007f5c13df3eea testing::TestSuite::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:3017:35
#24 0x00007f5c13e01306 testing::internal::UnitTestImpl::RunAllTests() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5921:41
#25 0x00007f5c13e315fd bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2614:1
#26 0x00007f5c13e20b7a bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:2667:75
#27 0x00007f5c13dffc25 testing::UnitTest::Run() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/src/gtest.cc:5487:14
#28 0x00007f5c13efde7f RUN_ALL_TESTS() /home/kosarev/labs/llvm-project/third-party/unittest/googletest/include/gtest/gtest.h:2317:80
#29 0x00007f5c13efdd61 main /home/kosarev/labs/llvm-project/third-party/unittest/UnitTestMain/TestMain.cpp:55:24
#30 0x00007f5c1042ed90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#31 0x00007f5c1042ee40 call_init ./csu/../csu/libc-start.c:128:20
#32 0x00007f5c1042ee40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#33 0x00005638fc0423e5 _start (/home/kosarev/labs/llvm-project/build/debug+expensive_checks/unittests/Transforms/Utils/./UtilsTests+0x1f3e5)

kosarev: Looks OK to me. I'd make sure the code is rebuilt from scratch and `_GLIBCXX_DEBUG` is defined…

spupyrevAuthorUnsubmitted

Done

Thanks for such a detailed analysis and for teaching me about singular iterators.
https://github.com/llvm/llvm-project/pull/68917
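
One way to avoid default-constructed iterators as range endpoints, sketched under assumptions: unused ranges default to the `begin()`/`end()` of a shared empty vector, so every iterator pair belongs to a real container. This is not necessarily how the linked pull request resolved the issue; `ThreeRanges`, `Item`, and `getItems` are illustrative names, with `int` standing in for `Block *`.

```cpp
#include <iterator>
#include <vector>

using Item = int; // Stands in for Block * in this sketch.
using ItemIter = std::vector<Item>::const_iterator;

// Three-range view in the spirit of MergedChain.
class ThreeRanges {
public:
  ThreeRanges(ItemIter Begin1, ItemIter End1)
      : ThreeRanges(Begin1, End1, empty().begin(), empty().end()) {}

  ThreeRanges(ItemIter Begin1, ItemIter End1, ItemIter Begin2, ItemIter End2)
      : ThreeRanges(Begin1, End1, Begin2, End2, empty().begin(),
                    empty().end()) {}

  ThreeRanges(ItemIter Begin1, ItemIter End1, ItemIter Begin2, ItemIter End2,
              ItemIter Begin3, ItemIter End3)
      : Begin1(Begin1), End1(End1), Begin2(Begin2), End2(End2),
        Begin3(Begin3), End3(End3) {}

  // Concatenate the three ranges into one vector.
  std::vector<Item> getItems() const {
    std::vector<Item> Result;
    Result.reserve(std::distance(Begin1, End1) + std::distance(Begin2, End2) +
                   std::distance(Begin3, End3));
    Result.insert(Result.end(), Begin1, End1);
    Result.insert(Result.end(), Begin2, End2);
    Result.insert(Result.end(), Begin3, End3);
    return Result;
  }

private:
  // Shared empty container; a function-local static avoids static
  // initialization order problems.
  static const std::vector<Item> &empty() {
    static const std::vector<Item> Empty;
    return Empty;
  }

  ItemIter Begin1, End1, Begin2, End2, Begin3, End3;
};
```

The trade-off is three constructors instead of one with default arguments; in exchange, `insert` and `std::distance` only ever see iterators obtained from an actual vector.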

const EdgeCountMap &EdgeCounts) {

// Initialize blocks

AllBlocks.reserve(NumNodes);

for (uint64_t Node = 0; Node < NumNodes; Node++) {

uint64_t Size = std::max<uint64_t>(NodeSizes.find(Node)->second, 1ULL);

uint64_t ExecutionCount = NodeCounts.find(Node)->second;

// The execution count of the entry block is set to at least 1

if (Node == 0 && ExecutionCount == 0)

ExecutionCount = 1;

AllBlocks.emplace_back(Node, Size, ExecutionCount);

}

// Initialize edges for the blocks and compute their total in/out weights

SuccNodes = std::vector<std::vector<uint64_t>>(NumNodes);

PredNodes = std::vector<std::vector<uint64_t>>(NumNodes);

size_t NumEdges = 0;

for (auto It : EdgeCounts) {

auto Pred = It.first.first;

auto Succ = It.first.second;

// Ignore self-edges

if (Pred == Succ)

continue;

SuccNodes[Pred].push_back(Succ);

PredNodes[Succ].push_back(Pred);

auto Count = It.second;

if (Count > 0) {

auto &Block = AllBlocks[Pred];

auto &SuccBlock = AllBlocks[Succ];

SuccBlock.InJumps.push_back(std::make_pair(&Block, Count));

Block.OutJumps.push_back(std::make_pair(&SuccBlock, Count));

NumEdges++;

}

// Initialize chains

AllChains.reserve(NumNodes);

HotChains.reserve(NumNodes);

for (auto &Block : AllBlocks) {

AllChains.emplace_back(Block.Index, &Block);

rahmanlUnsubmitted

Done

private:

- /// Initialize algorithm's data structures.

+ /// Initialize the algorithm's data structures.

void initialize(const NodeSizeMap &NodeSizes, const NodeCountMap &NodeCounts,

rahmanl:

Block.CurChain = &AllChains.back();

if (Block.ExecutionCount > 0) {

HotChains.push_back(&AllChains.back());

}

// Initialize edges

AllEdges.reserve(NumEdges);

for (auto &Block : AllBlocks) {

for (auto &Jump : Block.OutJumps) {

const auto SuccBlock = Jump.first;

auto CurEdge = Block.CurChain->getEdge(SuccBlock->CurChain);

// this edge is already present in the graph

if (CurEdge != nullptr) {

assert(SuccBlock->CurChain->getEdge(Block.CurChain) != nullptr);

CurEdge->appendJump(&Block, SuccBlock, Jump.second);

continue;

}

// this is a new edge

AllEdges.emplace_back(&Block, SuccBlock, Jump.second);

Block.CurChain->addEdge(SuccBlock->CurChain, &AllEdges.back());

SuccBlock->CurChain->addEdge(Block.CurChain, &AllEdges.back());

}

assert(AllEdges.size() <= NumEdges && "Incorrect number of created edges");

}

/// For a pair of blocks, A and B, block B is the fallthrough successor of A,

/// if (i) all jumps (based on profile) from A go to B and (ii) all jumps

rahmanlUnsubmitted

Done

I think fallthrough may be a misnomer here. Normally, a block's fallthrough successor is its fallthrough block in the original layout regardless of whether these blocks have other outgoing edges.
I suggest renaming the concept. My suggestion is "mutually forced" or "mutually dominated" edges.
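To make the condition behind the proposed name concrete, here is a minimal sketch of how "mutually forced" pairs could be detected from successor lists alone. The helper name `findForcedPairs` and the plain-vector graph representation are assumptions made for illustration; they are not part of the patch, which stores this information in `FallthroughSucc`/`FallthroughPred`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the patch): returns pairs (A, B) such that B
// is the only successor of A and A is the only predecessor of B. Node 0 is
// treated as the entry block and is never chosen as B.
std::vector<std::pair<std::size_t, std::size_t>>
findForcedPairs(const std::vector<std::vector<std::size_t>> &Succs) {
  const std::size_t NumNodes = Succs.size();
  // Build predecessor lists from the successor lists.
  std::vector<std::vector<std::size_t>> Preds(NumNodes);
  for (std::size_t Node = 0; Node < NumNodes; ++Node) {
    for (std::size_t Succ : Succs[Node]) {
      assert(Succ < NumNodes && "successor index out of range");
      Preds[Succ].push_back(Node);
    }
  }
  std::vector<std::pair<std::size_t, std::size_t>> Pairs;
  for (std::size_t Node = 0; Node < NumNodes; ++Node) {
    if (Succs[Node].size() != 1)
      continue;
    const std::size_t Succ = Succs[Node][0];
    // Skip the entry block and self-edges; require a unique predecessor.
    if (Succ != 0 && Succ != Node && Preds[Succ].size() == 1)
      Pairs.emplace_back(Node, Succ);
  }
  return Pairs;
}
```

In a diamond-shaped CFG with a single block in front of it, only the edge from the entry into the diamond head qualifies; the join block has two predecessors and is therefore not forced.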


/// to B are from A. Such blocks should be adjacent in the optimal ordering;

/// the method finds and merges such pairs of blocks.

void mergeFallthroughs() {

// Find fallthroughs based on edge weights

for (auto &Block : AllBlocks) {

if (SuccNodes[Block.Index].size() == 1 &&

PredNodes[SuccNodes[Block.Index][0]].size() == 1 &&

SuccNodes[Block.Index][0] != 0) {

size_t SuccIndex = SuccNodes[Block.Index][0];

Block.FallthroughSucc = &AllBlocks[SuccIndex];

AllBlocks[SuccIndex].FallthroughPred = &Block;

}

// There might be 'cycles' in the fallthrough dependencies (since profile

// data isn't 100% accurate).

// Break the cycles by choosing the block with smallest index as the tail

for (auto &Block : AllBlocks) {

rahmanlUnsubmitted

Done

Please explain why you choose such a block.

if (Block.FallthroughSucc == nullptr || Block.FallthroughPred == nullptr)

continue;

auto SuccBlock = Block.FallthroughSucc;

while (SuccBlock != nullptr && SuccBlock != &Block) {

SuccBlock = SuccBlock->FallthroughSucc;

}

if (SuccBlock == nullptr)

continue;

// Break the cycle

AllBlocks[Block.FallthroughPred->Index].FallthroughSucc = nullptr;

Block.FallthroughPred = nullptr;

}

// Merge blocks with their fallthrough successors

for (auto &Block : AllBlocks) {

if (Block.FallthroughPred == nullptr &&

Block.FallthroughSucc != nullptr) {

auto CurBlock = &Block;

while (CurBlock->FallthroughSucc != nullptr) {

const auto NextBlock = CurBlock->FallthroughSucc;

mergeChains(Block.CurChain, NextBlock->CurChain, 0, MergeTypeTy::X_Y);

CurBlock = NextBlock;

}

/// Merge pairs of chains while improving the ExtTSP objective.

void mergeChainPairs() {

/// Deterministically compare pairs of chains

auto compareChainPairs = [](const Chain *A1, const Chain *B1,

const Chain *A2, const Chain *B2) {

if (A1 != A2)

return A1->id() < A2->id();

return B1->id() < B2->id();

};

while (HotChains.size() > 1) {

Chain *BestChainPred = nullptr;

Chain *BestChainSucc = nullptr;

auto BestGain = MergeGainTy();

// Iterate over all pairs of chains

for (auto ChainPred : HotChains) {

// Get candidates for merging with the current chain

for (auto EdgeIter : ChainPred->edges()) {

auto ChainSucc = EdgeIter.first;

auto ChainEdge = EdgeIter.second;

// Ignore loop edges

if (ChainPred == ChainSucc)

continue;

// Compute the gain of merging the two chains

auto CurGain = mergeGain(ChainPred, ChainSucc, ChainEdge);

if (CurGain.score() <= EPS)

continue;

if (BestGain < CurGain ||

(std::abs(CurGain.score() - BestGain.score()) < EPS &&

compareChainPairs(ChainPred, ChainSucc, BestChainPred,

BestChainSucc))) {

BestGain = CurGain;

BestChainPred = ChainPred;

BestChainSucc = ChainSucc;

}

// Stop merging when there is no improvement

if (BestGain.score() <= EPS)

break;

// Merge the best pair of chains

mergeChains(BestChainPred, BestChainSucc, BestGain.mergeOffset(),

BestGain.mergeType());

}

/// Merge cold blocks to reduce code size.

void mergeColdChains() {

for (size_t SrcBB = 0; SrcBB < NumNodes; SrcBB++) {

// Iterating over neighbors in the reverse order to make sure original

// fallthrough jumps are merged first

size_t NumSuccs = SuccNodes[SrcBB].size();

for (size_t Idx = 0; Idx < NumSuccs; Idx++) {

auto DstBB = SuccNodes[SrcBB][NumSuccs - Idx - 1];

auto SrcChain = AllBlocks[SrcBB].CurChain;

auto DstChain = AllBlocks[DstBB].CurChain;

if (SrcChain != DstChain && !DstChain->isEntryPoint() &&

SrcChain->blocks().back()->Index == SrcBB &&

DstChain->blocks().front()->Index == DstBB) {

mergeChains(SrcChain, DstChain, 0, MergeTypeTy::X_Y);

}

/// Compute ExtTSP score for a given order of basic blocks.

double score(const MergedChain &MergedBlocks, const JumpList &Jumps) const {

rahmanlUnsubmitted

Done

This computes the total score for the jumps in the jump list. Please change the comment accordingly.


if (Jumps.empty())

return 0.0;

uint64_t CurAddr = 0;

MergedBlocks.forEach([&](const Block *BB) {

BB->EstimatedAddr = CurAddr;

CurAddr += BB->Size;

});

double Score = 0;

for (auto &Jump : Jumps) {

const auto SrcBlock = Jump.first.first;

const auto DstBlock = Jump.first.second;

Score += extTSPScore(SrcBlock->EstimatedAddr, SrcBlock->Size,

DstBlock->EstimatedAddr, Jump.second);

}

return Score;

}

/// Compute the gain of merging two chains.

///

/// The function considers all possible ways of merging two chains and

/// computes the one having the largest increase in ExtTSP objective. The

/// result is a pair with the first element being the gain and the second

/// element being the corresponding merging type.

MergeGainTy mergeGain(Chain *ChainPred, Chain *ChainSucc, Edge *Edge) const {

if (Edge->hasCachedMergeGain(ChainPred, ChainSucc)) {

return Edge->getCachedMergeGain(ChainPred, ChainSucc);

}

// Precompute jumps between ChainPred and ChainSucc

auto Jumps = Edge->jumps();

auto EdgePP = ChainPred->getEdge(ChainPred);

if (EdgePP != nullptr)

Jumps.insert(Jumps.end(), EdgePP->jumps().begin(), EdgePP->jumps().end());

assert(Jumps.size() > 0 && "trying to merge chains w/o jumps");

MergeGainTy Gain = MergeGainTy();

// Try to concatenate two chains w/o splitting

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, 0,

MergeTypeTy::X_Y);

// Attach (a part of) ChainPred before the first block of ChainSucc

for (auto &Jump : ChainSucc->blocks().front()->InJumps) {

rahmanlUnsubmitted

Done

I see that the following two loops try to form a new fallthrough between ChainPred and ChainSucc, and that this is independent of the size of ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit/overhead in a better way.

spupyrevAuthorUnsubmitted

Done

I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and is responsible for just a few lines of code. I am not sure the separation would noticeably simplify reviewing.

I do agree with another point here: the algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they were tuned a few years ago on a couple of workloads running on particular hardware. I would really appreciate the community's help.

For now, however, I'd keep the params and optimizations as is, so that the existing customers do not see sudden regressions.

rahmanlUnsubmitted

Done

Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero.

One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like PredChain->blocks[Offset]->ForcedSucc != nullptr) and calls computeMergeGain for each merge type. I think we can make this function more readable and concise as well.


spupyrevAuthorUnsubmitted

Done

Added an option to disable this extra optimization.

Not sure I understand the proposal here.


rahmanlUnsubmitted

Done

You can use something like the following function:

auto ComputeMergeGainWithOffsetAndTypes =
    [&](size_t Offset, const std::vector<MergeTypeTy> &merge_types) {
  if (Offset == 0 || Offset == ChainPred->blocks().size())
    return;
  auto BB = ChainPred->blocks()[Offset - 1];
  // Skip the splitting if it breaks FT successors
  if (BB->ForcedSucc != nullptr)
    return;
  for (auto &MergeType : merge_types) {
    Gain.updateIfLessThan(
        computeMergeGain(ChainPred, ChainSucc, Jumps, Offset, MergeType));
  }
};

const auto SrcBlock = Jump.first;

if (SrcBlock->CurChain != ChainPred)

continue;

if (SrcBlock->FallthroughSucc != nullptr)

continue;

davidxlUnsubmitted

Not Done

It is confusing to have both this method and calcExtTSPScore methods.


size_t Offset = SrcBlock->CurIndex + 1;

if (Offset == ChainPred->blocks().size())

continue;

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::X1_Y_X2);

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::X2_X1_Y);

}

// Attach (a part of) ChainPred after the last block of ChainSucc

for (auto &Jump : ChainSucc->blocks().back()->OutJumps) {

const auto DstBlock = Jump.first;

if (DstBlock->CurChain != ChainPred)

continue;

if (DstBlock->FallthroughPred != nullptr)

continue;

size_t Offset = DstBlock->CurIndex;

if (Offset == 0)

continue;

rahmanlUnsubmitted

Done

Please rename this to GetBestMergeGain for consistency with the function comment and to further separate it from computeMergeGain.


Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::X1_Y_X2);

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::Y_X2_X1);

}

// Try to break ChainPred in various ways and concatenate with ChainSucc

if (ChainPred->blocks().size() <= ChainSplitThreshold) {

for (size_t Offset = 1; Offset < ChainPred->blocks().size(); Offset++) {

auto BB1 = ChainPred->blocks()[Offset - 1];

// Skip the splitting if it breaks FT successors

if (BB1->FallthroughSucc != nullptr) {

#ifndef NDEBUG

auto BB2 = ChainPred->blocks()[Offset];

assert(BB1->FallthroughSucc == BB2 && "Fallthrough not preserved");

#endif

continue;

rahmanlUnsubmitted

Done

This seems to be an assertion that is unrelated to the purpose of this code block. Could you please remove or move it to a relevant position?


}

// Try to split the chain in different ways

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::X1_Y_X2);

rahmanlUnsubmitted

Not Done

Why not do X2_Y_X1 here?

rahmanlUnsubmitted

Not Done

I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?


spupyrevAuthorUnsubmitted

Done

Yes, it does not give much benefit. I checked a binary with ~100K instances: adding this type of split makes a difference for 4 CFGs, but it is responsible for ~25% of splits. Keeping the split is probably not worth it.

rahmanlUnsubmitted

Done

Great. Please add a comment to explain why this splitting is not exercised.
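For readers following the split discussion, the sketch below enumerates the four merge orders that the patch exercises (X_Y, X1_Y_X2, Y_X2_X1, X2_X1_Y); the X2_Y_X1 order discussed above is intentionally left out. It is a simplified stand-in for `mergeBlocks` that works on plain integers rather than `Block *`, and the names `MergeKind` and `mergeOrder` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative enum mirroring the four merge types exercised in the patch.
enum class MergeKind { X_Y, X1_Y_X2, Y_X2_X1, X2_X1_Y };

// Hypothetical stand-in for mergeBlocks(): splits X at Offset into X1 and X2
// and concatenates the pieces with Y in the order named by Kind.
std::vector<int> mergeOrder(const std::vector<int> &X,
                            const std::vector<int> &Y, std::size_t Offset,
                            MergeKind Kind) {
  assert(Offset <= X.size() && "split offset out of range");
  const auto Split = X.begin() + static_cast<std::ptrdiff_t>(Offset);
  const std::vector<int> X1(X.begin(), Split);
  const std::vector<int> X2(Split, X.end());
  std::vector<int> Result;
  Result.reserve(X.size() + Y.size());
  auto Append = [&Result](const std::vector<int> &Part) {
    Result.insert(Result.end(), Part.begin(), Part.end());
  };
  switch (Kind) {
  case MergeKind::X_Y:
    Append(X1); Append(X2); Append(Y);
    break;
  case MergeKind::X1_Y_X2:
    Append(X1); Append(Y); Append(X2);
    break;
  case MergeKind::Y_X2_X1:
    Append(Y); Append(X2); Append(X1);
    break;
  case MergeKind::X2_X1_Y:
    Append(X2); Append(X1); Append(Y);
    break;
  }
  return Result;
}
```

With X = {1, 2, 3}, Y = {7, 8} and a split offset of 1, the X1_Y_X2 order is 1, 7, 8, 2, 3. A merge-gain routine would score each candidate order and keep the best one.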


Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::Y_X2_X1);

Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,

MergeTypeTy::X2_X1_Y);

}

Edge->setCachedMergeGain(ChainPred, ChainSucc, Gain);

return Gain;

}

/// Merge two chains and update the best gain.

MergeGainTy computeMergeGain(const MergeGainTy &CurGain,

rahmanlUnsubmitted

Done

Does this actually merge the chains together?


const Chain *ChainPred, const Chain *ChainSucc,

const JumpList &Jumps, size_t MergeOffset,

MergeTypeTy MergeType) const {

auto MergedBlocks = mergeBlocks(ChainPred->blocks(), ChainSucc->blocks(),

MergeOffset, MergeType);

// Do not allow a merge that does not preserve the original entry block

if ((ChainPred->isEntryPoint() || ChainSucc->isEntryPoint()) &&

MergedBlocks.getFirstBlock()->Index != 0)

return CurGain;

// The gain for the new chain

const auto NewScore = score(MergedBlocks, Jumps) - ChainPred->score();

auto NewGain = MergeGainTy(NewScore, MergeOffset, MergeType);

rahmanlUnsubmitted

Done

Is this the new score or the change in the score? ChainSucc's intra-chain jumps are not included in the Jumps vector.


return CurGain < NewGain ? NewGain : CurGain;

}

rahmanlUnsubmitted

Done

This is inconsistent with the function comment. CurGain is not updated here, but rather in the caller of this function.

My suggestion is to add the following function to MergeGainTy and use it in the callsite. This way you can remove the first parameter from computeMergeGain and also computeMergeGain can return what it computes.

void updateIfLessThan(const MergeGainTy &OtherGain) {
  if (*this < OtherGain)
    *this = OtherGain;
}
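A self-contained sketch of how this suggestion could look at a call site follows. `GainSketch` is a simplified, hypothetical stand-in for `MergeGainTy` (the merge type is omitted and the default score of -1 is an assumption), and `pickBest` is an illustrative helper showing the caller folding candidates into the best gain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-in for MergeGainTy.
class GainSketch {
public:
  GainSketch() = default;
  GainSketch(double Score, std::size_t MergeOffset)
      : Score(Score), MergeOffset(MergeOffset) {}

  double score() const { return Score; }
  std::size_t mergeOffset() const { return MergeOffset; }

  bool operator<(const GainSketch &Other) const { return Score < Other.Score; }

  // Replace this gain with Other only if Other is strictly better.
  void updateIfLessThan(const GainSketch &Other) {
    if (*this < Other)
      *this = Other;
  }

private:
  double Score = -1.0; // assumed "no gain yet" sentinel
  std::size_t MergeOffset = 0;
};

// Fold a list of candidate gains into the best one, the way a caller of
// computeMergeGain could after the suggested refactoring.
GainSketch pickBest(const std::vector<GainSketch> &Candidates) {
  assert(!Candidates.empty() && "expected at least one candidate");
  GainSketch Best;
  for (const GainSketch &Candidate : Candidates)
    Best.updateIfLessThan(Candidate);
  return Best;
}
```

Because the comparison is strict, the first of several equally good candidates is kept, which keeps the result independent of later ties.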


/// Merge two chains of blocks respecting a given merge 'type' and 'offset'.

///

/// If MergeType == 0, then the result is a concatenation of two chains.

/// Otherwise, the first chain is cut into two sub-chains at the offset,

MaskRayUnsubmitted

Not Done

This should be Jump->Target. See https://github.com/llvm/llvm-project/pull/66592


/// and merged using all possible ways of concatenating three chains.

MergedChain mergeBlocks(const std::vector<Block *> &X,

const std::vector<Block *> &Y, size_t MergeOffset,

MergeTypeTy MergeType) const {

rahmanlUnsubmitted

Not Done

Rename this to XSplitOffset.


// Split the first chain, X, into X1 and X2

BlockIter BeginX1 = X.begin();

BlockIter EndX1 = X.begin() + MergeOffset;

BlockIter BeginX2 = X.begin() + MergeOffset;

BlockIter EndX2 = X.end();

BlockIter BeginY = Y.begin();

BlockIter EndY = Y.end();

// Construct a new chain from the three existing ones

switch (MergeType) {

case MergeTypeTy::X_Y:

return MergedChain(BeginX1, EndX2, BeginY, EndY);

case MergeTypeTy::X1_Y_X2:

return MergedChain(BeginX1, EndX1, BeginY, EndY, BeginX2, EndX2);

case MergeTypeTy::Y_X2_X1:

return MergedChain(BeginY, EndY, BeginX2, EndX2, BeginX1, EndX1);

case MergeTypeTy::X2_X1_Y:

return MergedChain(BeginX2, EndX2, BeginX1, EndX1, BeginY, EndY);

}

llvm_unreachable("unexpected chain merge type");

}

/// Merge chain From into chain Into, update the list of active chains,

/// adjacency information, and the corresponding cached values.

void mergeChains(Chain *Into, Chain *From, size_t MergeOffset,

MergeTypeTy MergeType) {

assert(Into != From && "a chain cannot be merged with itself");

// Merge the blocks

auto MergedBlocks =

mergeBlocks(Into->blocks(), From->blocks(), MergeOffset, MergeType);

Into->merge(From, MergedBlocks.getBlocks());

Into->mergeEdges(From);

From->clear();

// Update cached ext-tsp score for the new chain

auto SelfEdge = Into->getEdge(Into);

if (SelfEdge != nullptr) {

MergedBlocks = MergedChain(Into->blocks().begin(), Into->blocks().end());

Into->setScore(score(MergedBlocks, SelfEdge->jumps()));

}

// Remove chain From from the list of active chains

auto Iter = std::remove(HotChains.begin(), HotChains.end(), From);

HotChains.erase(Iter, HotChains.end());

rahmanlUnsubmitted

Done

It seems that the obsolete chains are not deallocated from the AllChains vector. Does this raise any memory concerns?


// Invalidate caches

for (auto EdgeIter : Into->edges()) {

EdgeIter.second->invalidateCache();

}

/// Concatenate all chains into a final order of blocks.

void concatChains(std::vector<uint64_t> &Order) {

// Collect chains and calculate basic stats (for their ordering)

std::vector<Chain *> SortedChains;

DenseMap<const Chain *, double> ChainDensity;

for (auto &Chain : AllChains) {

if (!Chain.blocks().empty()) {

SortedChains.push_back(&Chain);

// using doubles to avoid overflow of ExecutionCount

double Size = 0;

double ExecutionCount = 0;

for (auto Block : Chain.blocks()) {

Size += static_cast<double>(Block->Size);

ExecutionCount += static_cast<double>(Block->ExecutionCount);

}

assert(Size > 0 && "a chain of zero size");

ChainDensity[&Chain] = ExecutionCount / Size;

}

// Sorting chains by density in the decreasing order

std::stable_sort(SortedChains.begin(), SortedChains.end(),

[&](const Chain *C1, const Chain *C2) {

// Original entry point to the front

if (C1->isEntryPoint() != C2->isEntryPoint()) {

if (C1->isEntryPoint())

return true;

if (C2->isEntryPoint())

return false;

}

const double D1 = ChainDensity[C1];

const double D2 = ChainDensity[C2];

if (D1 != D2)

return D1 > D2;

// Making the order deterministic

return C1->id() < C2->id();

});

// Collect the basic blocks in the order specified by their chains

Order.reserve(NumNodes);

for (auto Chain : SortedChains) {

for (auto Block : Chain->blocks()) {

Order.push_back(Block->Index);

}

private:

/// The number of nodes in the graph.

const size_t NumNodes;

/// Successors of each node.

std::vector<std::vector<uint64_t>> SuccNodes;

/// Predecessors of each node.

std::vector<std::vector<uint64_t>> PredNodes;

/// All basic blocks.

std::vector<Block> AllBlocks;

/// All chains of basic blocks.

std::vector<Chain> AllChains;

/// Active chains. The vector gets updated at runtime when chains are merged.

rahmanlUnsubmitted

Done

Change comment to clearly mention this is for putting the entry block at the beginning.


std::vector<Chain *> HotChains;

/// All edges between chains.

std::vector<Edge> AllEdges;

};

rahmanlUnsubmitted

Done

We can simply use return C1->isEntry()?


} // end of anonymous namespace

std::vector<uint64_t> llvm::applyExtTspLayout(

const DenseMap<uint64_t, uint64_t> &NodeSizes,

const DenseMap<uint64_t, uint64_t> &NodeCounts,

const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts) {

size_t NumNodes = NodeSizes.size();

// Verify correctness of the input data.

rahmanlUnsubmitted

Done

How about shortening this to return (D1 != D2) ? (D1 > D2) : (C1->id() < C2->id())?
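Putting the two comparator suggestions together, the ordering could be sketched as below. `ChainInfo` and `sortChains` are hypothetical names used only for this illustration; the patch sorts `Chain *` and looks densities up in a map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal stand-in for a chain; field names are illustrative.
struct ChainInfo {
  std::uint64_t Id;
  bool IsEntry;
  double Density; // execution count divided by size
};

// Entry chain first, then decreasing density, with ties broken by id so
// that the resulting order is deterministic.
void sortChains(std::vector<ChainInfo> &Chains) {
  assert(std::count_if(Chains.begin(), Chains.end(),
                       [](const ChainInfo &C) { return C.IsEntry; }) <= 1 &&
         "at most one entry chain expected");
  std::stable_sort(Chains.begin(), Chains.end(),
                   [](const ChainInfo &C1, const ChainInfo &C2) {
                     if (C1.IsEntry != C2.IsEntry)
                       return C1.IsEntry;
                     return (C1.Density != C2.Density)
                                ? (C1.Density > C2.Density)
                                : (C1.Id < C2.Id);
                   });
}
```

When exactly one of the two chains is the entry chain, returning `C1.IsEntry` is enough: it is true when C1 must come first and false when C2 must.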


assert(NodeCounts.size() == NodeSizes.size() && "Incorrect input");

assert(NumNodes > 2 && "Incorrect input");

for (size_t Node = 0; Node < NumNodes; Node++) {

assert(NodeSizes.count(Node) > 0 && "Missing node size");

assert(NodeCounts.count(Node) > 0 && "Missing node count");

}

// Apply the reordering algorithm.

auto Alg = ExtTSPImpl(NumNodes, NodeSizes, NodeCounts, EdgeCounts);

std::vector<uint64_t> Result;

Alg.run(Result);

// Verify correctness of the output.

assert(Result.front() == 0 && "Original entry point is not preserved");

assert(Result.size() == NumNodes && "Incorrect size of reordered layout");

return Result;

}

uint64_t llvm::calcExtTspScore(

const std::vector<uint64_t> &Order,

const DenseMap<uint64_t, uint64_t> &NodeSizes,

const DenseMap<uint64_t, uint64_t> &NodeCounts,

const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts) {

DenseMap<uint64_t, uint64_t> BlockIndex;

for (size_t Idx = 0; Idx < Order.size(); Idx++) {

BlockIndex[Order[Idx]] = Idx;

}

uint64_t Score = 0;

for (auto It : EdgeCounts) {

auto Pred = It.first.first;

assert(BlockIndex.find(Pred) != BlockIndex.end() && "Block not found");

auto Succ = It.first.second;

assert(BlockIndex.find(Succ) != BlockIndex.end() && "Block not found");

// Incresing the score if the two nodes are adjacent in the order.

if (BlockIndex[Pred] + 1 == BlockIndex[Succ])

Score += It.second;

}

return Score;

}

uint64_t llvm::calcExtTspScore(

const DenseMap<uint64_t, uint64_t> &NodeSizes,

const DenseMap<uint64_t, uint64_t> &NodeCounts,

const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts) {

hoyUnsubmitted

Not Done

nit: no need to use #ifndef NDEBUG


std::vector<uint64_t> Order;

Order.reserve(NodeSizes.size());

for (size_t Idx = 0; Idx < NodeSizes.size(); Idx++) {

Order.push_back(Idx);

}

return calcExtTspScore(Order, NodeSizes, NodeCounts, EdgeCounts);

}

hoyUnsubmitted

Done

This is an assert-only loop. Put it under #ifndef NDEBUG ?


davidxlUnsubmitted

Done

Can the reserve be combined with the vector declaration, or is the intention to avoid initialization?

davidxlUnsubmitted

Done

From the initialization, Order[Idx] == Idx, so is this map needed?


spupyrevAuthorUnsubmitted

Done

This function is also called when Order != Identity


davidxlUnsubmitted

Not Done

Probably provide another wrapper to compute the mapping and pass it in. The map can use the vector type, so for the case when Order == Identity, Order can be used for both.
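One way to read this suggestion is sketched below: a small wrapper turns the order (position to block) into its inverse (block to position) and stores it in a vector rather than a `DenseMap`. The function name `computeBlockIndex` is hypothetical, and the sketch assumes block ids are the dense range 0..N-1, as they are elsewhere in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: given Order (position -> block id), return the
// inverse mapping (block id -> position) as a plain vector.
std::vector<std::uint64_t>
computeBlockIndex(const std::vector<std::uint64_t> &Order) {
  std::vector<std::uint64_t> BlockIndex(Order.size());
  for (std::size_t Idx = 0; Idx < Order.size(); ++Idx) {
    assert(Order[Idx] < Order.size() && "block id out of range");
    BlockIndex[Order[Idx]] = Idx;
  }
  return BlockIndex;
}
```

For the identity order the inverse equals the order itself, so in that case the same vector can serve as both arguments and no extra map needs to be built.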


davidxlUnsubmitted

Done

typo 'Increasing'


davidxlUnsubmitted

Done

why is the count not used?


davidxlUnsubmitted

Done

This function computes the 'fallthrough' maximizing TSP score, not the extTSP in the original paper. Where is the h function and the K coefficients?

spupyrevAuthorUnsubmitted

Done

Oops. Refactored the function to compute the correct result.
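For context, a score with a distance-dependent h function and per-jump-type K coefficients could be sketched as follows. The call shape of `jumpScore` follows the `extTSPScore(SrcAddr, SrcSize, DstAddr, Count)` call visible in the diff, but the constants (weights of 1.0, 0.1, 0.1 and distance limits of 1024 and 640 bytes) and the block sizes in the usage note are assumptions for illustration; the values used by the patch are not shown in this excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed parameters, for illustration only.
constexpr double FallthroughWeight = 1.0;
constexpr double ForwardWeight = 0.1;
constexpr double BackwardWeight = 0.1;
constexpr double ForwardDistance = 1024.0; // bytes
constexpr double BackwardDistance = 640.0; // bytes

// Score of a single jump: Count * K * h(distance), where h decays linearly
// from 1 to 0 over the allowed distance and is 0 beyond it.
double jumpScore(std::uint64_t SrcAddr, std::uint64_t SrcSize,
                 std::uint64_t DstAddr, std::uint64_t Count) {
  const std::uint64_t SrcEnd = SrcAddr + SrcSize;
  const double Weight = static_cast<double>(Count);
  if (SrcEnd == DstAddr) // fallthrough
    return FallthroughWeight * Weight;
  if (SrcEnd < DstAddr) { // forward jump
    const double Dist = static_cast<double>(DstAddr - SrcEnd);
    return Dist <= ForwardDistance
               ? ForwardWeight * (1.0 - Dist / ForwardDistance) * Weight
               : 0.0;
  }
  const double Dist = static_cast<double>(SrcEnd - DstAddr); // backward jump
  return Dist <= BackwardDistance
             ? BackwardWeight * (1.0 - Dist / BackwardDistance) * Weight
             : 0.0;
}

struct JumpSketch {
  std::size_t Src;
  std::size_t Dst;
  std::uint64_t Count;
};

// Total score of a layout: assign addresses in layout order, then sum the
// scores of all jumps.
double layoutScore(const std::vector<std::size_t> &Order,
                   const std::vector<std::uint64_t> &Sizes,
                   const std::vector<JumpSketch> &Jumps) {
  assert(Order.size() == Sizes.size() && "every block must be placed once");
  std::vector<std::uint64_t> Addr(Sizes.size(), 0);
  std::uint64_t CurAddr = 0;
  for (std::size_t Block : Order) {
    Addr[Block] = CurAddr;
    CurAddr += Sizes[Block];
  }
  double Score = 0.0;
  for (const JumpSketch &J : Jumps)
    Score += jumpScore(Addr[J.Src], Sizes[J.Src], Addr[J.Dst], J.Count);
  return Score;
}
```

On a triangle-shaped CFG with jump counts 40 (b0 to b1), 100 (b0 to b2) and 40 (b1 to b2) and three 16-byte blocks, the layout b0, b2, b1 scores higher than b0, b1, b2 under these assumed constants, because the hot edge becomes a fallthrough.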


llvm/test/CodeGen/X86/code_placement_ext_tsp.ll

This file was added.

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s

				rahmanlUnsubmitted Done
				The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
				spupyrevAuthorUnsubmitted Done
				Added some (possibly naive) test. Let me know if you have something more complex in mind.
				rahmanlUnsubmitted Done
				Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).
				foo
				|  \
				|   \10
				|    \
				|     v
				|17  foo.bb1
				|    /
				|   / 10
				v  v
				foo.bb2
				spupyrevAuthorUnsubmitted Done
				Added this test (func4)
				davidxlUnsubmitted Done
				The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
				spupyrevAuthorUnsubmitted Done
				I added some toy test, let me know if you have something more complex in mind.
				define void @func1() {
				; Test that the placement positions the most likely sucessor first
				;
				rahmanlUnsubmitted Done
				This is missing the ascii cfg.
				; CHECK-LABEL: func1:
				; CHECK: b0
				; CHECK: b2
				; CHECK: b1

				b0:
				%call = call zeroext i1 @a()
				br i1 %call, label %b1, label %b2, !prof !1

				davidxlUnsubmitted Done
				We can reason that, for a non-topological order to be profitable, the weight to the merged block must be at least double the weight of the side branch of the triangular-shaped CFG. In this case it is 100 > 2 * 40. Can you add a negative case that keeps the topological order?
				b1:
				call void @d()
				call void @d()
				call void @d()
				br label %b2

				b2:
				call void @e()
				ret void
				}


				define void @func2() !prof !2 {
				; Test that the placement positions the hot chain is placed continuosly
				;
				davidxlUnsubmitted Done
				Fix typo
				;   +----+  [7]   +-------+
				;   | b1 | <----- |  b0   |
				;   +----+        +-------+
				;     |             |
				;     |             | [15]
				;     |             v
				;     |           +-------+
				;     |           |  b3   |
				;     |           +-------+
				;     |             |
				;     |             | [15]
				;     |             v
				;     |           +-------+   [31]
				;     |           |       | -------+
				;     |           |  b4   |        |
				;     |           |       | <------+
				;     |           +-------+
				;     |             |
				;     |             | [15]
				;     |             v
				;     |    [7]    +-------+
				;     +---------> |  b2   |
				;                 +-------+
				;
				; CHECK-LABEL: func2:
				; CHECK: b0
				; CHECK: b3
				; CHECK: b4
				; CHECK: b2
				; CHECK: b1

				b0:
				call void @d()
				call void @d()
				call void @d()
				%call = call zeroext i1 @a()
				br i1 %call, label %b1, label %b3, !prof !3

				b1:
				call void @d()
				br label %b2

				b2:
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				ret void

				b3:
				call void @d()
				br label %b4

				b4:
				call void @d()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %b2, label %b4, !prof !4
				}


				define void @func3() !prof !5 {
				; A larger test where it is beneficial for locality to break the loop
				;
				;                 +--------+
				;                 |   b0   |
				;                 +--------+
				;                   |
				;                   | [177]
				;                   v
				; +----+ [177]    +---------------------------+
				; | b5 | <------- |            b1             |
				; +----+          +---------------------------+
				;                   |         ^         ^
				;                   | [196]   | [124]   | [70]
				;                   v         |         |
				; +----+  [70]    +--------+  |         |
				; | b4 | <------- |   b2   |  |         |
				; +----+          +--------+  |         |
				;   |               |         |         |
				;   |               | [124]   |         |
				;   |               v         |         |
				;   |             +--------+  |         |
				;   |             |   b3   | -+         |
				;   |             +--------+            |
				;   |                                   |
				;   +-----------------------------------+
				;
				; CHECK-LABEL: func3:
				; CHECK: b0
				; CHECK: b1
				; CHECK: b2
				; CHECK: b3
				; CHECK: b5
				; CHECK: b4

				b0:
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				call void @f()
				br label %b1

				b1:
				%call = call zeroext i1 @a()
				br i1 %call, label %b5, label %b2, !prof !6

				b2:
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %b3, label %b4, !prof !7

				b3:
				call void @d()
				call void @f()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				call void @d()
				br label %b1

				b4:
				call void @d()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				call void @e()
				br label %b1

				b5:
				ret void
				}

				declare zeroext i1 @a()
				declare void @d()
				declare void @e()
				declare void @g()
				declare void @f()

				!1 = !{!"branch_weights", i32 40, i32 100}
				!2 = !{!"function_entry_count", i64 2200}
				!3 = !{!"branch_weights", i32 700, i32 1500}
				!4 = !{!"branch_weights", i32 1500, i32 3100}
				!5 = !{!"function_entry_count", i64 177}
				!6 = !{!"branch_weights", i32 177, i32 196}
				!7 = !{!"branch_weights", i32 125, i32 70}
				davidxlUnsubmitted Done
				This might be a bug in the existing layout. By design, with profile data, in order to take b2 as the layout successor, the incoming edge weight needs to be > 2*10 = 20.
				spupyrevAuthorUnsubmitted Done
				This test uses two different modes of ext-tsp, with and without "chain splitting" -- something that Rahman suggested. The "old" layout in MachineBlockPlacement isn't applied here.
				rahmanlUnsubmitted Done
				Please use the same names as in the CFG so this can be easily mapped. And please do this across all tests.


Revision Contents

Diff 385565

llvm/include/llvm/Transforms/Utils/CodeLayout.h

llvm/lib/CodeGen/MachineBlockPlacement.cpp

llvm/lib/Transforms/Utils/CMakeLists.txt

llvm/lib/Transforms/Utils/CodeLayout.cpp

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
