This is an archive of the discontinued LLVM Phabricator instance.

A new code layout algorithm for function reordering [2/3]
ClosedPublic

Authored by spupyrev on Jun 13 2023, 10:20 AM.

Details

Summary

We are introducing a new algorithm for function layout (reordering) based on the
call graph extracted from profile data. The algorithm is an improvement on
top of a known heuristic, C^3. It tries to co-locate hot functions and functions
frequently executed together in the resulting ordering. Unlike C^3, it explores a larger
search space and has an objective closely tied to the performance of the
instruction and i-TLB caches. Hence the name CDS = Cache-Directed Sort.
The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.

The algorithm shares some similarities with C^3 and with an approach for basic
block reordering (ext-tsp). It works with chains (ordered lists)
of functions. Initially, every chain is an isolated function. On every iteration,
we pick the pair of chains whose merging yields the biggest increase in the
objective, which is a weighted combination of frequency-based and distance-based
locality. That is, we try to co-locate hot functions (so they can share
cache lines) and functions that are frequently executed together. The merging process
stops when only one chain is left, or when merging does not improve the
objective. In the latter case, the remaining chains are sorted by density in
decreasing order.
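The merging loop described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the actual implementation: `Chain`, the quadratic best-pair search, and the `mergeGain` formula are hypothetical stand-ins for what CodeLayout.cpp does (the real code maintains a priority queue of candidate merges and a richer frequency/distance objective).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified sketch of the chain-merging loop described above.
struct Chain {
  std::vector<int> Funcs;  // function ids, in layout order
  double ExecWeight = 0;   // total sampled execution weight
  double Size = 1;         // total code size
  double density() const { return ExecWeight / Size; }
};

// Hypothetical gain: how much the locality objective improves if B is
// appended after A. The real gain mixes frequency- and distance-based terms.
static double mergeGain(const Chain &A, const Chain &B) {
  return A.ExecWeight * B.ExecWeight / (A.Size + B.Size);
}

std::vector<int> layoutFunctions(std::vector<Chain> Chains) {
  // Repeatedly merge the pair of chains with the largest positive gain.
  while (Chains.size() > 1) {
    size_t BestI = 0, BestJ = 0;
    double BestGain = 0;
    for (size_t I = 0; I < Chains.size(); ++I)
      for (size_t J = 0; J < Chains.size(); ++J)
        if (I != J && mergeGain(Chains[I], Chains[J]) > BestGain) {
          BestGain = mergeGain(Chains[I], Chains[J]);
          BestI = I;
          BestJ = J;
        }
    if (BestGain <= 0)
      break;  // no merge improves the objective
    Chain &A = Chains[BestI], &B = Chains[BestJ];
    A.Funcs.insert(A.Funcs.end(), B.Funcs.begin(), B.Funcs.end());
    A.ExecWeight += B.ExecWeight;
    A.Size += B.Size;
    Chains.erase(Chains.begin() + BestJ);
  }
  // Remaining chains are emitted in decreasing density order.
  std::stable_sort(Chains.begin(), Chains.end(),
                   [](const Chain &L, const Chain &R) {
                     return L.density() > R.density();
                   });
  std::vector<int> Order;
  for (const Chain &C : Chains)
    Order.insert(Order.end(), C.Funcs.begin(), C.Funcs.end());
  return Order;
}
```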

Complexity
We regularly apply the algorithm for large data-center binaries containing 10K+
(hot) functions, and the algorithm takes only a few seconds. For some extreme
cases with 100K-1M nodes, the runtime is within minutes.

Perf-impact
We extensively tested the implementation on a benchmark of isolated
binaries and prod services. The impact is measurable for "larger" binaries that
are front-end bound: the CPU time improvement (on top of C^3) is in the range
of [0% .. 1%], resulting from a reduced i-TLB miss rate (by up to 20%) and
a reduced i-cache miss rate (by up to 5%).

Diff Detail

Event Timeline

spupyrev created this revision.Jun 13 2023, 10:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 10:20 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
spupyrev published this revision for review.Jun 13 2023, 1:31 PM
spupyrev edited the summary of this revision. (Show Details)
spupyrev added reviewers: wenlei, hoy, wlei.
spupyrev added subscribers: Amir, maksfb.
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 1:31 PM
spupyrev added a subscriber: shatianw_mt.

What is the baseline of the performance measurement? Is it with AutoFDO or instrumentation FDO? I also wonder if the performance numbers are measured with hugepage text or small pages (it is surprising to see a 20% iTLB reduction over the baseline -- assuming it was built with profile data).

I am also interested in getting more details about your evaluation. Currently, LLD uses CallChainClustering for FDO (https://github.com/llvm-mirror/lld/blob/master/ELF/CallGraphSort.cpp). I wonder how much perf improvement we would get if we hooked this into LLD.

@rahmanl @davidxl Our measurements are always on top of C^3 (the one currently used by CallGraphSort.cpp). If we compare against no function ordering (i.e., the order produced by the compiler), the wins would be well beyond 1% CPU, but that's not a realistic scenario these days. The linked follow-up diffs, D152840 and D153039, enable the ordering in the linker and in BOLT. I'll share more specific numbers from my benchmarks soon.

Here are my measurements on the clang binary (release_14) by compiling two large cpp files (benchmark1 and benchmark2). Negative values are improvements; bold ones are statistically significant.

  1. Using the algorithm in LLD (on top of D152840), with AutoFDO and without huge pages

                     base                  test                  delta(%)
      benchmark1
      task-clock     6440.07 ± 16.94       6373.95 ± 14.70       -1.01 ± 0.33
      icache-misses  218340412 ± 361175    210448621 ± 387645    -3.61 ± 0.16
      itlb-misses    45609238 ± 129225     42629503 ± 46353      -6.38 ± 0.16
      benchmark2
      task-clock     9509.05 ± 21.74       9443.90 ± 30.05       -0.73 ± 0.29
      icache-misses  174525893 ± 294760    166852633 ± 253654    -4.38 ± 0.16
      itlb-misses    36756578 ± 90162      35175447 ± 91296      -4.29 ± 0.29

  2. Using the algorithm in BOLT (on top of D153039), with AutoFDO and without huge pages

                     base                  test                  delta(%)
      benchmark1
      task-clock     5398.95 ± 18.12       5366.50 ± 12.47       -0.52 ± 0.39
      icache-misses  86342101 ± 258598     85715814 ± 152068     -0.63 ± 0.19
      itlb-misses    16891309 ± 40480      15543677 ± 57307      -7.99 ± 0.41
      benchmark2
      task-clock     8307.96 ± 15.84       8316.72 ± 15.50       0.10 ± 0.20
      icache-misses  67742141 ± 470515     65716219 ± 198470     -2.82 ± 0.20
      itlb-misses    13591076 ± 99998      12462672 ± 70100      -8.18 ± 0.54

  3. Using the algorithm in BOLT (on top of D153039), with AutoFDO and with huge pages

                     base                  test                  delta(%)
      benchmark1
      task-clock     5329.71 ± 38.16       5333.77 ± 17.21       0.31 ± 0.49
      icache-misses  89754736 ± 93088      90480531 ± 236996     0.69 ± 0.22
      itlb-misses    2279266 ± 15032       1973922 ± 13429       -13.45 ± 0.86
      benchmark2
      task-clock     8241.64 ± 16.92       8252.00 ± 13.55       0.15 ± 0.21
      icache-misses  69470543 ± 141858     68224372 ± 177928     -1.79 ± 0.32
      itlb-misses    1902566 ± 36542       2070742 ± 19558       9.27 ± 1.60

The timing data with huge pages look as expected. The icache-miss and itlb-miss data look puzzling, though -- benchmark1 sees a slight icache-miss increase and a huge itlb-miss reduction, while benchmark2 shows the opposite. While the slight increase in icache misses can be the result of more conflict misses due to the use of huge pages, the increase in iTLB misses for benchmark2 is surprising.

Thanks for providing the evaluation results. These show nice improvements when huge pages are not used. My intuition is that using huge pages reduces the opportunity for function reordering.

Would you please clarify which specific performance counters are used in the measurement? For iTLB misses, icache_64b.iftag_miss includes speculative execution but frontend_retired.itlb_misses does not.

spupyrev added a comment.EditedJun 21 2023, 5:07 PM

Hmm. I was using "cpu/event=0x85,umask=0x61/u" for i-TLB misses, which we got from https://download.01.org/perfmon (which has since been moved). Back in 2017 (when the algorithm was developed), we thought this was the "right" event to look at, but that might not be the case. Which one would you recommend looking at? I see this page has a good description.

Notice that when we turn on huge pages, the number of iTLB misses decreases by ~10x, and the absolute differences between the A and B sides are much smaller than when no huge pages are used.

> Hmm. I was using "cpu/event=0x85,umask=0x61/u" for i-TLB misses, which we got from https://download.01.org/perfmon (which has since been moved). Back in 2017 (when the algorithm was developed), we thought this was the "right" event to look at, but that might not be the case. Which one would you recommend looking at? I see this page has a good description.

There are a few different sets of events which count iTLB related behaviours. The misses that matter most are the ones that stall the pipeline. This is counted by FRONTEND_RETIRED.ITLB_MISS. https://github.com/intel/perfmon/blob/main/SKL/events/skylake_core.json#L5117-L5137

For a raw count of iTLB misses, which includes speculative execution, you can look at ICACHE_64B.IFTAG_STALL (alias ICACHE_TAG.STALLS). It's unfortunately an unintuitive name. https://github.com/intel/perfmon/blob/main/SKL/events/skylake_core.json#L2702-L2723

The set of events that use cpu/event=0x85 is meant to capture page walks triggered by both speculative and non-speculative execution (apart from one mask, which counts sTLB hits). So I would recommend looking into the two mentioned above to quantify the impact more accurately for hugepages.
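Following those recommendations, a measurement run might look like the sketch below. The binary name and arguments are placeholders, and exact event names vary across CPU generations and perf versions, so verify availability with `perf list` first.

```shell
# Sketch: comparing the iTLB-related counters discussed above on a
# Skylake-class CPU. Event names are CPU- and perf-version-dependent;
# check `perf list` before relying on them.

# Retired (pipeline-visible) iTLB misses:
perf stat -e frontend_retired.itlb_misses -- ./my_binary args

# Raw iTLB-miss fetch stalls, including speculative fetches:
perf stat -e icache_64b.iftag_stall -- ./my_binary args

# The raw-event form used in the original measurement:
perf stat -e cpu/event=0x85,umask=0x61/u -- ./my_binary args
```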

> Notice that when we turn on huge pages, the number of iTLB misses decreases by ~10x, and the absolute differences between the A and B sides are much smaller than when no huge pages are used.

Yes, as @rahmanl mentioned, with hugepages enabled there probably isn't much headroom for improvement (as reflected in the task-clock measurements).

Matt added a subscriber: Matt.Jun 22 2023, 4:18 PM

Thanks for teaching me how to measure the impact of instruction caches. While re-running the experiments with the new events, I realized that my earlier report was not using C^3 as the baseline. Instead, the numbers were on top of an improved code layout (referred to as hfsort+) utilized by BOLT, which is not relevant here; I apologize for the confusion.
Below are details of the latest run on the same clang benchmarks, with and without huge pages. In addition to comparing the new algorithm to C^3, I'm also including the numbers on top of the "input" ordering that comes from the compiler. Here I'm building the binary with LTO and AutoFDO, but I observe similar numbers when using instrumentation counts or other sampling-based profiling approaches (e.g., CSSPGO).

No hugepages:

                              cds          c^3 (delta, %)              input (delta, %)
benchmark1
frontend_retired.l1i_miss     69351242     70805714 (1.99 ± 0.14)      73665990 (5.88 ± 0.11)
icache_64b.iftag_stall        377880876    440763009 (14.67 ± 0.32)    615372537 (38.18 ± 0.41)
frontend_retired.itlb_miss    4917651      5823311 (15.58 ± 0.42)      8996999 (45.14 ± 0.36)
task-clock                    5348         5393 (0.72 ± 0.33)          5431 (1.47 ± 0.24)
benchmark2
frontend_retired.l1i_miss     61681268     63522660 (2.92 ± 0.27)      64953229 (5.01 ± 0.17)
icache_64b.iftag_stall        325869634    377176494 (13.49 ± 0.43)    495816988 (34.07 ± 0.38)
frontend_retired.itlb_miss    3814520      4502050 (15.33 ± 0.66)      7121213 (47.04 ± 0.27)
task-clock                    8311         8338 (0.32 ± 0.17)          8363 (0.62 ± 0.23)

With hugepages:

                              cds          c^3 (delta, %)              input (delta, %)
benchmark1
frontend_retired.l1i_miss     67951983     68463724 (0.75 ± 0.16)      70894569 (4.25 ± 1.32)
icache_64b.iftag_stall        132528699    151086801 (12.26 ± 0.26)    179977955 (26.24 ± 0.84)
frontend_retired.itlb_miss    255677       322445 (20.03 ± 0.41)       524749 (51.21 ± 1.04)
task-clock                    5287         5314 (0.51 ± 0.30)          5349 (1.27 ± 0.28)
benchmark2
frontend_retired.l1i_miss     59593334     60726917 (1.85 ± 0.34)      60064530 (0.97 ± 0.30)
icache_64b.iftag_stall        130194520    133511012 (2.67 ± 1.03)     146815089 (11.41 ± 1.05)
frontend_retired.itlb_miss    207543       259727 (20.07 ± 2.38)       416369 (50.05 ± 1.09)
task-clock                    8238         8266 (0.35 ± 0.38)          8276 (0.45 ± 0.26)

Besides these numbers, I can only share one data point on a large production service, where CDS outperforms C^3 by around 0.25%-0.3% CPU (with huge pages and many other optimizations turned on). Though I'd generalize the wins to other binaries/benchmarks with extra care, as they depend on a lot of factors.

Of course, using huge pages diminishes the impact of function reordering; yet it can still provide benefits.

Thanks for the updated results. They look more consistent now.

llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Please introduce cl::opts for these so they can be manually configured.

llvm/lib/Transforms/Utils/CodeLayout.cpp
1104

Here, you can use std::make_tuple(-L->gain(), L->srcChain()->Id, L->destChain()->Id) < std::make_tuple(-R->gain(), R->srcChain()->Id, R->destChain()->Id).
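The suggested comparison works because std::tuple compares lexicographically, so negating the gain sorts by decreasing gain first and breaks ties by chain ids for a deterministic order. A minimal illustration, where `Link` and its fields are hypothetical stand-ins for the real types in CodeLayout.cpp:

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the candidate-merge link type.
struct Link {
  double Gain;
  int SrcId, DstId;
};

// Tuples compare lexicographically: -Gain puts larger gains first,
// then SrcId and DstId break ties deterministically.
static bool linkLess(const Link &L, const Link &R) {
  return std::make_tuple(-L.Gain, L.SrcId, L.DstId) <
         std::make_tuple(-R.Gain, R.SrcId, R.DstId);
}
```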

1112

I wonder if we should make this a class member and then decompose this large function into smaller pieces.

1114

nit: Use consistent style ("Insert" instead of "Inserting").

1190

nit: missing period.

1194

nit: missing period.

1195

Is this ever used with Offset != 0?

1245

Can you use a better name for this function? I am thinking of something like computeFreqBasedLocalityGainForMerge or freqBasedLocalityGainForMerge. It's longer but it fully captures the semantics.

Same comment about mergeGainDist.

1269

Use a more representative name for this function.

1339–1342

Use std::make_tuple to compare these.

spupyrev updated this revision to Diff 538812.Jul 10 2023, 1:46 PM
spupyrev marked 6 inline comments as done.

comments

spupyrev marked an inline comment as done.Jul 10 2023, 1:57 PM
spupyrev added inline comments.
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Based on my experience, exposing such constants doesn't add much value but may confuse some people. For example, C^3 has a bunch of similar options that were tuned once on a specific benchmark years ago, and no one has ever tried to reconsider them :) I have similar experience with other algorithms, where the defaults were never touched. At this point I consider the values internal algorithm-specific constants that are not supposed to be modified.
However, cl::opts take just a few lines of code, so I'm happy to add them if you have a different opinion.

llvm/lib/Transforms/Utils/CodeLayout.cpp
1190

I'm using periods only for method-level comments (///). Is there a standard/preference around this?

rahmanl added inline comments.Jul 12 2023, 1:17 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
1017

Although this is pre-existing, I wonder why this is returning by parameter.

1072–1074

Drop braces for single-statement ifs.

1108

Same here.

1142

How about we simplify this using:
for (const auto &[Chain, ChainEdge] : BestSrcChain->Edges)
  Queue.erase(ChainEdge);

1153

const auto &[Chain, Edge] to avoid unnecessary copies and improve readability.

1182

This

1190

I'd use the correct punctuation everywhere per guidance from https://llvm.org/docs/CodingStandards.html#commenting
A comment ending without a period may imply it's been truncated.

1301

I wonder if this could be a source of a performance drop for larger programs. These chains are much bigger than the block chains and could have more edges to other chains. However, we currently use std::vector for ChainT::Edges, and removing elements from a vector one by one can result in quadratic time complexity. So a potential optimization is to remove all outdated edges in one shot using the combination of vector::erase and std::remove_if.
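The one-shot removal suggested above is the classic erase-remove idiom: std::remove_if compacts the surviving elements to the front in a single pass, and the trailing erase drops the rest, turning repeated O(n) erasures into one O(n) sweep. A minimal sketch, where `Edge` is a hypothetical stand-in for the chain-edge type:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the chain-edge type.
struct Edge {
  int DestChainId;
  bool Outdated;
};

// Remove all outdated edges in one linear sweep instead of erasing
// them one by one (which is quadratic in the worst case).
void pruneOutdatedEdges(std::vector<Edge> &Edges) {
  Edges.erase(std::remove_if(Edges.begin(), Edges.end(),
                             [](const Edge &E) { return E.Outdated; }),
              Edges.end());
}
```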

1329

We don't need to use stable_sort if we have tie-breaking by identifiers.

1340
rahmanl added inline comments.Jul 12 2023, 1:22 PM
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

The configurations here look very architecture-dependent. So I am assuming that as hardware improves, we may increase these parameters to get better results. WDYT?

spupyrev updated this revision to Diff 540523.Jul 14 2023, 11:54 AM
spupyrev marked 11 inline comments as done.

addressing comments (mostly adding periods)

spupyrev added inline comments.Jul 14 2023, 11:55 AM
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Added

llvm/lib/Transforms/Utils/CodeLayout.cpp
1301

As far as I can see, ChainT::mergeEdges does a bit more than removing items from the vectors; I don't see a simple way of reducing the complexity here. (But I am open to reviewing speedups, of course.)

rahmanl accepted this revision.Jul 21 2023, 10:57 AM
rahmanl added inline comments.
llvm/lib/Transforms/Utils/CodeLayout.cpp
1301

Sounds good. I'll take a stab at it later.

1344

const CDSortConfig Config to avoid dangling references.

1462

We don't need these I think.

This revision is now accepted and ready to land.Jul 21 2023, 10:57 AM

Thanks for the review, Rahman. I'll wait a few more days before landing in case anyone else has suggestions.