This is an archive of the discontinued LLVM Phabricator instance.

A new code layout algorithm for function reordering [2/3]
ClosedPublic

Authored by spupyrev on Jun 13 2023, 10:20 AM.

Details

Summary

We are bringing a new algorithm for function layout (reordering) based on the
call graph (extracted from a profile data). The algorithm is an improvement of
top of a known heuristic, C^3. It tries to co-locate hot and frequently executed
together functions in the resulting ordering. Unlike C^3, it explores a larger
search space and have an objective closely tied to the performance of
instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort.
The algorithm can be used at the linking or post-linking (e.g., BOLT) stage.

The algorithm shares some similarities with C^3 and an approach for basic block
reordering (ext-tsp). It works with chains (ordered lists)
of functions. Initially all chains are isolated functions. On every iteration,
we pick a pair of chains whose merging yields the biggest increase in the
objective, which is a weighted combination of frequency-based and distance-based
locality. That is, we try to co-locate hot functions together (so they can share
the cache lines) and functions frequently executed together. The merging process
stops when there is only one chain left, or when merging does not improve the
objective. In the latter case, the remaining chains are sorted by density in the
decreasing order.

Complexity
We regularly apply the algorithm for large data-center binaries containing 10K+
(hot) functions, and the algorithm takes only a few seconds. For some extreme
cases with 100K-1M nodes, the runtime is within minutes.

Perf-impact
We extensively tested the implementation extensively on a benchmark of isolated
binaries and prod services. The impact is measurable for "larger" binaries that
are front-end bound: the cpu time improvement (on top of C^3) is in the range
of [0% .. 1%], which is a result of a reduced i-TLB miss rate (by up to 20%) and
i-cache miss rate (up to 5%).

Diff Detail

Event Timeline

spupyrev created this revision.Jun 13 2023, 10:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 10:20 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
spupyrev published this revision for review.Jun 13 2023, 1:31 PM
spupyrev edited the summary of this revision. (Show Details)
spupyrev added reviewers: wenlei, hoy, wlei.
spupyrev added subscribers: Amir, maksfb.
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 1:31 PM
spupyrev added a subscriber: shatianw_mt.

What is the baseline of the performance measurement? Is it with AutoFDO or instrumentation FDO? I also wonder if performance number is measured with hugepage text or small pages (it is surprising to see 20% iTLB reduction over the baseline -- assuming with profile data).

I am also interested in getting more details about your evaluation. Currently, LLD uses CallChainClustering for FDO (https://github.com/llvm-mirror/lld/blob/master/ELF/CallGraphSort.cpp). I wonder how much perf. improvement we get if we hook this into LLD.

@rahmanl @davidxl Our measurements are always on top of C^3 (the one currently utilized by CallGraphSort.cpp). If we compare against "no-function-ordering" (ie looking at the order produced by the compiler), then the wins would be well beyond 1% cpu but that's not a realistic scenario these days. The linked follow-up diffs, D152840 and D153039, enable the ordering in the linker and in the BOLT. I'll share more specific numbers on my benchmarks soon.

Here are my measurements on the clang binary (release_14) by compiling two large cpp files (benchmark1 and benchmark2). Negative values are improvements, bold ones are stat sig.

  1. Using the alg in LLD (on top of D152840), with AutoFDO and without huge pages
basetestdelta(%)
benchmark1
task-clock6440.07 ± 16.946373.95 ± 14.70-1.01 ± 0.33
icache-misses218340412 ± 361175210448621 ± 387645-3.61 ± 0.16
itlb-misses45609238 ± 12922542629503 ± 46353-6.38 ± 0.16
benchmark2
task-clock9509.05 ± 21.749443.90 ± 30.05-0.73 ± 0.29
icache-misses174525893 ± 294760166852633 ± 253654-4.38 ± 0.16
itlb-misses36756578 ± 9016235175447 ± 91296-4.29 ± 0.29
  1. Using the alg in BOLT (on top of D153039), with AutoFDO and without huge pages
basetestdelta(%)
benchmark1
task-clock5398.95 ± 18.125366.50 ± 12.47-0.52 ± 0.39
icache-misses86342101 ± 25859885715814 ± 152068-0.63 ± 0.19
itlb-misses16891309 ± 4048015543677 ± 57307-7.99 ± 0.41
benchmark2
task-clock8307.96 ± 15.848316.72 ± 15.500.10 ± 0.20
icache-misses67742141 ± 47051565716219 ± 198470-2.82 ± 0.20
itlb-misses13591076 ± 9999812462672 ± 70100-8.18 ± 0.54
  1. Using the alg in BOLT (on top of D153039), with AutoFDO and with huge pages
basetestdelta(%)
benchmark1
task-clock5329.71 ± 38.165333.77 ± 17.210.31 ± 0.49
icache-misses89754736 ± 9308890480531 ± 2369960.69 ± 0.22
itlb-misses2279266 ± 150321973922 ± 13429-13.45 ± 0.86
benchmark2
task-clock8241.64 ± 16.928252.00 ± 13.550.15 ± 0.21
icache-misses69470543 ± 14185868224372 ± 177928-1.79 ± 0.32
itlb-misses1902566 ± 365422070742 ± 195589.27 ± 1.60

Here are my measurements on the clang binary (release_14) by compiling two large cpp files (benchmark1 and benchmark2). Negative values are improvements, bold ones are stat sig.

  1. Using the alg in LLD (on top of D152840), with AutoFDO and without huge pages
basetestdelta(%)
benchmark1
task-clock6440.07 ± 16.946373.95 ± 14.70-1.01 ± 0.33
icache-misses218340412 ± 361175210448621 ± 387645-3.61 ± 0.16
itlb-misses45609238 ± 12922542629503 ± 46353-6.38 ± 0.16
benchmark2
task-clock9509.05 ± 21.749443.90 ± 30.05-0.73 ± 0.29
icache-misses174525893 ± 294760166852633 ± 253654-4.38 ± 0.16
itlb-misses36756578 ± 9016235175447 ± 91296-4.29 ± 0.29
  1. Using the alg in BOLT (on top of D153039), with AutoFDO and without huge pages
basetestdelta(%)
benchmark1
task-clock5398.95 ± 18.125366.50 ± 12.47-0.52 ± 0.39
icache-misses86342101 ± 25859885715814 ± 152068-0.63 ± 0.19
itlb-misses16891309 ± 4048015543677 ± 57307-7.99 ± 0.41
benchmark2
task-clock8307.96 ± 15.848316.72 ± 15.500.10 ± 0.20
icache-misses67742141 ± 47051565716219 ± 198470-2.82 ± 0.20
itlb-misses13591076 ± 9999812462672 ± 70100-8.18 ± 0.54
  1. Using the alg in BOLT (on top of D153039), with AutoFDO and with huge pages
basetestdelta(%)
benchmark1
task-clock5329.71 ± 38.165333.77 ± 17.210.31 ± 0.49
icache-misses89754736 ± 9308890480531 ± 2369960.69 ± 0.22
itlb-misses2279266 ± 150321973922 ± 13429-13.45 ± 0.86
benchmark2
task-clock8241.64 ± 16.928252.00 ± 13.550.15 ± 0.21
icache-misses69470543 ± 14185868224372 ± 177928-1.79 ± 0.32
itlb-misses1902566 ± 365422070742 ± 195589.27 ± 1.60

The timing data with huge pages look expected. The icache miss and itlb miss data look puzzling though -- benchmark1 sees slight icache miss increase and huge itlb miss reduction while benchmark2 is the opposite. While the slight increase of icache miss can be the result of increase in conflict misses due to the use of huge pages, the increase in ITLB misses for benchmark2 is surprising.

Thanks for providing the evaluation results. This shows nice improvements when huge pages are not used. My intuition is that using huge pages reduces the opportunity from function reordering.

Would you please clarify what specific performance counters are used in the measurement? For iTLB misses icache_64b.iftag_miss includes speculative execution but frontend_retired.itlb_misses does not.

spupyrev added a comment.EditedJun 21 2023, 5:07 PM

Hmm. I was using "cpu/event=0x85,umask=0x61/u" for i-TLB misses, which we got from https://download.01.org/perfmon, which has even been moved since then. Back in 2017 (when the algorithm was developed) we thought this is the "right" event to look at, but it might not be the case. Which one would you recommend to look at? I see this page has a good description.

Notice that when we turn on huge pages, the amount of iTLB misses decreases by ~10x, and the absolute differences between A and B sides are much smaller than when no huge pages are used.

Hmm. I was using "cpu/event=0x85,umask=0x61/u" for i-TLB misses, which we got from https://download.01.org/perfmon, which has even been moved since then. Back in 2017 (when the algorithm was developed) we thought this is the "right" event to look at, but it might not be the case. Which one would you recommend to look at? I see this page has a good description.

There are a few different sets of events which count iTLB related behaviours. The misses that matter most are the ones that stall the pipeline. This is counted by FRONTEND_RETIRED.ITLB_MISS. https://github.com/intel/perfmon/blob/main/SKL/events/skylake_core.json#L5117-L5137

For a raw count of iTLB misses which include speculative execution you can look at ICACHE64B.IFTAG_STALL (alias ICACHE_TAG.STALLS). It's unfortunately an un-intuitive name. https://github.com/intel/perfmon/blob/main/SKL/events/skylake_core.json#L2702-L2723

The set of events which use cpu/event=0x85 are meant to capture speculative and non-speculative execution triggered page walks (apart from one mask which counts sTLB hits). So I would recommend looking into the two mentioned above to quantify the impact more accurately for hugepages.

Notice that when we turn on huge pages, the amount of iTLB misses decreases by ~10x, and the absolute differences between A and B sides are much smaller than when no huge pages are used.

Yes, as @rahmanl mentioned, with hugepages enabled there probably isn't much headroom for improvement (as reflected in the task-clock measurements).

Matt added a subscriber: Matt.Jun 22 2023, 4:18 PM

Thanks for teaching me how to measure the impact of instruction caches. While re-running the experiments with the new events, I realized that my earlier report was not using C^3 as the baseline. Instead the numbers were on top of an improved code layout (referred to hfsort+) utilized by BOLT, which is not relevant here; I apologize for the confusion.
Below are details of the latest run on the same clang benchmarks, with and without huge pages. In addition to comparing the new algorithm to C^3, i'm also including the numbers on top of the "input" ordering that comes from the compiler. Here I'm building the binary with LTO and AutoFDO, but observe similar numbers when using instrumentation counts or other sampling-based profiling approaches (e.g., CSSPGO).

No hugepages:

cdsc^3 (delta, %)input (delta, %)
benchmark1
frontend_retired.l1i_miss6935124270805714 (1.99 ± 0.14)73665990 (5.88 ± 0.11)
icache_64b.iftag_stall377880876440763009 (14.67 ± 0.32)615372537 (38.18 ± 0.41)
frontend_retired.itlb_miss49176515823311 (15.58 ± 0.42)8996999 (45.14 ± 0.36)
task-clock53485393 (0.72 ± 0.33)5431 (1.47 ± 0.24)
benchmark2
frontend_retired.l1i_miss6168126863522660 (2.92 ± 0.27)64953229 (5.01 ± 0.17)
icache_64b.iftag_stall325869634377176494 (13.49 ± 0.43)495816988 (34.07 ± 0.38)
frontend_retired.itlb_miss38145204502050 (15.33 ± 0.66)7121213 (47.04 ± 0.27)
task-clock83118338 (0.32 ± 0.17)8363 (0.62 ± 0.23)

With hugepages:

cdsc^3 (delta, %)input (delta, %)
benchmark1
frontend_retired.l1i_miss6795198368463724 (0.75 ± 0.16)70894569 (4.25 ± 1.32)
icache_64b.iftag_stall132528699151086801 (12.26 ± 0.26)179977955 (26.24 ± 0.84)
frontend_retired.itlb_miss255677322445 (20.03 ± 0.41)524749 (51.21 ± 1.04)
task-clock52875314 (0.51 ± 0.30)5349 (1.27 ± 0.28)
benchmark2
frontend_retired.l1i_miss5959333460726917 (1.85 ± 0.34)60064530 (0.97 ± 0.30)
icache_64b.iftag_stall130194520133511012 (2.67 ± 1.03)146815089 (11.41 ± 1.05)
frontend_retired.itlb_miss207543259727 (20.07 ± 2.38)416369 (50.05 ± 1.09)
task-clock82388266 (0.35 ± 0.38)8276 (0.45 ± 0.26)

Besides these numbers, I can only share one data-point on a large production service, where CDS outperforms C^3 by around 0.25%-0.3% cpu (with huge pages and many other optimizations turned on). Though I'd generalize the wins on other binaries/benchmarks with an extra care, as it depends on a lot of factors.

Of course, using huge pages diminishes the impact of function reordering; yet it can still provide benefits.

Thanks for the updated results. They look more consistent now.

llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Please introduce cl::opts for these so they can be manually configured.

llvm/lib/Transforms/Utils/CodeLayout.cpp
1097

Here, you can use std::make_tuple(-L->gain(), L->srcChain()->Id, L->destChain()->Id) < std::make_tuple(-R->gain(), R->srcChain()->Id, R->destChain()->Id).

1105

I wonder if we should make this a class member and then decompose this large function into smaller pieces.

1107

nit: Use consistent style ("Insert" instead of "Inserting").

1183

nit: missing period.

1187

nit: missing period.

1188

Is this ever used with Offset != 0?

1238

Can you use a better name for this function? I am thinking of something like computeFreqBasedLocalityGainForMerge or freqBasedLocalityGainForMerge. It's longer but it fully captures the semantics.

Same comment about mergeGainDist.

1262

Use a more representative naming for this function.

1332–1335

Use std::make_tuple to compare these.

spupyrev updated this revision to Diff 538812.Jul 10 2023, 1:46 PM
spupyrev marked 6 inline comments as done.

comments

spupyrev marked an inline comment as done.Jul 10 2023, 1:57 PM
spupyrev added inline comments.
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Based on my experience, exposing such constants doesn't add much value but may confuse some people. For example, C^3 has a bunch of similar options tuned once on a specific benchmark years ago, and no one has ever tried to re-consider them :) A have similar experience for other algorithms, where the defaults were never touched. At this time I consider the values as some internal algorithm-specific constants that are not supposed to be modified.
However, cl:opts take just a few lines of code, so i'm happy to add those if you have an alternative opinion.

llvm/lib/Transforms/Utils/CodeLayout.cpp
1183

I'm using commas only for method-level comments (///). Is there a standard/preference around this?

rahmanl added inline comments.Jul 12 2023, 1:17 PM
llvm/lib/Transforms/Utils/CodeLayout.cpp
1010

Although this is pre-existing, I wonder why this is returning by parameter.

1065–1067

Drop braces for single-statement ifs.

1101

Same here.

1135

How about we simply this using:
for (const auto &[Chain, ChainEdge]: BestSrcChain->Edges) Queue.erase(ChainEdge);

1146

const auto & [Chain, Edge] to avoid unnecessary copies and improve readability.

1175

This

1183

I'd use the correct punctuation everywhere per guidance from https://llvm.org/docs/CodingStandards.html#commenting
A comment ending without period may imply it's been truncated.

1294

I wonder if this could be a source of performance drop for larger programs. These chains are much bigger than the block chains and could have more edges with other chains. However, we currently use std::vector for ChainT::Edges and removing things from a vector one by one could result in quadratic time complexity. So a potential optimization is to remove all outdated edges in one shot using the combination of vector::erase and std::remove_if.

1322

We don't need to use stable_sort if we have tie-breaking by identifiers.

1333
rahmanl added inline comments.Jul 12 2023, 1:22 PM
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

The configurations here look very architecture-dependent. So I am assuming that as hardware improves, we may increase these parameters to get a better result. WDYT?

spupyrev updated this revision to Diff 540523.Jul 14 2023, 11:54 AM
spupyrev marked 11 inline comments as done.

addressing comments (mostly adding periods)

spupyrev added inline comments.Jul 14 2023, 11:55 AM
llvm/include/llvm/Transforms/Utils/CodeLayout.h
56

Added

llvm/lib/Transforms/Utils/CodeLayout.cpp
1294

As far as I can see, ChainT::mergeEdges does a bit more than removing the items from the vectors; I don't see a simple way of reducing the complexity here. (But I am open for reviewing speedups of course)

rahmanl accepted this revision.Jul 21 2023, 10:57 AM
rahmanl added inline comments.
llvm/lib/Transforms/Utils/CodeLayout.cpp
1294

Sounds good. I'll take a stab at it later.

1337

const CDSortConfig Config to avoid dangling references.

1467

We don't need these I think.

This revision is now accepted and ready to land.Jul 21 2023, 10:57 AM

Thanks for review Rahman. I'll wait a few more days before landing in case anyone else has suggestions.