
[lld][ELF] Add profile guided section layout
ClosedPublic

Authored by Bigcheese on Aug 4 2017, 9:13 PM.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

There are a very large number of changes, so older changes are hidden.
Bigcheese updated this revision to Diff 133550.Feb 8 2018, 6:05 PM
  • Update to top of tree.
  • Add some comments.
pdeng6 added a subscriber: pdeng6.Feb 9 2018, 1:18 AM

@Bigcheese

I'm trying function layout methodologies within ChromeOS to improve Chrome performance, and this feature is very important since that work relies on it.
I've checked out the patches for both lld (https://reviews.llvm.org/D36351?id=133550) and llvm (https://reviews.llvm.org/D42161 and https://reviews.llvm.org/D42163) and built them locally, but I'm afraid the llvm patches are not as fresh as the lld one.
Would it be possible for you to post the latest patches on both sides, along with the base llvm/lld revisions you are working on? Then we can completely align our environments.

thanks
Pan

espindola edited reviewers, added: espindola; removed: rafael.Mar 2 2018, 5:52 PM
Bigcheese updated this revision to Diff 137040.Mar 5 2018, 10:42 AM

Rewrote the algorithm to match hfsort. Now gets the same performance as hfsort in Rafael's testcase.

Bigcheese updated this revision to Diff 137539.Mar 7 2018, 8:03 PM

Address review comments.

Rafael Avila de Espindola <rafael.espindola@gmail.com> writes:

+ InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
+ InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
+ if (FromSec && ToSec)
+ Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] = Count;

This should be "+=", no? If there are multiple calls from one section to
another I would expect us to use the total in the graph.

Looks like the missing + gets us the last missing performance bits.

In the attached log "old" is a rebased version of your previous patch, "new"
is your current patch and "new-new" is your current patch with the above
fix.

Ah, and please git-clang-format :-)

Cheers,
Rafael
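Rafael's `+=` point can be seen in miniature. The sketch below is illustrative, not lld's actual types: a duplicated (from, to) pair in the profile should sum its counts, whereas plain `=` would silently keep only the last line.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for lld's Config->CallGraphProfile: a map from a
// (from, to) section pair to the total call count between them.
using SectionPair = std::pair<std::string, std::string>;
using Profile = std::map<SectionPair, uint64_t>;

struct ProfileLine {
  std::string From, To;
  uint64_t Count;
};

// Accumulates counts with "+=" so duplicate (from, to) pairs sum to a total
// edge weight rather than the last occurrence overwriting earlier ones.
Profile buildProfile(const std::vector<ProfileLine> &Lines) {
  Profile P;
  for (const ProfileLine &L : Lines)
    P[{L.From, L.To}] += L.Count; // "=" here would drop the first 100 below
  return P;
}
```

With `=` instead of `+=`, the pair ("A", "B") below would end up with weight 40 instead of the total 140, which is exactly the lost performance Rafael observed.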

Michael Spencer via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

Bigcheese updated this revision to Diff 137040.
Bigcheese added a comment.

Rewrote the algorithm to match hfsort. Now gets the same performance as hfsort in Rafael's testcase.

https://reviews.llvm.org/D36351

Files:

ELF/CMakeLists.txt
ELF/CallGraphSort.cpp
ELF/CallGraphSort.h
ELF/Config.h
ELF/Driver.cpp
ELF/Options.td
ELF/Writer.cpp
test/ELF/cgprofile-txt.s

Index: test/ELF/cgprofile-txt.s

--- /dev/null

+++ test/ELF/cgprofile-txt.s
@@ -0,0 +1,106 @@
+# REQUIRES: x86
+
+# RUN: llvm-mc -filetype=obj -triple=x86_64-unknown-linux %s -o %t
+# RUN: ld.lld -e A %t -o %t2
+# RUN: llvm-readobj -symbols %t2 | FileCheck %s --check-prefix=NOSORT
+
+# RUN: echo "A B 100" > %t.call_graph
+# RUN: echo "A C 40" >> %t.call_graph
+# RUN: echo "B C 30" >> %t.call_graph
+# RUN: echo "C D 90" >> %t.call_graph
+# RUN: echo "PP TS 100" >> %t.call_graph
+# RUN: echo "_init2 _init 24567837" >> %t.call_graph
+# RUN: echo "TS QC 9001" >> %t.call_graph
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: llvm-readobj -symbols %t2 | FileCheck %s
+
+ .section .text.D,"ax",@progbits
+D:
+ retq
+
+ .section .text.C,"ax",@progbits
+ .globl C
+C:
+ retq
+
+ .section .text.B,"ax",@progbits
+ .globl B
+B:
+ retq
+
+ .section .text.A,"ax",@progbits
+ .globl A
+A:
+ retq
+
+ .section .ponies,"ax",@progbits,unique,1
+ .globl TS
+TS:
+ retq
+
+ .section .ponies,"ax",@progbits,unique,2
+ .globl PP
+PP:
+ retq
+
+ .section .other,"ax",@progbits,unique,1
+ .globl QC
+QC:
+ retq
+
+ .section .other,"ax",@progbits,unique,2
+ .globl GB
+GB:
+ retq
+
+ .section .init,"ax",@progbits,unique,1
+ .globl _init
+_init:
+ retq
+
+ .section .init,"ax",@progbits,unique,2
+ .globl _init2
+_init2:
+ retq
+
+# CHECK: Name: D
+# CHECK-NEXT: Value: 0x201003
+# CHECK: Name: A
+# CHECK-NEXT: Value: 0x201000
+# CHECK: Name: B
+# CHECK-NEXT: Value: 0x201001
+# CHECK: Name: C
+# CHECK-NEXT: Value: 0x201002
+# CHECK: Name: GB
+# CHECK-NEXT: Value: 0x201007
+# CHECK: Name: PP
+# CHECK-NEXT: Value: 0x201004
+# CHECK: Name: QC
+# CHECK-NEXT: Value: 0x201006
+# CHECK: Name: TS
+# CHECK-NEXT: Value: 0x201005
+# CHECK: Name: _init
+# CHECK-NEXT: Value: 0x201008
+# CHECK: Name: _init2
+# CHECK-NEXT: Value: 0x201009
+
+# NOSORT: Name: D
+# NOSORT-NEXT: Value: 0x201000
+# NOSORT: Name: A
+# NOSORT-NEXT: Value: 0x201003
+# NOSORT: Name: B
+# NOSORT-NEXT: Value: 0x201002
+# NOSORT: Name: C
+# NOSORT-NEXT: Value: 0x201001
+# NOSORT: Name: GB
+# NOSORT-NEXT: Value: 0x201007
+# NOSORT: Name: PP
+# NOSORT-NEXT: Value: 0x201005
+# NOSORT: Name: QC
+# NOSORT-NEXT: Value: 0x201006
+# NOSORT: Name: TS
+# NOSORT-NEXT: Value: 0x201004
+# NOSORT: Name: _init
+# NOSORT-NEXT: Value: 0x201008
+# NOSORT: Name: _init2
+# NOSORT-NEXT: Value: 0x201009

Index: ELF/Writer.cpp

--- ELF/Writer.cpp

+++ ELF/Writer.cpp
@@ -9,6 +9,7 @@

#include "Writer.h"
#include "AArch64ErrataFix.h"
+#include "CallGraphSort.h"
#include "Config.h"
#include "Filesystem.h"
#include "LinkerScript.h"
@@ -1110,6 +1111,14 @@
// If no layout was provided by linker script, we want to apply default
// sorting for special input sections. This also handles --symbol-ordering-file.
template <class ELFT> void Writer<ELFT>::sortInputSections() {
+ // Use the rarely used option -call-graph-ordering-file to sort sections.
+ if (!Config->CallGraphProfile.empty()) {
+ DenseMap<const InputSectionBase *, int> Order =
+ computeCallGraphProfileOrder();
+ for (BaseCommand *Base : Script->SectionCommands)
+ if (auto *Sec = dyn_cast<OutputSection>(Base))
+ sortSection(Sec, Order);
+ }

// Build the order once since it is expensive.
DenseMap<const InputSectionBase *, int> Order = buildSectionOrder();
for (BaseCommand *Base : Script->SectionCommands)

Index: ELF/Options.td

--- ELF/Options.td

+++ ELF/Options.td
@@ -66,6 +66,9 @@

"Only set DT_NEEDED for shared libraries if used",
"Always set DT_NEEDED for shared libraries">;

+defm call_graph_ordering_file: Eq<"call-graph-ordering-file">,
+ HelpText<"Layout sections to optimize the given callgraph">;
+
// -chroot doesn't have a help text because it is an internal option.
defm chroot: Eq<"chroot">;

Index: ELF/Driver.cpp

--- ELF/Driver.cpp

+++ ELF/Driver.cpp
@@ -571,6 +571,31 @@

return {BuildIdKind::None, {}};

}

+static void readCallGraph(MemoryBufferRef MB) {
+ // Build a map from symbol name to section
+ DenseMap<StringRef, InputSectionBase *> SymbolSection;
+ for (InputFile *File : ObjectFiles)
+ for (Symbol *Sym : File->getSymbols())
+ if (auto *D = dyn_cast<Defined>(Sym))
+ if (auto *IS = dyn_cast_or_null<InputSectionBase>(D->Section))
+ SymbolSection[D->getName()] = IS;
+
+ std::vector<StringRef> Lines = args::getLines(MB);
+ for (StringRef L : Lines) {
+ SmallVector<StringRef, 3> Fields;
+ L.split(Fields, ' ');
+ if (Fields.size() != 3)
+ fatal("parse error");
+ uint64_t Count;
+ if (!to_integer(Fields[2], Count))
+ fatal("parse error");
+ InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
+ InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
+ if (FromSec && ToSec)
+ Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] = Count;
+ }
+}
+
static bool getCompressDebugSections(opt::InputArgList &Args) {

StringRef S = Args.getLastArgValue(OPT_compress_debug_sections, "none");
if (S == "none")

@@ -1118,6 +1143,10 @@

// Apply symbol renames for -wrap.
Symtab->applySymbolWrap();

+ if (auto *Arg = Args.getLastArg(OPT_call_graph_ordering_file))
+ if (Optional<MemoryBufferRef> Buffer = readFile(Arg->getValue()))
+ readCallGraph(*Buffer);
+

// Now that we have a complete list of input files.
// Beyond this point, no new files are added.
// Aggregate all input sections into one place.

Index: ELF/Config.h

--- ELF/Config.h

+++ ELF/Config.h
@@ -24,6 +24,7 @@
namespace elf {

class InputFile;
+class InputSectionBase;

enum ELFKind {

ELFNoneKind,

@@ -103,6 +104,9 @@

std::vector<SymbolVersion> VersionScriptGlobals;
std::vector<SymbolVersion> VersionScriptLocals;
std::vector<uint8_t> BuildIdVector;

+ llvm::MapVector<std::pair<const InputSectionBase *, const InputSectionBase *>,
+ uint64_t>
+ CallGraphProfile;

bool AllowMultipleDefinition;
bool AndroidPackDynRelocs = false;
bool ARMHasBlx = false;

Index: ELF/CallGraphSort.h

--- /dev/null

+++ ELF/CallGraphSort.h
@@ -0,0 +1,23 @@
+//===- CallGraphSort.h ------------------------------------------*- C++ -*-===//
+//
+//                             The LLVM Linker
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLD_ELF_CALL_GRAPH_SORT_H
+#define LLD_ELF_CALL_GRAPH_SORT_H
+
+#include "llvm/ADT/DenseMap.h"
+
+namespace lld {
+namespace elf {
+class InputSectionBase;
+
+llvm::DenseMap<const InputSectionBase *, int> computeCallGraphProfileOrder();
+} // namespace elf
+} // namespace lld
+
+#endif

Index: ELF/CallGraphSort.cpp

--- /dev/null

+++ ELF/CallGraphSort.cpp
@@ -0,0 +1,350 @@
+//===- CallGraphSort.cpp --------------------------------------------------===//
+//
+//                             The LLVM Linker
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+///
+/// Implementation of Call-Chain Clustering from: Optimizing Function Placement
+/// for Large-Scale Data-Center Applications
+/// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+///
+/// The goal of this algorithm is to improve runtime performance of the final
+/// executable by arranging code sections such that page table and i-cache
+/// misses are minimized.
+///
+/// Definitions:
+/// * Cluster
+///   * An ordered list of input sections which are laid out as a unit. At the
+///     beginning of the algorithm each input section has its own cluster and
+///     the weight of the cluster is the sum of the weight of all incoming
+///     edges.
+/// * Call-Chain Clustering (C³) Heuristic
+///   * Defines when and how clusters are combined. Pick the highest weighted
+///     input section then add it to its most likely predecessor if it wouldn't
+///     penalize it too much.
+/// * Density
+///   * The weight of the cluster divided by the size of the cluster. This is a
+///     proxy for the amount of execution time spent per byte of the cluster.
+///
+/// It does so given a call graph profile by the following:
+/// * Build a weighted call graph from the profile
+/// * Sort input sections by weight
+/// * For each input section starting with the highest weight
+///   * Find its most likely predecessor cluster
+///   * Check if the combined cluster would be too large, or would have too low
+///     a density.
+///   * If not, then combine the clusters.
+/// * Sort non-empty clusters by density
+///
+//===----------------------------------------------------------------------===//
+
+#include "CallGraphSort.h"
+#include "OutputSections.h"
+#include "SymbolTable.h"
+#include "Symbols.h"
+
+#include "llvm/Support/MathExtras.h"
+
+using namespace llvm;
+using namespace lld;
+using namespace lld::elf;
+
+namespace {
+using ClusterIndex = std::ptrdiff_t;
+using SectionIndex = std::ptrdiff_t;
+using EdgeIndex = std::ptrdiff_t;
+
+
+// Used for calculating and comparing density. Use soft-float for determinism.
+struct Double : APFloat {
+ Double() : APFloat(APFloat::IEEEdouble(), 0) {}
+ Double(uint64_t Val) : APFloat(APFloat::IEEEdouble(), Val) {}
+ Double(APFloat A) : APFloat(A) {}
+ bool operator>(const Double Other) const {
+ return compare(Other) == cmpGreaterThan;
+ }
+ bool operator<(const Double Other) const {
+ return compare(Other) == cmpLessThan;
+ }
+};
+
+struct Cluster {
+ Cluster(SectionIndex Sec, const InputSectionBase *IS);
+
+ Double getDensity() const {
+ if (Size == 0)
+ return 0;
+ return Double(Weight) / Double(Size);
+ }
+
+ std::vector<const InputSectionBase *> ISBs;
+ std::vector<SectionIndex> Sections;
+ int64_t Size = 0;
+ uint64_t Weight = 0;
+};
+
+struct Section {
+ Section(const InputSectionBase *IS) : ISB(IS) { Size = ISB->getSize(); }
+
+ Double getDensity() const {
+ if (Size == 0)
+ return 0;
+ return Double(Weight) / Double(Size);
+ }
+
+ int64_t Size = 0;
+ uint64_t Weight = 0;
+ const InputSectionBase *ISB;
+ std::vector<SectionIndex> Preds;
+ std::vector<SectionIndex> Succs;
+};
+
+struct Edge {
+ SectionIndex From;
+ SectionIndex To;
+ uint64_t Weight;
+ Double NormalizedWeight = 0;
+
+ bool operator==(const Edge Other) const;
+};
+
+struct EdgeDenseMapInfo {
+ static Edge getEmptyKey() {
+ return {DenseMapInfo<SectionIndex>::getEmptyKey(),
+ DenseMapInfo<SectionIndex>::getEmptyKey(), 0, 0};
+ }
+ static Edge getTombstoneKey() {
+ return {DenseMapInfo<SectionIndex>::getTombstoneKey(),
+ DenseMapInfo<SectionIndex>::getTombstoneKey(), 0, 0};
+ }
+ static unsigned getHashValue(const Edge &Val) {
+ return hash_combine(DenseMapInfo<SectionIndex>::getHashValue(Val.From),
+ DenseMapInfo<SectionIndex>::getHashValue(Val.To));
+ }
+ static bool isEqual(const Edge &LHS, const Edge &RHS) { return LHS == RHS; }
+};
+
+class CallGraphSort {
+public:
+ CallGraphSort();
+
+ DenseMap<const InputSectionBase *, int> run();
+
+private:
+ DenseMap<Edge, EdgeIndex, EdgeDenseMapInfo> EdgeMap;
+ std::vector<Cluster> Clusters;
+ std::vector<Edge> Edges;
+ std::vector<Section> Sections;
+
+ void normalizeEdgeWeights();
+ void generateClusters();
+};
+
+// Maximum amount the combined cluster density can be worse than the original
+// cluster to consider merging.
+constexpr int MAX_DENSITY_DEGRADATION = 8;
+
+// Maximum cluster size in bytes.
+constexpr uint64_t MAX_CLUSTER_SIZE = 1024 * 1024;
+} // end anonymous namespace
+
+Cluster::Cluster(SectionIndex Sec, const InputSectionBase *IS) {
+ ISBs.push_back(IS);
+ Sections.push_back(Sec);
+ Size = IS->getSize();
+}
+
+bool Edge::operator==(const Edge Other) const {
+ return From == Other.From && To == Other.To;
+}
+
+// Take the edge list in Config->CallGraphProfile, resolve symbol names to
+// Symbols, and generate a graph between InputSections with the provided
+// weights.
+CallGraphSort::CallGraphSort() {
+ MapVector<std::pair<const InputSectionBase *, const InputSectionBase *>,
+ uint64_t> &Profile = Config->CallGraphProfile;
+ DenseMap<const InputSectionBase *, SectionIndex> SecToSec;
+
+ auto GetOrCreateNode = [&](const InputSectionBase *IS) -> SectionIndex {
+ auto Res = SecToSec.insert(std::make_pair(IS, Sections.size()));
+ if (Res.second)
+ Sections.emplace_back(IS);
+ return Res.first->second;
+ };
+
+  // Create the graph.
+ for (const auto &C : Profile) {
+ const InputSectionBase *FromSB = C.first.first;
+ const InputSectionBase *ToSB = C.first.second;
+ uint64_t Weight = C.second;
+
+ if (Weight == 0)
+ continue;
+
+    // Ignore edges between input sections belonging to different output
+    // sections. This is done because otherwise we would end up with clusters
+    // containing input sections that can't actually be placed adjacently in
+    // the output. This messes with the cluster size and density calculations.
+    // We would also end up moving input sections in other output sections
+    // without moving them closer to what calls them.
+ if (FromSB->getOutputSection() != ToSB->getOutputSection())
+ continue;
+
+ SectionIndex From = GetOrCreateNode(FromSB);
+ SectionIndex To = GetOrCreateNode(ToSB);
+
+ Sections[To].Weight = SaturatingAdd(Sections[To].Weight, Weight);
+
+ if (From == To)
+ continue;
+
+ Edge E{From, To, Weight};
+
+    // Add or increment an edge.
+ auto Res = EdgeMap.insert(std::make_pair(E, Edges.size()));
+ EdgeIndex EI = Res.first->second;
+ if (Res.second) {
+ Edges.push_back(E);
+ Sections[From].Succs.push_back(To);
+ Sections[To].Preds.push_back(From);
+ } else
+ Edges[EI].Weight = SaturatingAdd(Edges[EI].Weight, Weight);
+ }
+ normalizeEdgeWeights();
+}
+
+// Normalize the edge weights so that we can reject edges which have a low
+// probability.
+void CallGraphSort::normalizeEdgeWeights() {
+ for (SectionIndex SI = 0, SE = Sections.size(); SI != SE; ++SI) {
+ Section &S = Sections[SI];
+ for (SectionIndex PI : S.Preds) {
+ Edge &E = Edges[EdgeMap[{PI, SI, 0, 0}]];
+ if (S.Weight == 0)
+ continue;
+ E.NormalizedWeight = Double(E.Weight) / Double(S.Weight);
+ }
+ }
+}
+
+// It's bad to merge clusters which would degrade the density too much.
+static bool isNewDensityBad(Cluster &A, Cluster &B) {
+ Double NewDensity = Double(A.Weight + B.Weight) / Double(A.Size + B.Size);
+ if (A.getDensity() >
+ NewDensity * Double(MAX_DENSITY_DEGRADATION))
+ return true;
+ return false;
+}
+
+static void mergeClusters(Cluster &Into, Cluster &From) {
+ Into.ISBs.insert(Into.ISBs.end(), From.ISBs.begin(), From.ISBs.end());
+ Into.Sections.insert(Into.Sections.end(), From.Sections.begin(),
+ From.Sections.end());
+ Into.Size += From.Size;
+ Into.Weight += From.Weight;
+ From.ISBs.clear();
+ From.Sections.clear();
+ From.Size = 0;
+ From.Weight = 0;
+}
+
+// Group InputSections into clusters using the Call-Chain Clustering heuristic
+// then sort the clusters by density.
+void CallGraphSort::generateClusters() {
+  // Minimum edge probability to consider merging.
+ const Double MIN_EDGE_PROBABILITY = Double(1) / Double(10);
+
+ std::vector<SectionIndex> SortedSecs;
+ std::vector<Cluster *> SecToCluster(Sections.size());
+
+ Clusters.reserve(Sections.size());
+
+ for (SectionIndex SI = 0, SE = Sections.size(); SI != SE; ++SI) {
+ Cluster C(SI, Sections[SI].ISB);
+ C.Size = Sections[SI].Size;
+ C.Weight = Sections[SI].Weight;
+ Clusters.push_back(C);
+ SortedSecs.push_back(SI);
+ }
+
+ for (Cluster &C : Clusters) {
+ SecToCluster[C.Sections.front()] = &C;
+ }
+
+ std::stable_sort(SortedSecs.begin(), SortedSecs.end(),
+ [&](SectionIndex A, SectionIndex B) {
+ return Sections[A].getDensity() > Sections[B].getDensity();
+ });
+
+ for (SectionIndex SI : SortedSecs) {
+ Cluster &C = *SecToCluster[SI];
+
+ SectionIndex BestPred = -1;
+ Double BestWeight = 0;
+
+ for (SectionIndex PI : Sections[SI].Preds) {
+ Edge &E = Edges[EdgeMap[{PI, SI, 0, 0}]];
+ if (BestPred == -1 || E.NormalizedWeight > BestWeight) {
+ BestPred = PI;
+ BestWeight = E.NormalizedWeight;
+ }
+ }
+
+ if (BestWeight < MIN_EDGE_PROBABILITY)
+ continue;
+
+ Cluster *PredC = SecToCluster[BestPred];
+ if (PredC == nullptr || PredC == &C)
+ continue;
+
+ if (C.Size + PredC->Size > MAX_CLUSTER_SIZE)
+ continue;
+
+ if (isNewDensityBad(*PredC, C))
+ continue;
+
+ for (SectionIndex SI : C.Sections)
+ SecToCluster[SI] = PredC;
+
+ mergeClusters(*PredC, C);
+ }
+
+  // Remove empty or dead nodes.
+ Clusters.erase(std::remove_if(Clusters.begin(), Clusters.end(),
+ [](const Cluster &C) {
+ return C.Size == 0 || C.Sections.empty();
+ }),
+ Clusters.end());
+
+  // Sort by density. Invalidates all NodeIndexes.
+ std::sort(Clusters.begin(), Clusters.end(),
+ [](const Cluster &A, const Cluster &B) {
+ return A.getDensity() > B.getDensity();
+ });
+}
+
+DenseMap<const InputSectionBase *, int> CallGraphSort::run() {
+ generateClusters();
+
+  // Generate order.
+ llvm::DenseMap<const InputSectionBase *, int> OrderMap;
+ ssize_t CurOrder = 1;
+
+ for (const Cluster &C : Clusters)
+ for (const InputSectionBase *IS : C.ISBs)
+ OrderMap[IS] = CurOrder++;
+
+ return OrderMap;
+}
+
+// Sort sections by the profile data provided by --call-graph-ordering-file.
+//
+// This first builds a call graph based on the profile data then merges sections
+// according to the C³ heuristic. All clusters are then sorted by a density
+// metric to further improve locality.
+DenseMap<const InputSectionBase *, int> elf::computeCallGraphProfileOrder() {
+ return CallGraphSort().run();
+}

Index: ELF/CMakeLists.txt

--- ELF/CMakeLists.txt

+++ ELF/CMakeLists.txt
@@ -19,6 +19,7 @@

Arch/SPARCV9.cpp
Arch/X86.cpp
Arch/X86_64.cpp

+ CallGraphSort.cpp

Driver.cpp
DriverUtils.cpp
EhFrame.cpp

llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
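For readers following along, the clustering pass described in the patch's file header can be sketched compactly. The structure mirrors the description (per-section clusters, hottest-first visitation, merge into the best predecessor unless the size cap or density degradation bound is violated, final sort by density), but all names, constants, and the simplified predecessor search below are illustrative, not lld's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch of the Call-Chain Clustering (C3) heuristic.
struct Edge { int From, To; uint64_t Weight; };

struct Cluster {
  std::vector<int> Sections;
  uint64_t Size = 0, Weight = 0;
  double density() const { return Size ? double(Weight) / double(Size) : 0; }
};

constexpr uint64_t MaxClusterSize = 1024 * 1024; // hypothetical size cap
constexpr int MaxDensityDegradation = 8;         // hypothetical bound

// Returns a section order: sections of denser clusters come first, and
// sections merged into one cluster stay adjacent.
std::vector<int> c3Order(const std::vector<uint64_t> &Sizes,
                         const std::vector<Edge> &Edges) {
  int N = static_cast<int>(Sizes.size());
  std::vector<Cluster> Clusters(N);
  std::vector<int> ClusterOf(N);
  std::vector<uint64_t> SecWeight(N, 0);
  for (const Edge &E : Edges)
    SecWeight[E.To] += E.Weight; // weight = sum of incoming edges
  for (int I = 0; I < N; ++I) {  // every section starts as its own cluster
    Clusters[I] = {{I}, Sizes[I], SecWeight[I]};
    ClusterOf[I] = I;
  }
  // Visit sections from hottest to coldest.
  std::vector<int> ByWeight(N);
  for (int I = 0; I < N; ++I) ByWeight[I] = I;
  std::stable_sort(ByWeight.begin(), ByWeight.end(),
                   [&](int A, int B) { return SecWeight[A] > SecWeight[B]; });
  for (int Sec : ByWeight) {
    // Most likely predecessor: the caller contributing the most weight.
    // (lld normalizes weights and applies a minimum probability; omitted.)
    int Best = -1;
    uint64_t BestW = 0;
    for (const Edge &E : Edges)
      if (E.To == Sec && E.From != Sec && E.Weight > BestW) {
        Best = E.From;
        BestW = E.Weight;
      }
    if (Best == -1 || ClusterOf[Best] == ClusterOf[Sec])
      continue;
    Cluster &P = Clusters[ClusterOf[Best]], &C = Clusters[ClusterOf[Sec]];
    if (P.Size + C.Size > MaxClusterSize)
      continue; // combined cluster would be too large
    double NewDensity = double(P.Weight + C.Weight) / double(P.Size + C.Size);
    if (P.density() > NewDensity * MaxDensityDegradation)
      continue; // merging would degrade density too much
    int Target = ClusterOf[Best];
    for (int S : C.Sections) ClusterOf[S] = Target;
    P.Sections.insert(P.Sections.end(), C.Sections.begin(), C.Sections.end());
    P.Size += C.Size;
    P.Weight += C.Weight;
    C = Cluster(); // emptied clusters contribute nothing below
  }
  // Sort non-empty clusters by density, then flatten.
  std::stable_sort(Clusters.begin(), Clusters.end(),
                   [](const Cluster &A, const Cluster &B) {
                     return A.density() > B.density();
                   });
  std::vector<int> Order;
  for (const Cluster &C : Clusters)
    for (int S : C.Sections) Order.push_back(S);
  return Order;
}
```

On the A/B/C/D portion of the test's call graph (A calls B and C, B calls C, C calls D), this sketch chains everything into one cluster in caller-before-callee order, matching the sorted symbol addresses the test expects.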

Michael Spencer via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

@@ -1110,6 +1111,14 @@
// If no layout was provided by linker script, we want to apply default
// sorting for special input sections. This also handles --symbol-ordering-file.
template <class ELFT> void Writer<ELFT>::sortInputSections() {
+ // Use the rarely used option -call-graph-ordering-file to sort sections.
+ if (!Config->CallGraphProfile.empty()) {
+ DenseMap<const InputSectionBase *, int> Order =
+ computeCallGraphProfileOrder();
+ for (BaseCommand *Base : Script->SectionCommands)
+ if (auto *Sec = dyn_cast<OutputSection>(Base))
+ sortSection(Sec, Order);
+ }

Instead of adding this if you can change buildSectionOrder:

static DenseMap<const InputSectionBase *, int> buildSectionOrder() {

if (!Config->CallGraphProfile.empty())
  return computeCallGraphProfileOrder();
...

+ InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
+ InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
+ if (FromSec && ToSec)
+ Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] = Count;

This should be "+=", no? If there are multiple calls from one section to
another I would expect us to use the total in the graph.

+struct EdgeDenseMapInfo {
+ static Edge getEmptyKey() {
+ return {DenseMapInfo<SectionIndex>::getEmptyKey(),
+ DenseMapInfo<SectionIndex>::getEmptyKey(), 0, 0};
+ }

This doesn't build with clang:

/home/espindola/llvm/llvm/tools/lld/ELF/CallGraphSort.cpp:115:12: error:
no matching constructor for initialization of '(anonymous
namespace)::Edge'

return {DenseMapInfo<SectionIndex>::getEmptyKey(),
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/home/espindola/llvm/llvm/tools/lld/ELF/CallGraphSort.cpp:104:8: note:
candidate constructor (the implicit copy constructor) not viable:
requires 1 argument, but 4 were provided
struct Edge {

^

/home/espindola/llvm/llvm/tools/lld/ELF/CallGraphSort.cpp:104:8: note:
candidate constructor (the implicit move constructor) not viable:
requires 1 argument, but 4 were provided
/home/espindola/llvm/llvm/tools/lld/ELF/CallGraphSort.cpp:104:8: note:
candidate constructor (the implicit default constructor) not viable:
requires 0 arguments, but 4 were provided

Cheers,
Rafael
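The error Rafael hit follows from a well-known rule: in C++11, a default member initializer disqualifies a class from aggregate initialization, so brace-initializing all four `Edge` fields fails to compile (C++14 relaxed this). A minimal illustration with hypothetical names, matching the fix the next diff applies by dropping the `= 0`:

```cpp
#include <cassert>
#include <cstdint>

// In C++11, the "= 0" below makes EdgeBroken a non-aggregate, so
//   EdgeBroken E{1, 2, 3, 0};
// is rejected: "no matching constructor for initialization".
struct EdgeBroken {
  int From;
  int To;
  uint64_t Weight;
  double NormalizedWeight = 0; // default member initializer
};

// Without the initializer, EdgeFixed is an aggregate in all standards, so
// brace initialization of all fields works.
struct EdgeFixed {
  int From;
  int To;
  uint64_t Weight;
  double NormalizedWeight;
};

EdgeFixed makeEdge(int From, int To, uint64_t Weight) {
  return {From, To, Weight, 0}; // aggregate init; every field explicit
}
```

The trade-off is that callers must now spell out `NormalizedWeight` at every brace-init site (as the `{PI, SI, 0, 0}` lookups do), or clang warns with -Wmissing-field-initializers, which is exactly the follow-up warning reported later in this thread.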

Michael Spencer via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

+struct Section {
+ Section(const InputSectionBase *IS) : ISB(IS) { Size = ISB->getSize(); }
+
+ Double getDensity() const {
+ if (Size == 0)
+ return 0;
+ return Double(Weight) / Double(Size);
+ }
+
+ int64_t Size = 0;

Why is Size signed? With the previous errors fixed I get a warning:

CallGraphSort.cpp:302:30: warning: comparison of integers of different
signs: 'long' and 'const uint64_t' (aka 'const unsigned long')

Cheers,
Rafael
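The warning comes from comparing the signed `int64_t Size` against the unsigned `MAX_CLUSTER_SIZE`. A small sketch of why that mix is dangerous (names and the size cap are illustrative); the updated diff side-steps it by making the sizes unsigned:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MaxClusterSize = 1024 * 1024; // hypothetical cap

// With a signed operand, the usual arithmetic conversions turn the int64_t
// into uint64_t, so a negative size wraps to a huge positive value and the
// comparison "succeeds" for the wrong reason. The explicit cast here shows
// what the compiler does implicitly (and warns about with -Wsign-compare).
bool tooBigSigned(int64_t CombinedSize) {
  return static_cast<uint64_t>(CombinedSize) > MaxClusterSize;
}

// With both sides uint64_t there is no conversion and no warning.
bool tooBigUnsigned(uint64_t CombinedSize) {
  return CombinedSize > MaxClusterSize;
}
```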

I now get this warning:

/home/espindola/llvm/llvm/tools/lld/ELF/CallGraphSort.cpp:204:28:
warning: missing field 'NormalizedWeight' initializer
[-Wmissing-field-initializers]

Edge E{From, To, Weight};

And git-clang-format still produces this diff:

- if (A.getDensity() >
-     NewDensity * Double(MAX_DENSITY_DEGRADATION))

+ if (A.getDensity() > NewDensity * Double(MAX_DENSITY_DEGRADATION))

Cheers,
Rafael

Michael Spencer via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

Bigcheese updated this revision to Diff 137539.
Bigcheese added a comment.

Address review comments.

https://reviews.llvm.org/D36351

Files:

ELF/CMakeLists.txt
ELF/CallGraphSort.cpp
ELF/CallGraphSort.h
ELF/Config.h
ELF/Driver.cpp
ELF/Options.td
ELF/Writer.cpp
test/ELF/cgprofile-txt.s

Index: test/ELF/cgprofile-txt.s

--- /dev/null

+++ test/ELF/cgprofile-txt.s
@@ -0,0 +1,106 @@
+# REQUIRES: x86
+
+# RUN: llvm-mc -filetype=obj -triple=x86_64-unknown-linux %s -o %t
+# RUN: ld.lld -e A %t -o %t2
+# RUN: llvm-readobj -symbols %t2 | FileCheck %s --check-prefix=NOSORT
+
+# RUN: echo "A B 100" > %t.call_graph
+# RUN: echo "A C 40" >> %t.call_graph
+# RUN: echo "B C 30" >> %t.call_graph
+# RUN: echo "C D 90" >> %t.call_graph
+# RUN: echo "PP TS 100" >> %t.call_graph
+# RUN: echo "_init2 _init 24567837" >> %t.call_graph
+# RUN: echo "TS QC 9001" >> %t.call_graph
+# RUN: ld.lld -e A %t --call-graph-ordering-file %t.call_graph -o %t2
+# RUN: llvm-readobj -symbols %t2 | FileCheck %s
+
+ .section .text.D,"ax",@progbits
+D:
+ retq
+
+ .section .text.C,"ax",@progbits
+ .globl C
+C:
+ retq
+
+ .section .text.B,"ax",@progbits
+ .globl B
+B:
+ retq
+
+ .section .text.A,"ax",@progbits
+ .globl A
+A:
+ retq
+
+ .section .ponies,"ax",@progbits,unique,1
+ .globl TS
+TS:
+ retq
+
+ .section .ponies,"ax",@progbits,unique,2
+ .globl PP
+PP:
+ retq
+
+ .section .other,"ax",@progbits,unique,1
+ .globl QC
+QC:
+ retq
+
+ .section .other,"ax",@progbits,unique,2
+ .globl GB
+GB:
+ retq
+
+ .section .init,"ax",@progbits,unique,1
+ .globl _init
+_init:
+ retq
+
+ .section .init,"ax",@progbits,unique,2
+ .globl _init2
+_init2:
+ retq
+
+# CHECK: Name: D
+# CHECK-NEXT: Value: 0x201003
+# CHECK: Name: A
+# CHECK-NEXT: Value: 0x201000
+# CHECK: Name: B
+# CHECK-NEXT: Value: 0x201001
+# CHECK: Name: C
+# CHECK-NEXT: Value: 0x201002
+# CHECK: Name: GB
+# CHECK-NEXT: Value: 0x201007
+# CHECK: Name: PP
+# CHECK-NEXT: Value: 0x201004
+# CHECK: Name: QC
+# CHECK-NEXT: Value: 0x201006
+# CHECK: Name: TS
+# CHECK-NEXT: Value: 0x201005
+# CHECK: Name: _init
+# CHECK-NEXT: Value: 0x201008
+# CHECK: Name: _init2
+# CHECK-NEXT: Value: 0x201009
+
+# NOSORT: Name: D
+# NOSORT-NEXT: Value: 0x201000
+# NOSORT: Name: A
+# NOSORT-NEXT: Value: 0x201003
+# NOSORT: Name: B
+# NOSORT-NEXT: Value: 0x201002
+# NOSORT: Name: C
+# NOSORT-NEXT: Value: 0x201001
+# NOSORT: Name: GB
+# NOSORT-NEXT: Value: 0x201007
+# NOSORT: Name: PP
+# NOSORT-NEXT: Value: 0x201005
+# NOSORT: Name: QC
+# NOSORT-NEXT: Value: 0x201006
+# NOSORT: Name: TS
+# NOSORT-NEXT: Value: 0x201004
+# NOSORT: Name: _init
+# NOSORT-NEXT: Value: 0x201008
+# NOSORT: Name: _init2
+# NOSORT-NEXT: Value: 0x201009

Index: ELF/Writer.cpp

--- ELF/Writer.cpp

+++ ELF/Writer.cpp
@@ -9,6 +9,7 @@

#include "Writer.h"
#include "AArch64ErrataFix.h"
+#include "CallGraphSort.h"
#include "Config.h"
#include "Filesystem.h"
#include "LinkerScript.h"
@@ -1020,6 +1021,10 @@
// Builds section order for handling --symbol-ordering-file.
static DenseMap<const InputSectionBase *, int> buildSectionOrder() {

DenseMap<const InputSectionBase *, int> SectionOrder;

+ // Use the rarely used option -call-graph-ordering-file to sort sections.
+ if (!Config->CallGraphProfile.empty())
+ return computeCallGraphProfileOrder();
+

if (Config->SymbolOrderingFile.empty())
  return SectionOrder;

Index: ELF/Options.td

--- ELF/Options.td

+++ ELF/Options.td
@@ -66,6 +66,9 @@

"Only set DT_NEEDED for shared libraries if used",
"Always set DT_NEEDED for shared libraries">;

+defm call_graph_ordering_file: Eq<"call-graph-ordering-file">,
+ HelpText<"Layout sections to optimize the given callgraph">;
+
// -chroot doesn't have a help text because it is an internal option.
defm chroot: Eq<"chroot">;

Index: ELF/Driver.cpp

--- ELF/Driver.cpp

+++ ELF/Driver.cpp
@@ -571,6 +571,31 @@

return {BuildIdKind::None, {}};

}

+static void readCallGraph(MemoryBufferRef MB) {
+ // Build a map from symbol name to section
+ DenseMap<StringRef, InputSectionBase *> SymbolSection;
+ for (InputFile *File : ObjectFiles)
+ for (Symbol *Sym : File->getSymbols())
+ if (auto *D = dyn_cast<Defined>(Sym))
+ if (auto *IS = dyn_cast_or_null<InputSectionBase>(D->Section))
+ SymbolSection[D->getName()] = IS;
+
+ std::vector<StringRef> Lines = args::getLines(MB);
+ for (StringRef L : Lines) {
+ SmallVector<StringRef, 3> Fields;
+ L.split(Fields, ' ');
+ if (Fields.size() != 3)
+ fatal("parse error");
+ uint64_t Count;
+ if (!to_integer(Fields[2], Count))
+ fatal("parse error");
+ InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
+ InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
+ if (FromSec && ToSec)
+ Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] += Count;
+ }
+}
+
static bool getCompressDebugSections(opt::InputArgList &Args) {

StringRef S = Args.getLastArgValue(OPT_compress_debug_sections, "none");
if (S == "none")

@@ -1124,6 +1149,10 @@

// Apply symbol renames for -wrap.
Symtab->applySymbolWrap();

+ if (auto *Arg = Args.getLastArg(OPT_call_graph_ordering_file))
+ if (Optional<MemoryBufferRef> Buffer = readFile(Arg->getValue()))
+ readCallGraph(*Buffer);
+

// Now that we have a complete list of input files.
// Beyond this point, no new files are added.
// Aggregate all input sections into one place.

Index: ELF/Config.h

--- ELF/Config.h

+++ ELF/Config.h
@@ -24,6 +24,7 @@
namespace elf {

class InputFile;
+class InputSectionBase;

enum ELFKind {

ELFNoneKind,

@@ -103,6 +104,9 @@

std::vector<SymbolVersion> VersionScriptGlobals;
std::vector<SymbolVersion> VersionScriptLocals;
std::vector<uint8_t> BuildIdVector;

+ llvm::MapVector<std::pair<const InputSectionBase *, const InputSectionBase *>,
+ uint64_t>
+ CallGraphProfile;

bool AllowMultipleDefinition;
bool AndroidPackDynRelocs = false;
bool ARMHasBlx = false;

Index: ELF/CallGraphSort.h

--- /dev/null

+++ ELF/CallGraphSort.h
@@ -0,0 +1,23 @@
+//===- CallGraphSort.h ------------------------------------------*- C++ -*-===//
+//
+//                             The LLVM Linker
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLD_ELF_CALL_GRAPH_SORT_H
+#define LLD_ELF_CALL_GRAPH_SORT_H
+
+#include "llvm/ADT/DenseMap.h"
+
+namespace lld {
+namespace elf {
+class InputSectionBase;
+
+llvm::DenseMap<const InputSectionBase *, int> computeCallGraphProfileOrder();
+} // namespace elf
+} // namespace lld
+
+#endif

Index: ELF/CallGraphSort.cpp

--- /dev/null

+++ ELF/CallGraphSort.cpp
@@ -0,0 +1,350 @@
+//===- CallGraphSort.cpp --------------------------------------------------===//
+//
+//                             The LLVM Linker
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+///
+/// Implementation of Call-Chain Clustering from: Optimizing Function Placement
+/// for Large-Scale Data-Center Applications
+/// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+///
+/// The goal of this algorithm is to improve runtime performance of the final
+/// executable by arranging code sections such that page table and i-cache
+/// misses are minimized.
+///
+/// Definitions:
+/// * Cluster
+///   * An ordered list of input sections which are laid out as a unit. At the
+///     beginning of the algorithm each input section has its own cluster and
+///     the weight of the cluster is the sum of the weight of all incoming
+///     edges.
+/// * Call-Chain Clustering (C³) Heuristic
+///   * Defines when and how clusters are combined. Pick the highest weighted
+///     input section then add it to its most likely predecessor if it wouldn't
+///     penalize it too much.
+/// * Density
+///   * The weight of the cluster divided by the size of the cluster. This is a
+///     proxy for the amount of execution time spent per byte of the cluster.
+///
+/// It does so given a call graph profile by the following:
+/// * Build a weighted call graph from the profile
+/// * Sort input sections by weight
+/// * For each input section starting with the highest weight
+///   * Find its most likely predecessor cluster
+///   * Check if the combined cluster would be too large, or would have too low
+///     a density.
+///   * If not, then combine the clusters.
+/// * Sort non-empty clusters by density
+///
+//===----------------------------------------------------------------------===//
+
+#include "CallGraphSort.h"
+#include "OutputSections.h"
+#include "SymbolTable.h"
+#include "Symbols.h"
+
+#include "llvm/Support/MathExtras.h"
+
+using namespace llvm;
+using namespace lld;
+using namespace lld::elf;
+
+namespace {
+using ClusterIndex = std::ptrdiff_t;
+using SectionIndex = std::ptrdiff_t;
+using EdgeIndex = std::ptrdiff_t;
+
+
+// Used for calculating and comparing density. Use soft-float for determinism.
+struct Double : APFloat {
+ Double() : APFloat(APFloat::IEEEdouble(), 0) {}
+ Double(uint64_t Val) : APFloat(APFloat::IEEEdouble(), Val) {}
+ Double(APFloat A) : APFloat(A) {}
+ bool operator>(const Double Other) const {
+ return compare(Other) == cmpGreaterThan;
+ }
+ bool operator<(const Double Other) const {
+ return compare(Other) == cmpLessThan;
+ }
+};
+
+struct Cluster {
+ Cluster(SectionIndex Sec, const InputSectionBase *IS);
+
+ Double getDensity() const {
+ if (Size == 0)
+ return 0;
+ return Double(Weight) / Double(Size);
+ }
+
+ std::vector<const InputSectionBase *> ISBs;
+ std::vector<SectionIndex> Sections;
+ size_t Size = 0;
+ uint64_t Weight = 0;
+};
+
+struct Section {
+ Section(const InputSectionBase *IS) : ISB(IS) { Size = ISB->getSize(); }
+
+ Double getDensity() const {
+ if (Size == 0)
+ return 0;
+ return Double(Weight) / Double(Size);
+ }
+
+ size_t Size = 0;
+ uint64_t Weight = 0;
+ const InputSectionBase *ISB;
+ std::vector<SectionIndex> Preds;
+ std::vector<SectionIndex> Succs;
+};
+
+struct Edge {
+ SectionIndex From;
+ SectionIndex To;
+ uint64_t Weight;
+ Double NormalizedWeight;
+
+ bool operator==(const Edge Other) const;
+};
+
+struct EdgeDenseMapInfo {
+ static Edge getEmptyKey() {
+ return {DenseMapInfo<SectionIndex>::getEmptyKey(),
+ DenseMapInfo<SectionIndex>::getEmptyKey(), 0, 0};
+ }
+ static Edge getTombstoneKey() {
+ return {DenseMapInfo<SectionIndex>::getTombstoneKey(),
+ DenseMapInfo<SectionIndex>::getTombstoneKey(), 0, 0};
+ }
+ static unsigned getHashValue(const Edge &Val) {
+ return hash_combine(DenseMapInfo<SectionIndex>::getHashValue(Val.From),
+ DenseMapInfo<SectionIndex>::getHashValue(Val.To));
+ }
+ static bool isEqual(const Edge &LHS, const Edge &RHS) { return LHS == RHS; }
+};
+
+class CallGraphSort {
+public:
+ CallGraphSort();
+
+ DenseMap<const InputSectionBase *, int> run();
+
+private:
+ DenseMap<Edge, EdgeIndex, EdgeDenseMapInfo> EdgeMap;
+ std::vector<Cluster> Clusters;
+ std::vector<Edge> Edges;
+ std::vector<Section> Sections;
+
+ void normalizeEdgeWeights();
+ void generateClusters();
+};
+
+// Maximum amount the combined cluster density can be worse than the original
+// cluster to consider merging.
+constexpr int MAX_DENSITY_DEGRADATION = 8;
+
+// Maximum cluster size in bytes.
+constexpr uint64_t MAX_CLUSTER_SIZE = 1024 * 1024;
+} // end anonymous namespace
+
+Cluster::Cluster(SectionIndex Sec, const InputSectionBase *IS) {
+ ISBs.push_back(IS);
+ Sections.push_back(Sec);
+ Size = IS->getSize();
+}
+
+bool Edge::operator==(const Edge Other) const {
+ return From == Other.From && To == Other.To;
+}
+
+// Take the edge list in Config->CallGraphProfile, resolve symbol names to
+// Symbols, and generate a graph between InputSections with the provided
+// weights.
+CallGraphSort::CallGraphSort() {
+ MapVector<std::pair<const InputSectionBase *, const InputSectionBase *>,
+ uint64_t> &Profile = Config->CallGraphProfile;
+ DenseMap<const InputSectionBase *, SectionIndex> SecToSec;
+
+ auto GetOrCreateNode = [&](const InputSectionBase *IS) -> SectionIndex {
+ auto Res = SecToSec.insert(std::make_pair(IS, Sections.size()));
+ if (Res.second)
+ Sections.emplace_back(IS);
+ return Res.first->second;
+ };
+
+ // Create the graph.
+ for (const auto &C : Profile) {
+ const InputSectionBase *FromSB = C.first.first;
+ const InputSectionBase *ToSB = C.first.second;
+ uint64_t Weight = C.second;
+
+ if (Weight == 0)
+ continue;
+
+ // Ignore edges between input sections belonging to different output
+ // sections. This is done because otherwise we would end up with clusters
+ // containing input sections that can't actually be placed adjacently in the
+ // output. This messes with the cluster size and density calculations. We
+ // would also end up moving input sections in other output sections without
+ // moving them closer to what calls them.
+ if (FromSB->getOutputSection() != ToSB->getOutputSection())
+ continue;
+
+ SectionIndex From = GetOrCreateNode(FromSB);
+ SectionIndex To = GetOrCreateNode(ToSB);
+
+ Sections[To].Weight = SaturatingAdd(Sections[To].Weight, Weight);
+
+ if (From == To)
+ continue;
+
+ Edge E{From, To, Weight};
+
+ // Add or increment an edge
+ auto Res = EdgeMap.insert(std::make_pair(E, Edges.size()));
+ EdgeIndex EI = Res.first->second;
+ if (Res.second) {
+ Edges.push_back(E);
+ Sections[From].Succs.push_back(To);
+ Sections[To].Preds.push_back(From);
+ } else
+ Edges[EI].Weight = SaturatingAdd(Edges[EI].Weight, Weight);
+ }
+ normalizeEdgeWeights();
+}
+
+// Normalize the edge weights so that we can reject edges which have a low
+// probability.
+void CallGraphSort::normalizeEdgeWeights() {
+ for (SectionIndex SI = 0, SE = Sections.size(); SI != SE; ++SI) {
+ Section &S = Sections[SI];
+ for (SectionIndex PI : S.Preds) {
+ Edge &E = Edges[EdgeMap[{PI, SI, 0, 0}]];
+ if (S.Weight == 0)
+ continue;
+ E.NormalizedWeight = Double(E.Weight) / Double(S.Weight);
+ }
+ }
+}
+
+// It's bad to merge clusters which would degrade the density too much.
+static bool isNewDensityBad(Cluster &A, Cluster &B) {
+ Double NewDensity = Double(A.Weight + B.Weight) / Double(A.Size + B.Size);
+ if (A.getDensity() > NewDensity * Double(MAX_DENSITY_DEGRADATION))
+ return true;
+ return false;
+}
+
+static void mergeClusters(Cluster &Into, Cluster &From) {
+ Into.ISBs.insert(Into.ISBs.end(), From.ISBs.begin(), From.ISBs.end());
+ Into.Sections.insert(Into.Sections.end(), From.Sections.begin(),
+ From.Sections.end());
+ Into.Size += From.Size;
+ Into.Weight += From.Weight;
+ From.ISBs.clear();
+ From.Sections.clear();
+ From.Size = 0;
+ From.Weight = 0;
+}
+
+// Group InputSections into clusters using the Call-Chain Clustering heuristic
+// then sort the clusters by density.
+void CallGraphSort::generateClusters() {
+ // Minimum edge probability to consider merging.
+ const Double MIN_EDGE_PROBABILITY = Double(1) / Double(10);
+
+ std::vector<SectionIndex> SortedSecs;
+ std::vector<Cluster *> SecToCluster(Sections.size());
+
+ Clusters.reserve(Sections.size());
+
+ for (SectionIndex SI = 0, SE = Sections.size(); SI != SE; ++SI) {
+ Cluster C(SI, Sections[SI].ISB);
+ C.Size = Sections[SI].Size;
+ C.Weight = Sections[SI].Weight;
+ Clusters.push_back(C);
+ SortedSecs.push_back(SI);
+ }
+
+ for (Cluster &C : Clusters) {
+ SecToCluster[C.Sections.front()] = &C;
+ }
+
+ std::stable_sort(SortedSecs.begin(), SortedSecs.end(),
+ [&](SectionIndex A, SectionIndex B) {
+ return Sections[A].getDensity() > Sections[B].getDensity();
+ });
+
+ for (SectionIndex SI : SortedSecs) {
+ Cluster &C = *SecToCluster[SI];
+
+ SectionIndex BestPred = -1;
+ Double BestWeight = 0;
+
+ for (SectionIndex PI : Sections[SI].Preds) {
+ Edge &E = Edges[EdgeMap[{PI, SI, 0, 0}]];
+ if (BestPred == -1 || E.NormalizedWeight > BestWeight) {
+ BestPred = PI;
+ BestWeight = E.NormalizedWeight;
+ }
+ }
+
+ if (BestWeight < MIN_EDGE_PROBABILITY)
+ continue;
+
+ Cluster *PredC = SecToCluster[BestPred];
+ if (PredC == nullptr || PredC == &C)
+ continue;
+
+ if (C.Size + PredC->Size > MAX_CLUSTER_SIZE)
+ continue;
+
+ if (isNewDensityBad(*PredC, C))
+ continue;
+
+ for (SectionIndex SI : C.Sections)
+ SecToCluster[SI] = PredC;
+
+ mergeClusters(*PredC, C);
+ }
+
+ // Remove empty or dead nodes.
+ Clusters.erase(std::remove_if(Clusters.begin(), Clusters.end(),
+ [](const Cluster &C) {
+ return C.Size == 0 || C.Sections.empty();
+ }),
+ Clusters.end());
+
+ // Sort by density. Invalidates all NodeIndexs.
+ std::sort(Clusters.begin(), Clusters.end(),
+ [](const Cluster &A, const Cluster &B) {
+ return A.getDensity() > B.getDensity();
+ });
+}
+
+DenseMap<const InputSectionBase *, int> CallGraphSort::run() {
+ generateClusters();
+
+ // Generate order.
+ llvm::DenseMap<const InputSectionBase *, int> OrderMap;
+ ssize_t CurOrder = 1;
+
+ for (const Cluster &C : Clusters)
+ for (const InputSectionBase *IS : C.ISBs)
+ OrderMap[IS] = CurOrder++;
+
+ return OrderMap;
+}
+
+// Sort sections by the profile data provided by -callgraph-profile-file
+//
+// This first builds a call graph based on the profile data then merges
+// sections according to the C³ heuristic. All clusters are then sorted by a
+// density metric to further improve locality.
+DenseMap<const InputSectionBase *, int> elf::computeCallGraphProfileOrder() {
+ return CallGraphSort().run();
+}

Index: ELF/CMakeLists.txt
===================================================================
--- ELF/CMakeLists.txt
+++ ELF/CMakeLists.txt
@@ -19,6 +19,7 @@
   Arch/SPARCV9.cpp
   Arch/X86.cpp
   Arch/X86_64.cpp
+  CallGraphSort.cpp
   Driver.cpp
   DriverUtils.cpp
   EhFrame.cpp
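Stripped of LLD's data structures, the C³ pass in this patch can be sketched as a standalone program. This is a simplified model, not the patch itself: plain `double` stands in for `APFloat`, the output-section filtering is omitted, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

struct Cluster {
  std::vector<int> Sections; // Input sections, in layout order.
  uint64_t Size = 0;
  uint64_t Weight = 0;
  double density() const { return Size ? double(Weight) / double(Size) : 0; }
};

struct Sec {
  uint64_t Size;
  uint64_t Weight; // Sum of incoming edge weights.
  std::vector<std::pair<int, uint64_t>> Preds; // (pred section, edge weight)
};

// Merge sections into clusters with the C3 heuristic, then order the
// clusters by density. Returns section indices in final layout order.
std::vector<int> c3Layout(const std::vector<Sec> &Secs,
                          uint64_t MaxClusterSize, int MaxDensityDegradation) {
  std::vector<Cluster> Clusters(Secs.size());
  std::vector<int> Leader(Secs.size()); // section -> owning cluster index
  for (int I = 0; I != (int)Secs.size(); ++I) {
    Clusters[I] = {{I}, Secs[I].Size, Secs[I].Weight};
    Leader[I] = I;
  }

  // Visit sections from hottest (by density) to coldest.
  std::vector<int> Order(Secs.size());
  std::iota(Order.begin(), Order.end(), 0);
  std::stable_sort(Order.begin(), Order.end(), [&](int A, int B) {
    double DA = Secs[A].Size ? double(Secs[A].Weight) / Secs[A].Size : 0;
    double DB = Secs[B].Size ? double(Secs[B].Weight) / Secs[B].Size : 0;
    return DA > DB;
  });

  for (int SI : Order) {
    // Find the most likely predecessor (highest incoming edge weight).
    int Best = -1;
    uint64_t BestW = 0;
    for (auto [Pred, W] : Secs[SI].Preds)
      if (W > BestW) { Best = Pred; BestW = W; }
    // Reject cold edges: the best caller must supply >= 1/10 of our weight.
    if (Best == -1 || BestW * 10 < Secs[SI].Weight)
      continue;
    int From = Leader[SI], To = Leader[Best];
    Cluster &C = Clusters[From], &P = Clusters[To];
    if (From == To || C.Size + P.Size > MaxClusterSize)
      continue;
    double NewDensity = double(C.Weight + P.Weight) / double(C.Size + P.Size);
    if (P.density() > NewDensity * MaxDensityDegradation)
      continue; // Merging would dilute the predecessor too much.
    // Append C after its predecessor cluster, preserving both orders.
    for (int S : C.Sections)
      Leader[S] = To;
    P.Sections.insert(P.Sections.end(), C.Sections.begin(), C.Sections.end());
    P.Size += C.Size;
    P.Weight += C.Weight;
    C = Cluster{}; // Empty clusters contribute nothing to the layout.
  }

  std::stable_sort(Clusters.begin(), Clusters.end(),
                   [](const Cluster &A, const Cluster &B) {
                     return A.density() > B.density();
                   });
  std::vector<int> Layout;
  for (const Cluster &C : Clusters)
    for (int S : C.Sections)
      Layout.push_back(S);
  return Layout;
}
```

Like the patch, this is fast-find/slow-union: `Leader` makes lookup O(1), while a merge relabels every moved section.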


+static void readCallGraph(MemoryBufferRef MB) {
+ // Build a map from symbol name to section
+ DenseMap<StringRef, InputSectionBase *> SymbolSection;
+ for (InputFile *File : ObjectFiles)
+ for (Symbol *Sym : File->getSymbols())
+ if (auto *D = dyn_cast<Defined>(Sym))
+ if (auto *IS = dyn_cast_or_null<InputSectionBase>(D->Section))
+ SymbolSection[D->getName()] = IS;
+
+ std::vector<StringRef> Lines = args::getLines(MB);
+ for (StringRef L : Lines) {
+ SmallVector<StringRef, 3> Fields;
+ L.split(Fields, ' ');
+ if (Fields.size() != 3)
+ fatal("parse error");
+ uint64_t Count;
+ if (!to_integer(Fields[2], Count))
+ fatal("parse error");
+ InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
+ InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
+ if (FromSec && ToSec)
+ Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] += Count;

Should this be using FromSec->Repl, ToSec->Repl to handle ICF? If so
please update the code and add a new test (the one in the patch is
getting big).

+ }
+}
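As `readCallGraph` above shows, the profile is plain text: one edge per line, caller symbol, callee symbol, and a sample count separated by single spaces. A hypothetical three-edge profile (the mangled names are made up):

```shell
cat > callgraph.txt <<'EOF'
_Z3foov _Z3barv 5000
_Z3barv _Z3bazv 370
main _Z3foov 120
EOF
```

Lines with unresolvable symbols are silently skipped by the lookup; malformed lines (not exactly three fields, or a non-integer count) are a fatal parse error.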

With --symbol-ordering-file we produce warnings if, for example, the
symbol is found to be absolute. The same issue applies here, no? Can you
refactor the code so that we print the same warnings for
--call-graph-ordering-file?

+using ClusterIndex = std::ptrdiff_t;
+using SectionIndex = std::ptrdiff_t;
+using EdgeIndex = std::ptrdiff_t;

Does this mean that we use 64-bit integers for indexes? Why?

+// Used for calculating an comparing density. Use soft-float for determinism.
+struct Double : APFloat {

This used to be a problem with x87, but do we support a system today
where this is a problem?

Cheers,
Rafael

Bigcheese updated this revision to Diff 139535.Mar 22 2018, 5:09 PM

Apply review comments.

After discussing the floating point issue with people and checking compilers, I believe this specific use is safe as there is no opportunity to form FMAs and it only uses division and comparison.

espindola added inline comments.Mar 28 2018, 8:16 AM
ELF/CallGraphSort.cpp
247

please git-clang-format the patch.

espindola added inline comments.Mar 28 2018, 9:38 AM
ELF/CallGraphSort.cpp
159

This warning is duplicated with what we have in Writer.cpp for --symbol-ordering-file. Could it be merged?

278

Delete extra empty lines.

Bigcheese updated this revision to Diff 140132.Mar 28 2018, 1:56 PM

Address review comments.

hintonda removed a subscriber: hintonda.Mar 28 2018, 5:47 PM
espindola added inline comments.Mar 28 2018, 10:48 PM
ELF/CallGraphSort.cpp
57

dead code.

58

Why do you need a 64-bit index? Why does it need a typedef?

espindola added inline comments.Mar 28 2018, 11:05 PM
ELF/CallGraphSort.cpp
150

const Edge &Other

ELF/Config.h
27

dead declaration.

ELF/Driver.cpp
600

This should now be just =, no?

Bigcheese marked 3 inline comments as done.Mar 29 2018, 11:35 AM
Bigcheese added inline comments.
ELF/CallGraphSort.cpp
58

While ELF limits the number of sections/symbols a single ELF file can hold, there's no limit on the number of input sections that go into the final link. If we have multiple ELF files with 4 billion sections each, this could overflow. The chances of this are rare, but there's almost no cost in supporting it. The correct default type for array indexing is std::ptrdiff_t; you need a good reason to use something else.

As for the typedef, it makes it clearer what the purpose of the index is. Otherwise they would all have the same type. If C++ had a good way to do strong typedefs I would be using that here to add compile time enforcement.

ELF/Driver.cpp
600

It should be a saturating add. An input file can legitimately have multiple edges, and they should be merged.

Address review comments.

espindola added inline comments.Mar 29 2018, 5:55 PM
ELF/CallGraphSort.cpp
58

The reason is that nothing in lld does this.

Just use int/unsigned as appropriate. We do not design for an abstract pie in the sky.

Change this.

Bigcheese updated this revision to Diff 140713.Apr 2 2018, 5:15 PM

Address review comments.

espindola added inline comments.Apr 2 2018, 6:05 PM
ELF/CallGraphSort.cpp
57

Delete these typedefs and just use the proper type directly.

149

Define this inline.

ELF/Driver.cpp
602

This means that we support having duplicated symbol edges. Is there a reason for that? If not please change this to report an error on duplicated edges.

espindola added inline comments.Apr 2 2018, 6:19 PM
ELF/CallGraphSort.cpp
202

This is using SaturatingAdd on a uint64_t. That also seems like over engineering.

Each unit of edge weight comes from a sample during execution. Even with a compact format using a single byte per sample we would need 16 exbibytes for it to overflow. Where would such a sample have been stored?
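For context, `llvm::SaturatingAdd` clamps at the maximum value instead of wrapping. A minimal stand-in for that behavior (not LLVM's implementation):

```cpp
#include <cstdint>
#include <limits>

// Add two counters, clamping at UINT64_MAX instead of wrapping on overflow.
uint64_t saturatingAdd(uint64_t A, uint64_t B) {
  uint64_t R = A + B; // Unsigned overflow is well-defined: it wraps.
  return R < A ? std::numeric_limits<uint64_t>::max() : R;
}
```

The wrap check `R < A` works because an unsigned sum that overflowed is always smaller than either operand.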

espindola added inline comments.Apr 2 2018, 6:28 PM
ELF/CallGraphSort.cpp
149

Actually, it is only called once. Inline it.

espindola added inline comments.Apr 2 2018, 6:47 PM
ELF/Driver.cpp
583–584

This is marked done, but it is not.

Bigcheese added inline comments.Apr 5 2018, 3:52 PM
ELF/CallGraphSort.cpp
57

That would significantly reduce code clarity. It's not uncommon in LLVM to do this. LLD even has a case of this with Relocations.h RelType.

ELF/Driver.cpp
602

This makes it easier to create input. For instance you can cat together profiles from different runs.
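Because duplicate edges accumulate rather than error, profiles from separate runs merge by plain concatenation before linking (file and symbol names hypothetical; the linker invocation is omitted since the flag is still under review here):

```shell
printf '_Z3foov _Z3barv 40\n' > run1.prof
printf '_Z3foov _Z3barv 2\n' > run2.prof
# The edge appears twice in merged.prof; lld sums the counts at load time.
cat run1.prof run2.prof > merged.prof
```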

Bigcheese updated this revision to Diff 141236.Apr 5 2018, 3:53 PM

Address review comments.

espindola added inline comments.Apr 5 2018, 8:46 PM
ELF/CallGraphSort.cpp
57

RelType is the one example in lld and its use spans most of the linker, so it makes sense there. This is completely local and is just noise. Delete it.

ELF/Symbols.cpp
251

Rebase on top of r329371. You should be able to remove the File argument with that.

espindola added inline comments.Apr 6 2018, 10:53 AM
ELF/CallGraphSort.cpp
50

Dead include.

ELF/Driver.cpp
602

Fair enough. Please add a testcase showing that behavior.

Bigcheese updated this revision to Diff 141718.Apr 9 2018, 12:45 PM

Address review comments.

espindola added inline comments.Apr 10 2018, 4:08 PM
ELF/CallGraphSort.cpp
33

The part about profile should be in a followup patch. For now this is given the weighted call graph.

64

You can remove this vector.

The only real use is when computing the returned DenseMap and you can use the Sections vector there:

for (const Cluster &C : Clusters)
  for (int SecIndex : C.Sections)
    OrderMap[Sections[SecIndex].ISB] = CurOrder++;
138

Define this function inline.

264

You don't need this loop, just add

SecToCluster[SI] = &Clusters.back();

to the previous one.

298

This is a slow union find solution when there are many union operations. Maybe that doesn't impact this in practice, but at least leave a comment about a possible improvement if we ever hit it.

Bigcheese updated this revision to Diff 142093.Apr 11 2018, 4:18 PM

Address review comments.

ELF/CallGraphSort.cpp
33

This is talking about the for loop in CallGraphSort() where it builds the graph.

espindola added inline comments.Apr 12 2018, 8:38 AM
ELF/CallGraphSort.cpp
33

Building a graph from a set of edges doesn't sound like much, but it is OK to keep the comment. Just don't use the word profile as it makes it sound like lld is reading the direct output of perf or similar.

204

This is not tested. All tests pass if I replace it with an assert.

espindola added inline comments.Apr 12 2018, 9:29 AM
test/ELF/cgprofile-txt.s
9

This should be >>, no?

espindola added inline comments.Apr 12 2018, 10:16 AM
ELF/CallGraphSort.cpp
72

This structure is almost a perfect duplicate of Cluster. That is not too surprising since each cluster starts as a single section.

In the traditional description of union-find, there is only one type of element which gets combined as union operations are performed.

I was able to remove this struct and I think it makes the code quite a bit closer to a description of the union-find structure and easier to read. The patch on top of this is at https://reviews.llvm.org/D45577.

290

BTW, this is a very interesting problem.

The traditional disjoint-set/union-find uses unordered sets which are easy to represent with trees with only parent pointers.

Changing it to represent order in the sets and still be efficient is an interesting challenge (not for this patch).
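A toy sketch of the order-preserving variant the patch effectively uses: fast find via a direct leader table, slow union that relabels every moved member (hypothetical names, illustrative only):

```cpp
#include <vector>

// Ordered "union" with O(1) find: every element stores its cluster leader
// directly, and each live cluster keeps its members in layout order.
// merge() is O(|moved cluster|) because all moved members are relabeled:
// the slow-union/fast-find trade-off noted in the review.
struct OrderedClusters {
  std::vector<int> Leader;               // element -> cluster leader
  std::vector<std::vector<int>> Members; // leader -> ordered members

  explicit OrderedClusters(int N) : Leader(N), Members(N) {
    for (int I = 0; I != N; ++I) {
      Leader[I] = I;
      Members[I] = {I};
    }
  }

  int find(int X) const { return Leader[X]; }

  // Append B's cluster after A's cluster, preserving both internal orders.
  void merge(int A, int B) {
    int LA = find(A), LB = find(B);
    if (LA == LB)
      return;
    for (int M : Members[LB]) {
      Leader[M] = LA;
      Members[LA].push_back(M);
    }
    Members[LB].clear();
  }
};
```

Classic parent-pointer union-find with path compression is faster per union, but it represents unordered sets; keeping layout order is what forces the explicit member lists here.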

Bigcheese updated this revision to Diff 142272.Apr 12 2018, 3:15 PM

Addressed review comments.

Bigcheese marked 47 inline comments as done.Apr 12 2018, 3:19 PM
Bigcheese added inline comments.
ELF/CallGraphSort.cpp
33

I've made it clear it's talking about the call graph profile, not whatever profile that was generated from.

espindola added inline comments.Apr 12 2018, 9:32 PM
ELF/CallGraphSort.cpp
199

Instead of storing a normalized edge value it is simpler to store the original cluster weight: https://reviews.llvm.org/D45607

258

This is not tested.

This is not in the paper, right? Do you know what the motivation is? It is interesting in that the algorithm prioritizes hot nodes, but has a limit on how cold an edge can be.

It does seem to have a small performance win, at least on the LTOing FileCheck testcase. With this patch a run takes 2.519023951 seconds; with this check removed it takes 2.520792327 (no sorting is 2.575764375).

espindola added inline comments.Apr 13 2018, 9:15 AM
ELF/CallGraphSort.cpp
71

Succs is not used, please delete.

espindola added inline comments.Apr 13 2018, 10:22 AM
ELF/Config.h
108

Using Symbol in here causes some duplication. We support duplicated edges, and we have to support two symbols pointing to the same section. This means two maps, this one for symbols and one for edges from section to section.

If this map is changed to directly map from section to section, we can avoid the second map completely:

https://reviews.llvm.org/D45630

Bigcheese updated this revision to Diff 142708.Apr 16 2018, 3:23 PM
Bigcheese marked an inline comment as done.

Address review comments.

Bigcheese marked 4 inline comments as done.Apr 16 2018, 3:25 PM
espindola accepted this revision.Apr 16 2018, 5:45 PM

LGTM with a few last requests.

ELF/CallGraphSort.cpp
181

Since this is looking at a section we haven't merged to a predecessor yet this can be just

Cluster &C = Clusters[SI];

186

This can be just C.Preds.

193

I don't think you need this if. If there are no predecessors, BestWeight will be 0 and so the following if will continue since InitialWeight is unsigned.

207

All tests pass if I delete this if. Please add a test.

213

All tests pass if I delete this loop. Please add a test.

225

NodeIndex is not used elsewhere in the patch. What this (and the previous remove_if) invalidate are the cluster to cluster edges.

This revision was not accepted when it landed; it landed in state Needs Review.Apr 17 2018, 4:33 PM
This revision was automatically updated to reflect the committed changes.