[lld-macho] Speed up markLive()

From the trace report, it's one of the most substantial pieces (only after load-input and write-output), and it seems like an easy win here.

Changes:
- Parallelize scanning of the symtab and sections.
- Also add an additional time scope.

Stats for 10 runs (on Mac, 2.4 GHz 8-Core Intel Core i9 | Memory: 32 GB 2667 MHz DDR4):

    N    Min    Max    Median  Avg     Stddev
x   10   19.77  23.35  21.51   21.569  1.1866053
+   10   19.9   20.09  20      19.999  0.062795966

Difference at 95.0% confidence:
  -1.57 +/- 0.789477
  -7.27897% +/- 3.66024%
  (Student's t, pooled s = 0.840231)

Differential Revision: https://reviews.llvm.org/D110018
Why do we need to define our own copy constructor? Isn't the compiler doing that for us already?
Shouldn't this method hold the mutex as well?
I think we might have a race condition here. If two different symbols pointing to the same isec and off happen to enter this function, we could end up writing the same value twice to worklist. It feels like the lock should be taken at the enqueue() scope rather than around the list here.
If we do that, I believe we can also remove all the locks on the isLive and markLive methods.
Because the mutex makes it non-trivially copyable (i.e., the default copy ctor won't work).
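To illustrate the point, here is a minimal standalone sketch (not the actual lld class; `LiveSection` and its fields are made-up names): a `std::mutex` member deletes the implicitly generated copy operations, so a copy constructor that copies the data but gives the new object a fresh mutex has to be written by hand.

```cpp
#include <mutex>
#include <type_traits>

// Hypothetical stand-in for an input-section class that gained a mutex.
struct LiveSection {
  bool live = false;
  std::mutex m; // non-copyable: deletes the implicit copy ctor/assignment

  LiveSection() = default;
  // Hand-written copy ctor: copy the data, default-construct a fresh mutex.
  LiveSection(const LiveSection &other) : live(other.live) {}
};

// The mutex also makes the type non-trivially copyable:
static_assert(!std::is_trivially_copyable_v<LiveSection>);
```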
yes - missed this one
No, if the lock were to be put on the whole enqueue() then the whole thing would essentially be single-threaded (because all different sections would have to wait for the same lock).
We should minimise the locked region here.
I'm a bit ambivalent about this change. The typical way that parallel mark-sweep is implemented is via a work-stealing queue. That way we have one lock per thread, instead of all threads contending on one lock. Granted this diff is already an improvement, but perhaps it would be best to implement the optimal solution right off the bat...
why this change?
I think this lock isn't necessary... multiple concurrent calls to this method will still end up setting live to true
atomics might perform better here, assuming low contention
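A minimal sketch of that suggestion (the names `InputSection`, `worklist`, and `enqueue` follow the discussion above, but the types are simplified stand-ins, not lld's actual code): an atomic exchange on the live flag decides which thread enqueues, so the mutex only guards the brief push onto the shared worklist.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Simplified stand-in for lld's input section.
struct InputSection {
  std::atomic<bool> live{false};
};

std::vector<InputSection *> worklist;
std::mutex worklistMutex;

// exchange() returns the previous value: if the section was already live,
// another thread has (or will have) enqueued it, so we bail out. Only the
// append to the shared worklist is under the lock.
void enqueue(InputSection *isec) {
  if (isec->live.exchange(true))
    return;
  std::lock_guard<std::mutex> lock(worklistMutex);
  worklist.push_back(isec);
}
```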
I was hoping for an easy fix here rather than an overhaul. But yeah, I guess it doesn't make a lot of sense to leave it half-done ...
Taking this off review queue to implement a proper concurrent mark-sweep
ConcatInputSection is no longer trivially copy-able.
Done - sorry, it's hard to switch back and forth between different naming styles.
Also, that was pretty fast... I guess you've implemented something similar before?
The queue is mostly from existing code :)
The mark-and-sweep is based on code I wrote ~3 years ago ... (which is to say, there might be bugs ...)
hardware_concurrency() seems like a good initializer
hm, this seems more like a Worker or Executor to me, each of which executes a number of tasks or jobs...
if start and end denoted half-open (instead of closed) intervals, would we still need skip?
also, how do you feel about creating an array of ArrayRefs here, instead of working with raw indices? That uses a few extra ints, but given that the number of threads in the pool is small, that shouldn't be an issue.
would prefer a more descriptive name... run / execute?
I'm kind of suspicious of this. It's marking sections as live based on the liveness of other sections, which are concurrently being marked as live. Depending on the order of writes, it may incorrectly conclude that a certain live section X is not live, and therefore not mark the sections that X points to as live.
I think the "no" makes it clearer :)
work is usually a noun, not a verb... maybe this could also be run() (or have run above take a default argument)
can we have a more descriptive name, like groupSize?
I'm kind of confused here. Why are we comparing start to idx -- isn't start an index into the jobs array (of size size), whereas idx is an index into the Task vector (of size poolSize)? It seems like we're comparing numbers on different scales...
same thing for idx and size
the rest of the codebase doesn't use the _ suffix; I think we can just use capacity here and have the getter be getCapacity()
Good point! (kind of surprised not a lot of tests were failing ...)
"idx" is the index of the Task vector, yes but it correlates to where its work items are:
So the division of labour is like this:
Task[0]: start = 0,     end = start + d - 1 = d - 1
Task[1]: start = d,     end = d + d - 1
Task[2]: start = d + d, end = d + d + d - 1
...
Task[n]: start = n * d, end = start + d - 1
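The division of labour above can be sketched directly (closed intervals, as in the thread; `taskRange` and its parameter names are illustrative, with `d` as the group size):

```cpp
#include <cstddef>
#include <utility>

// For task index n and group size d, return the closed interval
// [start, end] of job indices that task owns, per the scheme above.
// Note: with d == 0 the unsigned subtraction start + d - 1 wraps around,
// which is the corner case discussed later in this thread.
std::pair<size_t, size_t> taskRange(size_t n, size_t d) {
  size_t start = n * d;
  size_t end = start + d - 1;
  return {start, end};
}
```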
Race conditions often don't show up on small inputs :) I would recommend running dead_strip on a large program before and after this change, and checking that the outputs are identical.
I think the easiest way to fix this is to create a mapping of sections to their live support sections (basically reversing the pointer), and then have addSect visit the live support sections.
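A sketch of that suggestion (the `Section` fields, `liveSupportMap`, and `addSect` here are made-up stand-ins; lld's actual types differ): build a reverse map from each section to the sections that live-support it, then have addSect visit those when marking a section live.

```cpp
#include <map>
#include <vector>

// Simplified stand-in: a live-support section points at the section it
// supports and should become live whenever that target is live.
struct Section {
  Section *liveSupportTarget = nullptr;
  bool live = false;
};

// Reverse the pointer: target section -> sections that live-support it.
std::map<Section *, std::vector<Section *>> liveSupportMap;

void buildLiveSupportMap(const std::vector<Section *> &sections) {
  for (Section *s : sections)
    if (s->liveSupportTarget)
      liveSupportMap[s->liveSupportTarget].push_back(s);
}

// Marking a section live now pulls in its live-support sections too.
void addSect(Section *s) {
  if (s->live)
    return;
  s->live = true;
  for (Section *supp : liveSupportMap[s])
    addSect(supp);
}
```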
Right, but... what does start < idx mean here, and when do we expect it to be true? Shouldn't it never be true, since n * d >= n for all nonnegative n and d?
Ah - I misunderstood what you were confused about.
d could be zero, which is non-negative :)
I.e., when we have fewer tasks than workers, you don't need to use all of the workers; in that case, you just need the first [0, x) workers, and give each of them 1 task.
Similarly, end >= size is true when start == d == 0 (because the type is size_t, so the -1 wraps around to a huge value, which ensures it).
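That wrap-around is easy to demonstrate in isolation (a standalone snippet with an illustrative name, not lld code):

```cpp
#include <cstddef>
#include <limits>

// With start == d == 0, end = start + d - 1 computes 0 - 1 in size_t,
// which is unsigned and wraps to SIZE_MAX -- so end >= size holds
// trivially for any size.
size_t computeEnd(size_t start, size_t d) {
  return start + d - 1;
}
```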
(but I guess you're right that this is written in a much more convoluted way than it should be ... will fix)