This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lld/COFF/
-
COFF/
2/7
DebugTypes.cpp

Differential D94555

[LLD][COFF] Avoid std::vector resizes during type merging
ClosedPublic

Authored by aganea on Jan 12 2021, 2:28 PM.

Download Raw Diff

Details

Reviewers

rnk

Commits

rGe7a371f9fd00: [LLD][COFF] Avoid std::vector resizes during type merging

Summary

As discussed in https://reviews.llvm.org/D87805#2491679 the MergedInfo::recs vector can get quite large and resizing it while remapping types creates unnecessary copies and virtual memory page faults.

This patch saves consistently approx. 0.6 sec (out of 18 sec) on a large dataset (400 MB EXE, 2 GB PDB).

With 12 hyper-threads:

Benchmark #1: before\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.939 s ±  1.215 s    [User: 2.7 ms, System: 3.5 ms]
  Range (min … max):   15.537 s … 18.597 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.298 s ±  1.511 s    [User: 1.4 ms, System: 8.9 ms]
  Range (min … max):   15.512 s … 18.513 s    10 runs

With 36 hyper-threads (thus using only one CPU socket):

Benchmark #1: before\lld-link.exe @link.rsp /threads:36
  Time (mean ± σ):     17.787 s ±  0.747 s    [User: 4.2 ms, System: 5.6 ms]
  Range (min … max):   15.666 s … 18.059 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp /threads:36
  Time (mean ± σ):     17.102 s ±  1.323 s    [User: 2.6 ms, System: 4.0 ms]
  Range (min … max):   15.175 s … 18.023 s    10 runs

With 72 hyper-threads (using two CPU sockets, slower because kernel locks now cross CPUs)

Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     18.085 s ±  0.764 s    [User: 2.7 ms, System: 3.3 ms]
  Range (min … max):   15.918 s … 18.444 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     17.453 s ±  1.147 s    [User: 2.7 ms, System: 8.7 ms]
  Range (min … max):   15.766 s … 18.246 s    10 runs

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aganea requested review of this revision.Jan 12 2021, 2:28 PM

aganea created this revision.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2021, 2:28 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

aganea edited the summary of this revision. (Show Details)Jan 12 2021, 2:31 PM

aganea added a reviewer: rnk.

aganea edited the summary of this revision. (Show Details)

Harbormaster completed remote builds in B84923: Diff 316241.Jan 12 2021, 3:08 PM

Simplify.

Harbormaster completed remote builds in B84966: Diff 316291.Jan 12 2021, 7:23 PM

lgtm

This revision is now accepted and ready to land.Jan 13 2021, 9:30 AM

Closed by commit rGe7a371f9fd00: [LLD][COFF] Avoid std::vector resizes during type merging (authored by aganea). · Explain WhyJan 13 2021, 11:35 AM

This revision was automatically updated to reflect the committed changes.

aganea added a commit: rGe7a371f9fd00: [LLD][COFF] Avoid std::vector resizes during type merging.

dblaikie added a subscriber: dblaikie.Jan 19 2021, 10:45 AM

dblaikie added inline comments.

lld/COFF/DebugTypes.cpp
632–634	I know this is a bit of a stretch - but given the somewhat non-trivial extra work required to determine the capacity, I'm wondering if this issue might be partly due to this incremental/repeated resizing. I'm guessing that maybe the resize is thwarting the usual growth function of the vectors (eg: resize brings the size up to exactly the size specified, whereas repeated push_back hits the usual growth factor to offset the next push_backs, etc). The notes here: https://en.cppreference.com/w/cpp/container/vector/reserve discuss this sort of issue with reserve, though I'm pretty sure it applies to resize as well. If it's easy to do, I'd be curious to know whether removing the fix in this patch, and instead changing this resize+memcpy to loop+push_back or... oh, reference invalidation concerns. What about recs.insert(recs.begin(), recs.end()) ? Hmm, yeah, seems the spec for std::vector doesn't allow that, so I guess the nearest code would be: for (int i = 0; i != newSize; ++i) merged.recs.push_back(merged.recs[i]);

rnk added inline comments.Jan 19 2021, 10:53 AM

lld/COFF/DebugTypes.cpp
632–634	When I chose to use resize, I remember checking some reference material to see if it would be subject to the O(n^2) behavior of using reserve. I don't remember what I looked at exactly, probably libc++ source. Whatever I looked at, I assured myself that repeated resizing isn't O(n^2). Perhaps the behavior is not required by the standard, and it varies by implementation.

dblaikie added inline comments.Jan 19 2021, 11:09 AM

lld/COFF/DebugTypes.cpp
632–634	Seems libstdc++ and libc++ at some version does indeed do the right thing: $ clang++-tot test.cpp && ./a.out initial capacity: 0 capacity after resize(500): 500 capacity after resize(501): 1000 capacity after resize(500) + 1 * push_back: 1000 $ clang++-tot test.cpp -stdlib=libc++ && ./a.out initial capacity: 0 capacity after resize(500): 500 capacity after resize(501): 1000 capacity after resize(500) + 1 * push_back: 1000 But thinking about the code, it's always doubling anyway - so it sort of implements its own growth factor (ie: that first resize(500) in my example doesn't get any extra capacity because it's much bigger than the growth factor from the baseline size of 0, but the resize(501) does because it's less than the growth factor from the current capacity). So this code probably always gets only the size requested, because that size is always about as much as the growth factor would dictate from the previous size (ie: the resize probably produces the same number of growths as the push_back). So maybe no benefit but no harm/no wins to be had by switching to push_back either, I guess. Wouldn't mind the data if it's easy to gather, but yeah, I'm less confident it'll show any benefit.

aganea added inline comments.Jan 19 2021, 12:41 PM

lld/COFF/DebugTypes.cpp
632–634	One other way to fix this would be a growing allocator backed up by reserved, but uncommited memory pages. That would ideal in this situation I think. We could reserve say 1 GB of virtual memory upfront for each container, and commit memory pages on-the-go, as the container grows. This prevents memory moves and keeps iterators stable. I'm not too sure if the `std::allocator` allows this kind of trick though, so it would have to be a custom container (or perhaps a variant of `BumpPtrAllocator`). However that comes with another downside on Windows, which is that the VAD tree is deferred to a background System then when it's too large, I've mentionned that here: https://reviews.llvm.org/D86694#2277371 but in this case it's maybe less of an issue. @dblaikie What is `clang++-tot`? What does ToT mean, I've seen this mentionned in GCC as well?

mstorsjo added a subscriber: mstorsjo.Jan 19 2021, 12:46 PM

mstorsjo added inline comments.

lld/COFF/DebugTypes.cpp
632–634	Keep in mind that LLD still is usable in 32 bit builds, where one can't reserve 1 GB of memory like that. (It's a bit limited in such builds, it can't successfully link large images, but for common usage it's still quite ok.) ToT is Top of Tree, i.e. latest git/svn version.

aganea added inline comments.Jan 19 2021, 12:53 PM

lld/COFF/DebugTypes.cpp
632–634	@mstorsjo That's a good point, thanks for raising that.

dblaikie added inline comments.Jan 19 2021, 1:49 PM

lld/COFF/DebugTypes.cpp
632–634	Oh, I wouldn't worry/don't mean to suggest going all that far - only reason I brought it up was in case the code could get the desired performance characteristics with less complexity. Sounds like probably not.

Revision Contents

Path

Size

lld/

COFF/

DebugTypes.cpp

20 lines

Diff 316470

lld/COFF/DebugTypes.cpp

Show First 20 Lines • Show All 623 Lines • ▼ Show 20 Lines	void TpiSource::mergeTypeRecord(TypeIndex curIndex, CVType ty) {
// Decide if the merged type goes into TPI or IPI.		// Decide if the merged type goes into TPI or IPI.
bool isItem = isIdRecord(ty.kind());		bool isItem = isIdRecord(ty.kind());
MergedInfo &merged = isItem ? mergedIpi : mergedTpi;		MergedInfo &merged = isItem ? mergedIpi : mergedTpi;

// Copy the type into our mutable buffer.		// Copy the type into our mutable buffer.
assert(ty.length() <= codeview::MaxRecordLength);		assert(ty.length() <= codeview::MaxRecordLength);
size_t offset = merged.recs.size();		size_t offset = merged.recs.size();
size_t newSize = alignTo(ty.length(), 4);		size_t newSize = alignTo(ty.length(), 4);
merged.recs.resize(offset + newSize);		merged.recs.resize(offset + newSize);
auto newRec = makeMutableArrayRef(&merged.recs[offset], newSize);		auto newRec = makeMutableArrayRef(&merged.recs[offset], newSize);
memcpy(newRec.data(), ty.data().data(), newSize);		memcpy(newRec.data(), ty.data().data(), newSize);
		dblaikieUnsubmitted Not Done Reply Inline Actions I know this is a bit of a stretch - but given the somewhat non-trivial extra work required to determine the capacity, I'm wondering if this issue might be partly due to this incremental/repeated resizing. I'm guessing that maybe the resize is thwarting the usual growth function of the vectors (eg: resize brings the size up to exactly the size specified, whereas repeated push_back hits the usual growth factor to offset the next push_backs, etc). The notes here: https://en.cppreference.com/w/cpp/container/vector/reserve discuss this sort of issue with reserve, though I'm pretty sure it applies to resize as well. If it's easy to do, I'd be curious to know whether removing the fix in this patch, and instead changing this resize+memcpy to loop+push_back or... oh, reference invalidation concerns. What about recs.insert(recs.begin(), recs.end()) ? Hmm, yeah, seems the spec for std::vector doesn't allow that, so I guess the nearest code would be: for (int i = 0; i != newSize; ++i) merged.recs.push_back(merged.recs[i]); dblaikie: I know this is a bit of a stretch - but given the somewhat non-trivial extra work required to…
		rnkUnsubmitted Not Done Reply Inline Actions When I chose to use resize, I remember checking some reference material to see if it would be subject to the O(n^2) behavior of using reserve. I don't remember what I looked at exactly, probably libc++ source. Whatever I looked at, I assured myself that repeated resizing isn't O(n^2). Perhaps the behavior is not required by the standard, and it varies by implementation. rnk: When I chose to use resize, I remember checking some reference material to see if it would be…
		dblaikieUnsubmitted Not Done Reply Inline Actions Seems libstdc++ and libc++ at some version does indeed do the right thing: $ clang++-tot test.cpp && ./a.out initial capacity: 0 capacity after resize(500): 500 capacity after resize(501): 1000 capacity after resize(500) + 1 * push_back: 1000 $ clang++-tot test.cpp -stdlib=libc++ && ./a.out initial capacity: 0 capacity after resize(500): 500 capacity after resize(501): 1000 capacity after resize(500) + 1 * push_back: 1000 But thinking about the code, it's always doubling anyway - so it sort of implements its own growth factor (ie: that first resize(500) in my example doesn't get any extra capacity because it's much bigger than the growth factor from the baseline size of 0, but the resize(501) does because it's less than the growth factor from the current capacity). So this code probably always gets only the size requested, because that size is always about as much as the growth factor would dictate from the previous size (ie: the resize probably produces the same number of growths as the push_back). So maybe no benefit but no harm/no wins to be had by switching to push_back either, I guess. Wouldn't mind the data if it's easy to gather, but yeah, I'm less confident it'll show any benefit. dblaikie: Seems libstdc++ and libc++ at some version does indeed do the right thing: ``` $ clang++-tot…
		aganeaAuthorUnsubmitted Done Reply Inline Actions One other way to fix this would be a growing allocator backed up by reserved, but uncommited memory pages. That would ideal in this situation I think. We could reserve say 1 GB of virtual memory upfront for each container, and commit memory pages on-the-go, as the container grows. This prevents memory moves and keeps iterators stable. I'm not too sure if the `std::allocator` allows this kind of trick though, so it would have to be a custom container (or perhaps a variant of `BumpPtrAllocator`). However that comes with another downside on Windows, which is that the VAD tree is deferred to a background System then when it's too large, I've mentionned that here: https://reviews.llvm.org/D86694#2277371 but in this case it's maybe less of an issue. @dblaikie What is `clang++-tot`? What does ToT mean, I've seen this mentionned in GCC as well? aganea: One other way to fix this would be a growing allocator backed up by reserved, but uncommited…
		mstorsjoUnsubmitted Not Done Reply Inline Actions Keep in mind that LLD still is usable in 32 bit builds, where one can't reserve 1 GB of memory like that. (It's a bit limited in such builds, it can't successfully link large images, but for common usage it's still quite ok.) ToT is Top of Tree, i.e. latest git/svn version. mstorsjo: Keep in mind that LLD still is usable in 32 bit builds, where one can't reserve 1 GB of memory…
		aganeaAuthorUnsubmitted Done Reply Inline Actions @mstorsjo That's a good point, thanks for raising that. aganea: @mstorsjo That's a good point, thanks for raising that.
		dblaikieUnsubmitted Not Done Reply Inline Actions Oh, I wouldn't worry/don't mean to suggest going all that far - only reason I brought it up was in case the code could get the desired performance characteristics with less complexity. Sounds like probably not. dblaikie: Oh, I wouldn't worry/don't mean to suggest going all that far - only reason I brought it up was…

// Fix up the record prefix and padding bytes if it required resizing.		// Fix up the record prefix and padding bytes if it required resizing.
if (newSize != ty.length()) {		if (newSize != ty.length()) {
reinterpret_cast<RecordPrefix *>(newRec.data())->RecordLen = newSize - 2;		reinterpret_cast<RecordPrefix *>(newRec.data())->RecordLen = newSize - 2;
for (size_t i = ty.length(); i < newSize; ++i)		for (size_t i = ty.length(); i < newSize; ++i)
newRec[i] = LF_PAD0 + (newSize - i);		newRec[i] = LF_PAD0 + (newSize - i);
}		}

Show All 31 Lines	void TpiSource::mergeUniqueTypeRecords(ArrayRef<uint8_t> typeRecords,
else		else
llvm::sort(uniqueTypes);		llvm::sort(uniqueTypes);

// Accumulate all the unique types into one buffer in mergedTypes.		// Accumulate all the unique types into one buffer in mergedTypes.
uint32_t ghashIndex = 0;		uint32_t ghashIndex = 0;
auto nextUniqueIndex = uniqueTypes.begin();		auto nextUniqueIndex = uniqueTypes.begin();
assert(mergedTpi.recs.empty());		assert(mergedTpi.recs.empty());
assert(mergedIpi.recs.empty());		assert(mergedIpi.recs.empty());

		// Pre-compute the number of elements in advance to avoid std::vector resizes.
		unsigned nbTpiRecs = 0;
		unsigned nbIpiRecs = 0;
		forEachTypeChecked(typeRecords, [&](const CVType &ty) {
		if (nextUniqueIndex != uniqueTypes.end() &&
		*nextUniqueIndex == ghashIndex) {
		assert(ty.length() <= codeview::MaxRecordLength);
		size_t newSize = alignTo(ty.length(), 4);
		(isIdRecord(ty.kind()) ? nbIpiRecs : nbTpiRecs) += newSize;
		++nextUniqueIndex;
		}
		++ghashIndex;
		});
		mergedTpi.recs.reserve(nbTpiRecs);
		mergedIpi.recs.reserve(nbIpiRecs);

		// Do the actual type merge.
		ghashIndex = 0;
		nextUniqueIndex = uniqueTypes.begin();
forEachTypeChecked(typeRecords, [&](const CVType &ty) {		forEachTypeChecked(typeRecords, [&](const CVType &ty) {
if (nextUniqueIndex != uniqueTypes.end() &&		if (nextUniqueIndex != uniqueTypes.end() &&
*nextUniqueIndex == ghashIndex) {		*nextUniqueIndex == ghashIndex) {
mergeTypeRecord(beginIndex + ghashIndex, ty);		mergeTypeRecord(beginIndex + ghashIndex, ty);
++nextUniqueIndex;		++nextUniqueIndex;
}		}
++ghashIndex;		++ghashIndex;
});		});
▲ Show 20 Lines • Show All 482 Lines • Show Last 20 Lines