Page MenuHomePhabricator

[LLD][COFF] Avoid std::vector resizes during type merging
ClosedPublic

Authored by aganea on Jan 12 2021, 2:28 PM.

Details

Summary

As discussed in https://reviews.llvm.org/D87805#2491679 the MergedInfo::recs vector can get quite large and resizing it while remapping types creates unnecessary copies and virtual memory page faults.

This patch saves consistently approx. 0.6 sec (out of 18 sec) on a large dataset (400 MB EXE, 2 GB PDB).

With 12 hyper-threads:

Benchmark #1: before\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.939 s ±  1.215 s    [User: 2.7 ms, System: 3.5 ms]
  Range (min … max):   15.537 s … 18.597 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.298 s ±  1.511 s    [User: 1.4 ms, System: 8.9 ms]
  Range (min … max):   15.512 s … 18.513 s    10 runs

With 36 hyper-threads (thus using only one CPU socket):

Benchmark #1: before\lld-link.exe @link.rsp /threads:36
  Time (mean ± σ):     17.787 s ±  0.747 s    [User: 4.2 ms, System: 5.6 ms]
  Range (min … max):   15.666 s … 18.059 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp /threads:36
  Time (mean ± σ):     17.102 s ±  1.323 s    [User: 2.6 ms, System: 4.0 ms]
  Range (min … max):   15.175 s … 18.023 s    10 runs

With 72 hyper-threads (using two CPU sockets, slower because kernel locks now cross CPUs)

Benchmark #1: before\lld-link.exe @link.rsp
  Time (mean ± σ):     18.085 s ±  0.764 s    [User: 2.7 ms, System: 3.3 ms]
  Range (min … max):   15.918 s … 18.444 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     17.453 s ±  1.147 s    [User: 2.7 ms, System: 8.7 ms]
  Range (min … max):   15.766 s … 18.246 s    10 runs

Diff Detail

Event Timeline

aganea requested review of this revision.Jan 12 2021, 2:28 PM
aganea created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2021, 2:28 PM
aganea edited the summary of this revision. (Show Details)Jan 12 2021, 2:31 PM
aganea added a reviewer: rnk.
aganea edited the summary of this revision. (Show Details)
aganea updated this revision to Diff 316291.Jan 12 2021, 6:48 PM

Simplify.

rnk accepted this revision.Jan 13 2021, 9:30 AM

lgtm

This revision is now accepted and ready to land.Jan 13 2021, 9:30 AM
This revision was automatically updated to reflect the committed changes.
dblaikie added inline comments.
lld/COFF/DebugTypes.cpp
632–634

I know this is a bit of a stretch - but given the somewhat non-trivial extra work required to determine the capacity, I'm wondering if this issue might be partly due to this incremental/repeated resizing.

I'm guessing that maybe the resize is thwarting the usual growth function of the vectors (eg: resize brings the size up to exactly the size specified, whereas repeated push_back hits the usual growth factor to offset the next push_backs, etc). The notes here: https://en.cppreference.com/w/cpp/container/vector/reserve discuss this sort of issue with reserve, though I'm pretty sure it applies to resize as well.

If it's easy to do, I'd be curious to know whether removing the fix in this patch, and instead changing this resize+memcpy to loop+push_back or... oh, reference invalidation concerns. What about recs.insert(recs.begin(), recs.end()) ? Hmm, yeah, seems the spec for std::vector doesn't allow that, so I guess the nearest code would be:

for (int i = 0; i != newSize; ++i)
  merged.recs.push_back(merged.recs[i]);
rnk added inline comments.Jan 19 2021, 10:53 AM
lld/COFF/DebugTypes.cpp
632–634

When I chose to use resize, I remember checking some reference material to see if it would be subject to the O(n^2) behavior of using reserve. I don't remember what I looked at exactly, probably libc++ source. Whatever I looked at, I assured myself that repeated resizing isn't O(n^2). Perhaps the behavior is not required by the standard, and it varies by implementation.

dblaikie added inline comments.Jan 19 2021, 11:09 AM
lld/COFF/DebugTypes.cpp
632–634

Seems libstdc++ and libc++ at some version does indeed do the right thing:

$ clang++-tot test.cpp && ./a.out
initial capacity: 0
capacity after resize(500): 500
capacity after resize(501): 1000
capacity after resize(500) + 1 * push_back: 1000
$ clang++-tot test.cpp -stdlib=libc++ && ./a.out
initial capacity: 0
capacity after resize(500): 500
capacity after resize(501): 1000
capacity after resize(500) + 1 * push_back: 1000

But thinking about the code, it's always doubling anyway - so it sort of implements its own growth factor (ie: that first resize(500) in my example doesn't get any extra capacity because it's much bigger than the growth factor from the baseline size of 0, but the resize(501) does because it's less than the growth factor from the current capacity). So this code probably always gets only the size requested, because that size is always about as much as the growth factor would dictate from the previous size (ie: the resize probably produces the same number of growths as the push_back).

So maybe no benefit but no harm/no wins to be had by switching to push_back either, I guess.

Wouldn't mind the data if it's easy to gather, but yeah, I'm less confident it'll show any benefit.

aganea added inline comments.Jan 19 2021, 12:41 PM
lld/COFF/DebugTypes.cpp
632–634

One other way to fix this would be a growing allocator backed up by reserved, but uncommited memory pages. That would ideal in this situation I think.

We could reserve say 1 GB of virtual memory upfront for each container, and commit memory pages on-the-go, as the container grows. This prevents memory moves and keeps iterators stable. I'm not too sure if the std::allocator allows this kind of trick though, so it would have to be a custom container (or perhaps a variant of BumpPtrAllocator).

However that comes with another downside on Windows, which is that the VAD tree is deferred to a background System then when it's too large, I've mentionned that here: https://reviews.llvm.org/D86694#2277371 but in this case it's maybe less of an issue.

@dblaikie What is clang++-tot? What does ToT mean, I've seen this mentionned in GCC as well?

mstorsjo added inline comments.
lld/COFF/DebugTypes.cpp
632–634

Keep in mind that LLD still is usable in 32 bit builds, where one can't reserve 1 GB of memory like that. (It's a bit limited in such builds, it can't successfully link large images, but for common usage it's still quite ok.)

ToT is Top of Tree, i.e. latest git/svn version.

aganea added inline comments.Jan 19 2021, 12:53 PM
lld/COFF/DebugTypes.cpp
632–634

@mstorsjo That's a good point, thanks for raising that.

dblaikie added inline comments.Jan 19 2021, 1:49 PM
lld/COFF/DebugTypes.cpp
632–634

Oh, I wouldn't worry/don't mean to suggest going all that far - only reason I brought it up was in case the code could get the desired performance characteristics with less complexity. Sounds like probably not.