
[PDB] Merge types in parallel when using ghashing

Authored by rnk on Sep 16 2020, 4:30 PM.



This makes type merging much faster (-24% on chrome.dll) when multiple
threads are available, but it slightly increases the time to link (+10%)
when /threads:1 is passed. With only one more thread, the new type
merging is faster (-11%).

To give an idea, here is the /time output placed side by side:

                            BEFORE    | AFTER
Input File Reading:           956 ms  |  968 ms 
Code Layout:                  258 ms  |  190 ms 
Commit Output File:             6 ms  |    7 ms 
PDB Emission (Cumulative):   6691 ms  | 4253 ms 
  Add Objects:               4341 ms  | 2927 ms 
    Type Merging:            2814 ms  | 1269 ms  -55%!
    Symbol Merging:          1509 ms  | 1645 ms 
  Publics Stream Layout:      111 ms  |  112 ms 
  TPI Stream Layout:          764 ms  |   26 ms  trivial
  Commit to Disk:            1322 ms  | 1036 ms  -300ms

Total Link Time:             8416 ms  | 5882 ms  -30% overall

The main source of the additional overhead in the single-threaded case
is the need to iterate all .debug$T sections up front to check which
type records should go in the IPI stream. See fillIsItemIndexFromDebugT.
With changes to the .debug$H section, we could pre-calculate this info
and eliminate the need to do this walk up front. That should restore
single-threaded performance back to what it was before this change.

This change will cause LLD to be much more parallel than it used to be, and
for users who do multiple links in parallel, it could regress
performance. However, when the user is only doing one link, it's a huge
improvement. In the future, we can use NT worker threads to avoid
oversaturating the machine with work, but for now, this is such an
improvement for the single-link use case that I think we should land
this as is.


Before this change, we essentially used a
DenseMap<GloballyHashedType, TypeIndex> to check if a type has already
been seen, and if it hasn't been seen, insert it now and use the next
available type index for it in the destination type stream. DenseMap
does not support concurrent insertion, and even if it did, the linker
must be deterministic: it cannot produce different PDBs by using
different numbers of threads. The output type stream must be in the same
order regardless of the order of hash table insertions.

In order to create a hash table that supports concurrent insertion, the
table cells must be small enough that they can be updated atomically.
The algorithm I used for updating the table using linear probing is
described in the paper "Concurrent Hash Tables: Fast and General(?)!".

The GHashCell in this change is essentially a pair of 32-bit integer
indices: <sourceIndex, typeIndex>. The sourceIndex is the index of the
TpiSource object, and it represents an input type stream. The typeIndex
is the index of the type in the stream. Together, we have something like
a ragged 2D array of ghashes, which can be looked up as ghashes[sourceIndex][typeIndex].


By using these side tables, we can omit the key data from the hash
table, and keep the table cell small. There is a cost to this: resolving
hash table collisions requires many more loads than simply looking at
the key in the same cache line as the insertion position. However, most
supported platforms should have a 64-bit CAS operation to update the
cell atomically.

To make the result of concurrent insertion deterministic, the cell
payloads must have a priority function. Defining one is pretty
straightforward: compare the two 32-bit numbers as a combined 64-bit
number. This means that types coming from inputs earlier on the command
line have a higher priority and are more likely to appear earlier in the
final PDB type stream than types from an input appearing later on the
link line.

After table insertion, the non-empty cells in the table can be copied
out of the main table and sorted by priority to determine the ordering
of the final type index stream. At this point, item and type records
must be separated, either by sorting or by splitting into two arrays,
and I chose sorting. This is why the GHashCell must contain the isItem bit.

Once the final PDB TPI stream ordering is known, we need to compute a
mapping from source type index to PDB type index. To avoid starting over
from scratch and looking up every type again by its ghash, we save the
insertion position of every hash table insertion during the first
insertion phase. Because the table does not support rehashing, the
insertion position is stable. Using the array of insertion positions
indexed by source type index, we can replace the source type indices in
the ghash table cells with the PDB type indices.

Once the table cells have been updated to contain PDB type indices, the
mapping for each type source can be computed in parallel. Simply iterate
the list of cell positions and replace them with the PDB type index,
since the insertion positions are no longer needed.

Once we have a source to destination type index mapping for every type
source, there are no more data dependencies. We know which type records
are "unique" (not duplicates), and what their final type indices will
be. We can do the remapping in parallel, and accumulate type sizes and
type hashes in parallel by type source.

Lastly, TPI stream layout must be done serially. Accumulate all the type
records, sizes, and hashes, and add them to the PDB.

Depends On D87736

Diff Detail

Event Timeline

rnk created this revision.Sep 16 2020, 4:30 PM
Herald added a project: Restricted Project.Sep 16 2020, 4:30 PM
rnk requested review of this revision.Sep 16 2020, 4:30 PM
rnk updated this revision to Diff 292629.Sep 17 2020, 2:50 PM
  • rebase over previous changes
  • fix funcIdToType map for /Yu files

Thanks a lot for working on this Reid, really neat!

This is a huge change. How do you validate that the resulting PDB is equivalent to that of the previous version (before this patch) or to non-ghash merging?


Fun fact: microsoft-pdb does (mistakenly?) < not <=. However, it does reserve cbRecMax bytes (L942).


We're doing this all over the place, it'd be nice to eventually converge all variants (later).


I am wondering if this doesn't belong in a new file? Since the code is quite small, we could possibly have different implementations (in the future), depending on the dataset. 32-bit, 64-bit or 128-bit with no ghash indirection (if the CPU supports 128-bit CAS).


I can't help thinking that this smells like Clang's SourceManager index, where all sources are collated into a single index (not a 2D array).
If you did that, it would reduce the size of the cell data to 32-bit, iff we limit ourselves to 2^32 input records.
Am I being too optimistic? ;-)


It'd be interesting to collect statistics on how many collisions you get. And also compare linear (current behavior) vs. quadratic probing.

One issue I can see is that since the table will be 99.9% full at the end of the insertion pass, there will be lots of collisions toward the end. What about making the table 25% bigger, like DenseHash does?


Looks like the limit in the PDB is 28-bit wide indices, probably because the PDB limit is 4 GB and because the smallest type record cannot be less than 8 bytes (4-byte header + 1 byte payload + padding).

In practice, I never saw more than a few tens of millions of type records in a 2-GB PDB. It is very unlikely that we'll ever reach this 28-bit limit.

However in this case you're talking about the cumulative (input) records count, right? That can be pretty large, I've seen 1 billion input type records (when we link our games without Unity/Jumbo files).

How many input type records do you see on your largest EXE/DLL? (we could add the total input type records count to /summary)




Remove ;


After this patch, /DEBUG:GHASH could become the new default?


I know @MaskRay recently changed all < to cmd-line input and > to -o. Do you need < > here?


Since you probably already know how many records/hashes you're inserting, can you .reserve() TypeRecBuffers and TypeHashes in advance?


Same here (.reserve).

rnk marked an inline comment as done.Sep 18 2020, 3:20 PM

Thanks a lot for working on this Reid, really neat!

Thanks. :)

This is a huge change. How do you validate that the resulting PDB is equivalent to that of the previous version (before this patch) or to non-ghash merging?

Truly, I think I haven't done enough validation. But, the resulting PDB should actually be identical: it should contain the same type records, in the same order as before.


  • More validation
  • Look at that stale comment
  • Stale ; } thing
  • Maybe gather more numbers in other scenarios (concurrent links)

Right, I remember this was a source of difference between clang type records and MSVC type records. This comes up pretty regularly in the LF_FIELDLIST record of a long enum (LLVM intrinsics) for example. With an off-by-one error, you get cascading differences. It's not really a goal for the compiler to emit byte-identical types with MSVC, though, it just results in extra type info.


It's true. This patch does reimplement a lot of library code, rather than improving the library, which is unfortunate. I just found it really difficult to restructure the library in a way that would still be high performance.


Maybe it does, but I really wanted GHashTable::insert to get internal linkage from the anonymous namespace. If this becomes a template, then it matters less.


Hah, this comment is stale. The table actually doesn't support lookup at all anymore. It used to, before I figured out the trick of saving the insert position from the parallel insert step.


It's an idea, but it's expensive to decompose a SourceLocation into a file id and file offset. However... we could build one giant array of ghashes indexed by this new combined index. This would copy all .debug$H sections, but it could be worth it. This would save a level of indirection during ghash insertion collision resolution, which might be worth a lot. Hm.

Another thing to consider is that MSVC emits many more types than clang. Users mostly use /Zi, which pre-duplicates them, but if they use /Z7, it would probably break this 32-bit limit on the number of input type records. There are already perf issues with lld /debug:ghash + cl /Z7 (extra .debug$T pass), so maybe it's not worth worrying about.


I don't have collision stats, but I can say that the load factor in the tests I was using goes from 70% (small PDBs) to 14% (big programs, lots of duplicate types to eliminate). So, the more inputs you run it on, the more memory gets allocated, the fewer collisions there are, and the shorter the chains are.


Yeah, this is input records. This table size ends up being really large and this allocates a lot of memory, but remember, the .debug$T was in theory already memory mapped anyway, and this hash table is smaller than that: 8 bytes per cell vs. a minimum of 8 bytes per record.

I logged the load factor and capacity of the table later, and this is what I got for chrome.dll:

lld-link: ghash table load factor: 26.25% (size 17307224 / capacity 65942084)

That is 65,942,084 input type records, and essentially 73.75% of them ended up being duplicates.


We don't know how many cells are empty until we iterate over the table. The load factor varies widely depending on the input. I think it's better in this case to dynamically resize.


I hope so, but let's do it separately.


Oh, I probably flubbed the conflict resolution. No reason.


I can't, this is the old entry point API, which takes one type record from the caller at a time.


I could reserve here, but that might actually defeat the dynamic resizing. Consider that this loop is N^2:

std::vector<int> vec;
for (int i = 0; i < n; ++i) {
  vec.reserve(vec.size() + 1); // reallocates (and copies) nearly every time
  vec.push_back(i);            // overall O(n^2) instead of amortized O(n)
}

addTypeRecords gets called for each TpiSource, so we would end up reallocating the vector for every type source that contributes types, and maybe not increasing the size enough to remain O(n). IMO it's better to let resizing do its thing here.

In D87805#2283132, @rnk wrote:

Truly, I think I haven't done enough validation. But, the resulting PDB should actually be identical: it should contain the same type records, in the same order as before.

I suppose a simple way could be to temporarily retain the old algorithm and have them both run side-by-side and assert if things are different.


This would copy all .debug$H sections

I am wondering if we couldn't combine all ghashes into a contiguous virtual range of memory (both the pre-existing .debug$H and the locally computed, "owned" ones). The MapViewOfFile2/3 APIs allow changing the destination BaseAddress. There will be some dangling data around .debug$H mappings because the mapping only works on 64K-ranges, but it's maybe better than copying around a few GB worth of .debug$H sections (which also implies duplicating the memory storage for ghashes, because of the file mapping, unless we munmap after each copy).

There are already perf issues with lld /debug:ghash + cl /Z7 (extra .debug$T pass)

Like I mentioned in D55585, once ghash computation is parallelized, it is faster on a 6-core to use /DEBUG:GHASH rather than the default /DEBUG. Were you thinking of anything else, when you say "there are already perf issues"? We've been using MSVC cl+LLD+D55585 for a long time and the timings of LLD are close to that of Clang+LLD.


It is clearer now, thanks. I am wondering if LLD could let the user know of an optimal table size, and let them provide that value on the cmd-line. But then it is a trade-off between the increased number of collisions (which imply an extra ghash indirection) and the smaller table size which would reduce cache misses. Just thinking out loud.


Never mind, I misunderstood how the algorithm worked. It is clear now. For some reason I thought you were constructing equivalence classes.




Yes, you're right.

aganea added a comment.EditedSep 28 2020, 12:24 PM

Some figures with this patch (6-core Xeon) -- link times only:

                             VS2019 16.6.3 link.exe | LLD 10 (/DEBUG) | LLD 12 + this patch (/DEBUG) | LLD 12 + this patch (/DEBUG:GHASH)
Game - Editor MSVC:          1 min 2 sec | 51 sec | 27 sec
Game - Editor Clang:         28 sec | 23 sec | 16 sec
Game - Engine Release Clang: 20 sec | 17 sec | 12 sec
Game - Engine Retail Clang:  17 sec | 15 sec | 11 sec

This needs to be:

Expected<TypeServerSource *> tsSrc = getTypeServerSource();
if (!tsSrc)
  return; // ignore errors at this point.

Since a missing PDB is not an error, we just won't have types & symbols for that .OBJ - and we're already handling that later in mergeDebugT. Could you also please modify pdb-type-server-missing.yaml as it lacks /DEBUG:GHASH coverage, which should catch this case?


Note for future development: It would be nice to support other kinds of hashers in .debug$H. SHA1 is not the best choice; xxHash64 seems like a better solution. I've also tried MeowHash, and since it uses AES instructions it runs pretty much at memory bandwidth speed.


Can you please add a timer for this part? (just the ghash generation for all files)

rnk added a comment.Sep 28 2020, 1:18 PM

Some figures with this patch (6-core Xeon) -- link times only:

Thanks! For the MSVC configuration, what debug info flags are used for the compile? /Zi or /Z7, and is /Yu used?

I'm a bit distracted by other things, so I haven't done more validation on this yet.


Regarding the MapViewOfFile APIs, maybe. But it might be cheaper to copy the memory into huge pages anyway.

Regarding the perf issues with MSVC /Z7, I mean that MSVC /Z7 objects tend to be truly massive, containing many duplicate types. Those massive objects are usually slow to link. MSVC users typically use /Zi or /Yu to pre-deduplicate some of those types. If they were to use /Z7 instead, there might be more than 4 billion input type records, meaning we can't create a single 32-bit input type index space. But, if you have 4 billion input type records, you already have a size problem, and you can fix it by using either /Zi or clang-cl, which emits less type info.


I suppose the way to do this would be to receive GloballyHashedType as a template parameter. Probably necessary, but I worked so hard to make this code untemplated. :)

rnk marked 2 inline comments as done.Sep 29 2020, 5:23 PM

The next patch has some changes to make the output PDB identical to the old ghash PDB. I had to sort the dependency type sources to the front so that dependency types appear earlier in the final type stream. I did a few links and MD5'd the before and after PDBs with /Brepro, and they are the same. I'm reasonably confident in this at this point.

Hopefully I addressed all the outstanding comments. I think this is probably ready to land, and then we should probably make ghashing the default, since it's basically faster than the standard type merging implementation at this point. I don't have exhaustive testing for /Zi objects, though, so you might want to do some validation there.


Done, and added the coverage.


I updated the comments here.


I didn't end up collecting more stats, but the load factor is in the /verbose output if you want to check.


Sure, the new output looks like:

  Input File Reading:          7367 ms ( 25.9%)
  Code Layout:                 1434 ms (  5.0%)
  Commit Output File:            44 ms (  0.2%)
  PDB Emission (Cumulative):  17956 ms ( 63.2%)
    Global Type Hashing:        651 ms (  2.3%)
    GHash Type Merging:        2533 ms (  8.9%)
    Add Objects:              10098 ms ( 35.5%)
      Symbol Merging:          6882 ms ( 24.2%)
    Publics Stream Layout:     1027 ms (  3.6%)
    TPI Stream Layout:          111 ms (  0.4%)
    Commit to Disk:            5410 ms ( 19.0%)
Total Link Time:              28427 ms (100.0%)
rnk updated this revision to Diff 295154.Sep 29 2020, 5:28 PM
rnk marked 2 inline comments as done.
  • respond to comments
  • sort dependency sources first
aganea accepted this revision.Sep 30 2020, 1:59 PM

Thanks again!

LGTM with a few minor things (please see inlines below).

In D87805#2299073, @rnk wrote:

For the MSVC configuration, what debug info flags are used for the compile? /Zi or /Z7, and is /Yu used?

/Z7 because of the distribution and caching, and /Yc /Yu for locally-compiled files. We also link with a bunch of third-party libs (not ours) that are built with /Zi.


Isn't all this equivalent to llvm::stable_sort(instances, [](auto *a, auto *b) { return a->isDependency() < b->isDependency(); });?


Just fyi, there's a bug here when dealing with PCH (it was there since rG54a335a2f60b0f7bb85d01780bb6bbf653b1f399).
L318 should be added with: unsigned nbHeadRecords = indexMapStorage.size();
L335 should be: uint32_t srcIdx = nbHeadRecords;


Remove this line and the one after.

This revision is now accepted and ready to land.Sep 30 2020, 1:59 PM
rnk added a comment.Sep 30 2020, 2:10 PM

Thanks again!

LGTM with a few minor things (please see inlines below).


In D87805#2299073, @rnk wrote:

For the MSVC configuration, what debug info flags are used for the compile? /Zi or /Z7, and is /Yu used?

/Z7 because of the distribution and caching, and /Yc /Yu for locally-compiled files. We also link with a bunch of third-party libs (not ours) that are built with /Zi.

I see. But you are using unity builds, right? Each /Z7 object will contain the world of STL types (std::string), but because of the unity build, the multiplier (number of objects) isn't as large.


Almost (reverse order), but I need to loop over the sources anyway to count how many dependencies there are to build the ArrayRefs. I figured it was better to write one loop to do both things. Maybe that's too much micro-optimization.


I wonder what is causing this. It seems related to the arcanist + clang-format presubmit, so it gets added when I upload a diff.

aganea added a comment.EditedSep 30 2020, 2:12 PM

I've tried building one of our worst targets with MSVC. That is with /Z7, Unity/Jumbo files, static executable:

           4862 Input OBJ files (expanded from all cmd-line inputs)
             61 PDB type server dependencies
             38 Precomp OBJ dependencies
   >> 131346190 Input type records <<
        7993312 Merged TPI records
        2810451 Merged IPI records
          58136 Output PDB strings
        8313160 Global symbol records
       23959902 Module symbol records
        2098602 Public symbol records

That's 131 million type records, out of about 26 GB of .OBJ files.


VS2019 16.6.3 link.exe:              1 min 2 sec
LLD 12 + this patch (/DEBUG):        51 sec
LLD 12 + this patch (/DEBUG:GHASH):  21 sec

Now the same target NoUnity/NoJumbo:

          21351 Input OBJ files (expanded from all cmd-line inputs)
             61 PDB type server dependencies
             38 Precomp OBJ dependencies
  >> 1420669231 Input type records <<
    78665073382 Input type records bytes
        8801393 Merged TPI records
        3177158 Merged IPI records
          59194 Output PDB strings
       71576766 Global symbol records
       25416935 Module symbol records
        2103431 Public symbol records

That's 1.4 billion type records, spanning over more than 100 GB of .OBJ files. We never compile this target with debug info, because historically it wasn't even linking (link.exe was crashing, since VS2019 it seems to be fine). This is only a validation target for the build system. The resulting PDB is 2.6 GB.


VS2019 16.6.3 link.exe:              51 min 4 sec
LLD 12 + this patch (/DEBUG):        11 min 21 sec
LLD 12 + this patch (/DEBUG:GHASH):  6 min 38 sec

With NoUnity/NoJumbo, memory commit peaks with link.exe at over 125 GB and LLD at about 115 GB.

Given these figures, I'm not sure how we can ever reach 2^32 input records. That would be several hundreds of GB of OBJ files, and would need 256 or 512 GB of RAM to link. That could happen if the PCH files were disabled maybe. But then compilation times would skyrocket, and so would the link times.

(all timings on a 6-core Xeon)

This revision was landed with ongoing or failed builds.Sep 30 2020, 2:22 PM
This revision was automatically updated to reflect the committed changes.
aganea added inline comments.Jan 11 2021, 4:03 PM

@rnk: Just to follow up: the .resize() here takes 3.5 sec out of 74 sec (cumulated thread time on 72 hyper-threads).

I've modified the code to do instead two passes, then .reserve(), and that saves about 0.6 sec median walltime. Although I think it is better to wait on prefetching mmap'ed memory pages first.

Benchmark #1: before\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.939 s ±  1.215 s    [User: 2.7 ms, System: 3.5 ms]
  Range (min … max):   15.537 s … 18.597 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp /threads:12
  Time (mean ± σ):     17.298 s ±  1.511 s    [User: 1.4 ms, System: 8.9 ms]
  Range (min … max):   15.512 s … 18.513 s    10 runs

As you see, there's also quite some variability in execution time, mostly because of the contention issues that I've mentioned in D94267.

rnk added inline comments.Jan 12 2021, 9:31 AM

I see, makes sense. I think it's worth doing something about this to avoid the memcpy resize overhead. We don't really need a flat buffer of type records here, I just thought it would be more efficient to flatten them than to have separate allocations for each type when most of them are small.


Doing an up front size calculation requires iterating all types in the TpiSource twice. The second pass will probably be hot in cache, so it's probably faster to precalculate the size with two iterations than it is to use a different, dynamic buffering strategy.