This is an archive of the discontinued LLVM Phabricator instance.

RFC: [LLD][COFF] Parallel GHASH generation at link-time -- NOT FOR SUBMIT
Abandoned · Public

Authored by aganea on Dec 11 2018, 2:46 PM.

Details

Summary

This is just a proof-of-concept, meant to demonstrate generating GHASHes at link-time (in parallel).
In my test, all the source OBJs are from MSVC, so there's no prior GHASH stream.

All in all, things are about 5 sec faster (in my large DLL test), even though we're generating GHASHes at link time.
The end gain also comes from the Types/IDs hash tables, which are much faster with GHASHes.

I've also optimized the Type hash table (through the new GlobalTypeDenseMap class) by making the buckets smaller (8 bytes vs. 12 bytes for regular GHASH). This makes merging about 35% faster.

I've also thrown in parallel sorting of the globals stream, which makes that pass 2x faster.
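
For reference, a minimal sketch of what parallel sorting of the publics could look like, using llvm::parallelSort from llvm/Support/Parallel.h; the PublicSym record and sortPublics helper below are illustrative placeholders, not the actual code in this patch:

  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/Parallel.h"
  #include <vector>

  // Illustrative stand-in for the PDB public symbol records.
  struct PublicSym {
    llvm::StringRef Name;
  };

  // Same comparator a sequential std::sort would use, but the work is
  // split across the available hardware threads.
  static void sortPublics(std::vector<PublicSym> &Publics) {
    llvm::parallelSort(Publics.begin(), Publics.end(),
                       [](const PublicSym &L, const PublicSym &R) {
                         return L.Name < R.Name;
                       });
  }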

Before patch, with regular Type merging:

-------------------------------------------------
  Input File Reading:          1658 ms (  4.7%)  
  Code Layout:                  621 ms (  1.8%)  
  PDB Emission (Cumulative):  30380 ms ( 86.7%)  
    Add Objects:              22615 ms ( 64.6%)
      Type Merging:           19205 ms ( 54.8%)
      Symbol Merging:          3385 ms (  9.7%)  
    TPI Stream Layout:          897 ms (  2.6%)  
    Globals Stream Layout:     1418 ms (  4.1%)  
    Commit to Disk:            4559 ms ( 13.0%)  
  Commit Output File:          1717 ms (  4.9%)  
-------------------------------------------------
Total Link Time:              35021 ms (100.0%)

With this patch, GHASH-only merging (12-byte buckets, sequential globals sort):

------------------------------------------------
  Input File Reading:          1647 ms (  5.4%)
  Code Layout:                  576 ms (  1.9%)
  PDB Emission (Cumulative):  27537 ms ( 89.6%)
    Add Objects:              21088 ms ( 68.6%)
      Global hashing:         10723 ms ( 34.9%)  <<<< parallel
      Type Merging:            7419 ms ( 24.1%)  <<<< 12-byte buckets
      Symbol Merging:          2861 ms (  9.3%)
    TPI Stream Layout:          941 ms (  3.1%)
    Globals Stream Layout:     1545 ms (  5.0%)  <<<< no parallel
    Commit to Disk:            3184 ms ( 10.4%)
  Commit Output File:           353 ms (  1.1%)
------------------------------------------------
Total Link Time:              30728 ms (100.0%)

With this patch, GHASH-only merging (8-byte buckets, parallel globals sort):

------------------------------------------------
  Input File Reading:          1620 ms (  5.6%)
  Code Layout:                  598 ms (  2.1%)
  PDB Emission (Cumulative):  23715 ms ( 81.6%)
    Add Objects:              17933 ms ( 61.7%)
      Global hashing:          9734 ms ( 33.5%)  <<<< parallel
      Type Merging:            5293 ms ( 18.2%)  <<<< 8-byte buckets
      Symbol Merging:          2823 ms (  9.7%)
    TPI Stream Layout:          900 ms (  3.1%)
    Globals Stream Layout:      953 ms (  3.3%)  <<<< parallel
    Commit to Disk:            3161 ms ( 10.9%)
  Commit Output File:          2512 ms (  8.6%)
-------------------------------------------------
Total Link Time:              29067 ms (100.0%)

Sorry for the messy patch, I am just looking for overall advice. If this is the right direction, I'll split the patch into smaller pieces.

Diff Detail

Event Timeline

aganea created this revision.Dec 11 2018, 2:46 PM

The source OBJ/LIBs are ~10GB.
Total records across all OBJs: 88,932,692
Merged types: 3,539,517 - indices: 1,194,366
The final DLL is 141MB, the PDB is ~1GB.

Timings are on a 3.5 GHz, 6-core Intel Xeon Haswell.

rnk added a comment.Dec 21 2018, 3:12 PM

Neat! I didn't have a chance to review with a fine-toothed comb, but I like the idea, and here's how I analyze it. We've already talked about how merging a type record without ghash is O(size of type), and with a content hash, you can make that O(size of hash), i.e. look at 8 or 20 bytes instead of up to 64K. But computing the hash requires reading the type record data, so we made the compiler do it and put it in .debug$H.
If one can't do that because MSVC is being used, parallelizing the hash computation makes it so that the linker only takes O(size of all types / n cores) wall time to compute the hashes, and then it does O(# types) (not size of types) work to deduplicate them.
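
A rough sketch of that shape — parallel per-object hashing followed by a single-threaded dedup over the precomputed 8-byte hashes; the types and helper names below are placeholders, not the patch's actual API:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/Support/Parallel.h"
  #include "llvm/Support/xxhash.h"
  #include <cstdint>
  #include <vector>

  // Placeholder for one object file's type records.
  struct ObjTypes {
    std::vector<llvm::ArrayRef<uint8_t>> Records; // raw CodeView records
    std::vector<uint64_t> Hashes;                 // filled in below
  };

  void mergeTypes(std::vector<ObjTypes> &Objs) {
    // Parallel part: O(total size of type records / #cores) wall time.
    llvm::parallelForEach(Objs, [](ObjTypes &Obj) {
      Obj.Hashes.reserve(Obj.Records.size());
      for (llvm::ArrayRef<uint8_t> Rec : Obj.Records)
        Obj.Hashes.push_back(llvm::xxHash64(Rec));
    });

    // Sequential part: O(#records), each lookup touches 8 bytes of hash
    // instead of up to 64K of record data. (A real implementation must
    // guard against hashes colliding with DenseMap's reserved keys.)
    llvm::DenseMap<uint64_t, uint32_t> Merged; // hash -> merged type index
    for (ObjTypes &Obj : Objs)
      for (uint64_t H : Obj.Hashes)
        Merged.try_emplace(H, Merged.size());
  }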

So it looks like this is the list of things to do:

  1. custom ghash densemap: this is probably the most valuable, but probably the most complex and needs the most work.
  2. parallelize publics sorting: small enough to send and commit separately ASAP
  3. parallelize ghash fetch or computation: actually small in isolation as well, let's do it, I like the code for it

Another thing about the custom hash table is, we might end up ripping it out if we try to parallelize type merging.

lld/trunk/COFF/PDB.cpp
345–1

Wow, I'm surprised that matters. Feel free to send it separately, it seems like a small win.

382

All work up to here (except for dependency loading) can be kicked off from SymbolTable::addFile, which is the earliest point that we know we are including an object file in the link. The way it is written here, it's easy to understand that we are computing or fetching hashes in parallel. That clarity is worth something. Do you think it's worth restructuring things so that addFile is responsible for starting ghash computation and loading PDB and PCH dependencies? I think it might help reduce link time further by overlapping independent work, but it might make things unnecessarily complicated. Maybe it's best to keep things this way for now.

387–401

We need to find a way to parallelize type merging... but that is future work.

llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h
78

I haven't read this implementation yet, it's quite long, but broadly I'm in favor of having a custom hash table here. This is the most performance critical thing LLD does.

This seems like a good separable change, since this map isn't actually used in parallel.

343

How much do huge pages matter relative to the custom hash table?

rnk added a comment.Dec 21 2018, 3:35 PM

One more thing I just realized that was probably obvious to you: Computing ghash in parallel completely eliminates the need for the /debug:ghash flag, and we can eliminate the non-ghash codepath. That's great.

aganea marked 4 inline comments as done.Dec 29 2018, 5:56 PM
In D55585#1339859, @rnk wrote:

One more thing I just realized that was probably obvious to you: Computing ghash in parallel completely eliminates the need for the /debug:ghash flag, and we can eliminate the non-ghash codepath. That's great.

Yes, and going further, it might eliminate the need for GHASH streams altogether, if porting this step to the GPU works out well (currently still WIP).

lld/trunk/COFF/PDB.cpp
345–1

Ok - I'll send another patch. Some of our large EXEs can take up to 3.5sec in this "Globals" step.

382

I was thinking about that. Maybe as a subsequent change, once all this lands?

387–401

I already have a good idea of what to do to make GlobalTypeDenseMap thread-safe and lock-free. It's more a matter of finding the time to do it :-)

llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h
78

It's mostly a copy of DenseMap, but without tombstones, and with a different DenseHashInfo API.
I was wondering if an insert-only DenseMap would be useful in other parts of LLVM/Clang?
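
A simplified illustration of the 8-byte bucket idea (assuming the low hash bits select the bucket slot and are therefore not stored again, leaving room for the merged TypeIndex in the same 64-bit word); the exact bit split in GlobalTypeDenseMap may differ from this sketch:

  #include <cassert>
  #include <cstdint>

  // A classic ghash bucket stores the full 64-bit hash plus a 32-bit
  // TypeIndex (12 bytes). Here the low bits of the hash are implied by the
  // bucket position, so one 64-bit word is enough - and it can be read or
  // written with a single atomic access.
  struct GHashBucket {
    uint64_t Data = 0; // [ upper 32 hash bits | 32-bit merged TypeIndex ]
  };

  inline uint64_t packBucket(uint64_t Hash, uint32_t TypeIndex) {
    return (Hash & 0xFFFFFFFF00000000ULL) | TypeIndex;
  }

  inline uint64_t bucketIndexFor(uint64_t Hash, uint64_t NumBuckets) {
    assert((NumBuckets & (NumBuckets - 1)) == 0 && "must be a power of two");
    return Hash & (NumBuckets - 1); // low hash bits select the slot
  }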

aganea marked an inline comment as done.Jan 11 2019, 9:33 AM
aganea added inline comments.
llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h
343

It's quite significant:

  without 2MB pages   Type Merging: 6588 ms ( 23.0%)
  with 2MB pages      Type Merging: 4856 ms ( 19.3%)

I only removed the flag sys::Memory::MF_HUGE for this test.

Here are some stats for the data used:

                                    Summary
--------------------------------------------------------------------------------
            156 Input OBJ files (expanded from all cmd-line inputs)
              0 Dependent PDB files
              1 Dependent PCH OBJ files
       81556098 Input type records (across all OBJ and dependencies)
     5108516032 Input type records bytes (across all OBJ and dependencies)
        4588516 Output merged type records
       10067321 Output merged symbol records
          23157 Output PDB strings

This is the perfect use case for large pages: a large contiguous structure, accessed randomly, in a tight loop. The GlobalTypeDenseMap fits 2MB pages perfectly. In this precise test case, the hash table is 64MB (32x 2MB pages), which also happens to fit within the max DTLB slots on modern Intel CPUs. If my understanding is correct, TLB slots for large pages come in addition to the 4KB-page slots (at least for the L1 DTLB).

I think I'll make this sys::Memory::MF_HUGE flag a hint. On many OSes you need to manually enable large pages (at least on Windows 10 and Linux), so they might not be available by default. Even then, on Windows at least, large pages are physical-only (not swappable). When the flag is specified, Memory::allocateMappedMemory should only "try" to use large pages, and fall back to regular (4KB) pages otherwise.
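
A sketch of that hint-with-fallback behavior; note that MF_HUGE is the flag proposed by this patch (not an existing trunk flag), so it is passed in here as a plain unsigned to keep the example self-contained:

  #include "llvm/Support/Memory.h"
  #include <system_error>

  using namespace llvm;

  // Try to back the hash table with large (2MB) pages; if the OS refuses
  // (large pages disabled, not enough contiguous physical memory, ...),
  // silently fall back to regular 4KB pages.
  static sys::MemoryBlock allocateTableMemory(size_t NumBytes,
                                              unsigned HugePageFlag) {
    std::error_code EC;
    sys::MemoryBlock MB = sys::Memory::allocateMappedMemory(
        NumBytes, /*NearBlock=*/nullptr,
        sys::Memory::MF_READ | sys::Memory::MF_WRITE | HugePageFlag, EC);
    if (!EC)
      return MB;
    // Fall back: same request without the large-page hint.
    return sys::Memory::allocateMappedMemory(
        NumBytes, /*NearBlock=*/nullptr,
        sys::Memory::MF_READ | sys::Memory::MF_WRITE, EC);
  }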

aganea updated this revision to Diff 181365.EditedJan 11 2019, 2:45 PM
aganea added subscribers: mstorsjo, thakis.

Rebased on r350764.

I've played around a bit with different hashing algorithms. I've added a flag /hasher:(sha1|md5|cityhash|xxhash|crc) and /summary along the way.
Figures are on a 3.5 GHz, 6-core Intel Xeon Haswell:

Dataset 1, large DLL, the resulting PDB is 950 MB:

(at r350764) lld-link                                                                Type Merging: 20816 ms ( 59.4%)   Total Link Time: 35065 ms
(this patch) lld-link ... /hasher:sha1 (default)  Global hashing:  9635 ms ( 34.3%)  Type Merging:  4814 ms ( 17.1%)   Total Link Time: 28128 ms
(this patch) lld-link ... /hasher:md5             Global hashing:  5658 ms ( 23.4%)  Type Merging:  4813 ms ( 19.9%)   Total Link Time: 24137 ms
(this patch) lld-link ... /hasher:cityhash        Global hashing:  3640 ms ( 16.5%)  Type Merging:  4822 ms ( 21.8%)   Total Link Time: 22120 ms
(this patch) lld-link ... /hasher:xxhash          Global hashing:  3594 ms ( 16.4%)  Type Merging:  4730 ms ( 21.6%)   Total Link Time: 22896 ms
(this patch) lld-link ... /hasher:crc             Global hashing:  3276 ms ( 14.5%)  Type Merging:  4886 ms ( 21.6%)   Total Link Time: 22626 ms
                                    Summary
--------------------------------------------------------------------------------
            156 Input OBJ files (expanded from all cmd-line inputs)
              0 Dependent PDB files
              1 Dependent PCH OBJ files
       81556098 Input type records (across all OBJ and dependencies)
     5108516032 Input type records bytes (across all OBJ and dependencies)
        4588516 Output merged type records
       10067321 Output merged symbol records
          23157 Output PDB strings

And with a slightly larger input the difference becomes more apparent.
Dataset 2, large EXE, the resulting PDB is 2 GB:

(r350764) lld-link                                                                   Type Merging: 38413 ms ( 56.8%)   Total Link Time: 67609 ms
(this patch) lld-link ... /hasher:sha1 (default)  Global hashing: 15807 ms ( 29.7%)  Type Merging:  9906 ms ( 18.6%)   Total Link Time: 53135 ms
(this patch) lld-link ... /hasher:md5             Global hashing:  8354 ms ( 17.9%)  Type Merging:  9977 ms ( 21.3%)   Total Link Time: 46745 ms
(this patch) lld-link ... /hasher:cityhash        Global hashing:  6077 ms ( 13.7%)  Type Merging:  9957 ms ( 22.5%)   Total Link Time: 43291 ms
(this patch) lld-link ... /hasher:xxhash          Global hashing:  5783 ms ( 13.2%)  Type Merging:  9912 ms ( 22.7%)   Total Link Time: 43695 ms
(this patch) lld-link ... /hasher:crc             Global hashing:  5587 ms ( 12.9%)  Type Merging:  9938 ms ( 23.0%)   Total Link Time: 43206 ms
                                    Summary
--------------------------------------------------------------------------------
           4768 Input OBJ files (expanded from all cmd-line inputs)
             70 Dependent PDB files
             27 Dependent PCH OBJ files
      142150698 Input type records (across all OBJ and dependencies)
     8623310584 Input type records bytes (across all OBJ and dependencies)
        9699343 Output merged type records
       33727100 Output merged symbol records
          48382 Output PDB strings

In light of all this, does it still make sense to compute and emit GHASH streams in clang? I'm pretty sure that the cost of serialization and I/O for those streams would be much higher than just computing GHASHes on-the-fly in LLD. The only marginal benefit would be for incremental linking, and even that is debatable.

If you have no major concerns over all this, I'll start sending smaller patches.

aganea added a comment.EditedJan 14 2019, 2:02 PM

Just for reference, added timings for /hasher:(xxhash|crc) in the comment above.

xxHash is using llvm::xxHash().
crc is a naive implementation using two concatenated CRC32 values, computed with the SSE4.2 intrinsic _mm_crc32_u32 - thus only two instructions per 4 bytes. However, this creates about 0.0001% hash collisions, which would not be suitable for this application.
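
For illustration, roughly what such a two-lane CRC32 hash could look like; the seeds and the handling of the record tail here are assumptions, not necessarily what the patch does:

  #include <cstdint>
  #include <cstring>
  #include <nmmintrin.h> // SSE4.2 CRC32 intrinsics

  // Two CRC32C lanes with different seeds, concatenated into 64 bits.
  // Only two crc32 instructions per 4 bytes of input, but the observed
  // collision rate (~0.0001%) makes it unsuitable for type deduplication.
  static uint64_t crcGHash(const uint8_t *Data, size_t Size) {
    uint32_t A = 0xFFFFFFFFu, B = 0x12345678u; // two arbitrary, distinct seeds
    size_t I = 0;
    for (; I + 4 <= Size; I += 4) {
      uint32_t V;
      std::memcpy(&V, Data + I, 4);
      A = _mm_crc32_u32(A, V);
      B = _mm_crc32_u32(B, V);
    }
    for (; I < Size; ++I) { // leftover bytes
      A = _mm_crc32_u8(A, Data[I]);
      B = _mm_crc32_u8(B, Data[I]);
    }
    return ((uint64_t)A << 32) | B;
  }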

Also providing timings for a 2.3 GHz, 36-core Intel Xeon Skylake:

Dataset 2, large EXE, the resulting PDB is 2 GB:

(trunk) lld-link ...                                                                 Type Merging: 34582 ms ( 52.7%)   Total Link Time: 65572 ms (100.0%)
(this patch) lld-link ... /hasher:sha1 (default)  Global hashing:  6910 ms ( 14.8%)  Type Merging: 12573 ms ( 26.9%)   Total Link Time: 46768 ms
(this patch) lld-link ... /hasher:md5             Global hashing:  5785 ms ( 12.6%)  Type Merging: 12479 ms ( 27.2%)   Total Link Time: 45902 ms
(this patch) lld-link ... /hasher:cityhash        Global hashing:  5153 ms ( 11.5%)  Type Merging: 12008 ms ( 26.8%)   Total Link Time: 44728 ms
(this patch) lld-link ... /hasher:xxhash          Global hashing:  5041 ms ( 10.9%)  Type Merging: 11917 ms ( 25.9%)   Total Link Time: 46057 ms
(this patch) lld-link ... /hasher:crc             Global hashing:  4994 ms ( 11.1%)  Type Merging: 12089 ms ( 26.8%)   Total Link Time: 45076 ms

Even though the global hashing is faster than on the 6-core machine, the single-threaded parts drag the timings down because of the lower CPU frequency.

ruiu added a comment.Jan 14 2019, 4:02 PM

Overall this is a huge patch, and I think it needs many more comments, or a large file comment that gives the big picture of what you are doing. I'm worried that it would not be understandable once the people who wrote or reviewed it have left the project.

lld/trunk/COFF/Driver.cpp
608–613

What is the point of making it selectable by users? It feels to me that you should pick the one that you think is best and just use it.

614

parse() functions should return a value instead of mutating a variable as a side effect.

lld/trunk/COFF/PDB.cpp
96–97

A dependency represented as an "input file" feels like an odd concept. Only an object file and the like should inherit InputFile.

936

Formatting.

lld/trunk/Common/Summary.cpp
20–21

Please don't use llvm::Any. Pass StringRef. If you need, define your own lld::toString to stringize an object you need to pass.

26

Please don't use formatv

42

Please use clang-format to format the entire patch.

rnk added a comment.Jan 14 2019, 4:35 PM

To the light of all this, does it still makes sense to compute and emit GHASH streams in clang? I'm pretty sure that the cost for serialization and I/O for those streams would be much higher that just computing GHASHes on-the-fly in the LLD. The only marginal benefit would be for incrementally linking, however even that is debatable.

I'd definitely consider it. At the very least, with this data, I don't think it ever makes sense to enable -gcodeview-ghash by default in clang-cl. Looking at your data, it looks like pre-computing hashes into the obj would still improve link time by 12-20%, and our measurements show it costs 6% object file size. It seems like it could still be worth it.

If you have no major concerns over all this, I'll start sending smaller patches.

Looking forward to them. :)

lld/trunk/COFF/Driver.cpp
608–613

I think it's important to at least have the flexibility for developers to test different hashing schemes. We don't have to document all these flags. Users don't have to know about them, and we can remove them later.

lld/trunk/COFF/PDB.cpp
936

@ruiu This is still a proof of concept patch mostly to express ideas, not quite ready for review.
@aganea There is a bunch of unreachable, break-after-return in this switch that we won't want in the long run.

I thought this was demo code, but I was confused by rnk's comment about the hash flag below, so replying to that.

lld/trunk/COFF/Driver.cpp
608–613

Having this while working on the patch makes sense, but the experiments show that xxhash is fast enough, so we wouldn't want to check this flag in, right? (Nor an implementation of e.g. cityhash just for this flag, given that cityhash doesn't have any advantages over hashes whose code we already have in tree.)

dmajor added a subscriber: dmajor.Jan 31 2019, 11:18 AM
Herald added a project: Restricted Project.Feb 25 2019, 6:29 AM
Herald added a subscriber: jdoerfert.
aganea planned changes to this revision.Mar 14 2019, 11:49 AM
aganea retitled this revision from RFC: [LLD][COFF] Parallel GHASH generation at link-time to RFC: [LLD][COFF] Parallel GHASH generation at link-time -- NOT FOR SUBMIT.
santagada added inline comments.Nov 9 2019, 3:52 PM
llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h
78

I know this patch is old, but I was reading it again and I don't understand how you reconstruct the key if there was a collision on the bucket while inserting... since the bucket number will not be the same bits that were stripped when the key got packed.

In other words: LookupBucketFor starts looking for an empty bucket at the bits extracted from the key, but might have probed ahead of that position by the time we insert into a bucket. At that point it seems that any other search for the key in the map will fail, as the key recomputed from that bucket doesn't have the correct value where the bucketmask is.

Also, the bucketmask needs to be all the lower bits of the key, otherwise there is information loss.

Or maybe I'm just very confused about how this packing - half the key as the bucket number and half in the KV value - actually works.

aganea marked an inline comment as done.May 9 2020, 8:10 AM
aganea added inline comments.
llvm/trunk/include/llvm/DebugInfo/CodeView/GlobalTypeDenseMap.h
78

You're right, collision is the biggest challenge here. I ran with this for a while, with both the old map and this new map side-by-side, asserting if there was a divergence, and there was none. In practice, what I am proposing here increases the chances of a key collision, and thus of a hash collision in the table. However, even with very large inputs, in the range of 1 billion type records from .OBJs, I couldn't see a single collision. That doesn't mean it couldn't happen, though :-)

I wanted to do this to make reading & writing a bucket an atomic operation. I'm not very comfortable with this change yet; I think in the long run it'd be better to rely on 128 bits of data per bucket (64 bits for the key and 64 bits for the index) and do two atomic operations, which makes things a bit more challenging for writing a truly lock-free and wait-free hash table that can resize.
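
For illustration, a minimal sketch of the single-word atomic insert that an 8-byte bucket enables (CAS on an empty slot with linear probing); resizing and the key-reconstruction issue discussed above are deliberately left out, and the names are placeholders rather than the patch's actual code:

  #include <atomic>
  #include <cstdint>
  #include <memory>

  // One 64-bit word per bucket lets insertion be a single compare-exchange.
  // EmptyBucket is a sentinel; real code must ensure no packed value can
  // collide with it.
  class AtomicGHashTable {
    static constexpr uint64_t EmptyBucket = 0;
    std::unique_ptr<std::atomic<uint64_t>[]> Buckets;
    uint64_t Mask;

  public:
    explicit AtomicGHashTable(uint64_t PowerOfTwoSize)
        : Buckets(new std::atomic<uint64_t>[PowerOfTwoSize]()),
          Mask(PowerOfTwoSize - 1) {}

    // Returns the winning packed value: ours, or a concurrent inserter's.
    uint64_t insert(uint64_t Hash, uint64_t PackedValue) {
      for (uint64_t I = Hash & Mask;; I = (I + 1) & Mask) {
        uint64_t Cur = Buckets[I].load(std::memory_order_acquire);
        if (Cur == EmptyBucket &&
            Buckets[I].compare_exchange_strong(Cur, PackedValue,
                                               std::memory_order_acq_rel))
          return PackedValue; // we claimed the empty slot
        if (Cur == PackedValue)
          return Cur; // another thread (or a prior call) already inserted it
        // Slot occupied by a different value: keep probing linearly.
      }
    }
  };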