This is an archive of the discontinued LLVM Phabricator instance.

[PDB] Optimize public symbol processing
ClosedPublic

Authored by rnk on May 5 2020, 7:48 PM.

Details

Summary

Reduces time to link PGO instrumented net_unittets.exe by 11% (9.766s ->
8.672s, best of three). Reduces peak memory by 65.7MB (2142.71MB ->
2076.95MB).

Use a more compact struct, BulkPublic, for faster sorting. Sort in
parallel. Construct the hash buckets in parallel. Try to use one vector
to hold all the publics instead of copying them from one to another.
Allocate all the memory needed to serialize publics up front, and then
serialize them in place in parallel.

Diff Detail

Event Timeline

rnk created this revision.May 5 2020, 7:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2020, 7:48 PM
hans added a comment.May 6 2020, 7:10 AM

This looks good as far as I can tell. Is there good test coverage here, e.g. if there's an off-by-one somewhere, would it be caught?

lld/COFF/PDB.cpp
1715

Before it ran at the end of addObjectsToPDB(). Does the order matter?

rnk marked 2 inline comments as done.May 6 2020, 7:47 AM

This looks good as far as I can tell. Is there good test coverage here, e.g. if there's an off-by-one somewhere, would it be caught?

I will say that the existing tests caught many issues, and that gives me some confidence in this. All the hash bucket table stuff is pretty well tested.

lld/COFF/PDB.cpp
1715

My aim was to improve cache locality, but I'm not sure it matters. Previously we would add and sort the publics, then add the section contributions (slow, flushes cache), then commit the publics to disk. I figured this order would be better. In isolation, I don't think I was able to measure a difference above the noise though. In any case, the publics aren't really part of adding objects.

llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
123

Speaking of undertested things, these might need to be LF_PAD bytes. I will double check.

aganea added a comment.May 6 2020, 8:39 AM

Can you diff the full output of microsoft-pdb\cvdump\cvdump.exe your_exe.pdb and llvm-pdbutil dump -all your_exe.pdb before and after this patch? (with any large exe). This can be a bit tedious because the text dump is very large, but you can at least validate that things are still right (you're probably already doing this).

llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
213

Maybe provide a link to the reference implemention repo? (ie.https://github.com/microsoft/microsoft-pdb)

rnk marked an inline comment as done.May 6 2020, 5:35 PM

Can you diff the full output of microsoft-pdb\cvdump\cvdump.exe your_exe.pdb and llvm-pdbutil dump -all your_exe.pdb before and after this patch? (with any large exe). This can be a bit tedious because the text dump is very large, but you can at least validate that things are still right (you're probably already doing this).

There is a diff in the globals stream due to different global hash bucket sorting. An example hunk looks like this:

$ diff -u lld*-pubs.txt | head -40
--- lld1-pubs.txt       2020-05-06 17:25:41.132778100 -0700
+++ lld2-pubs.txt       2020-05-06 17:25:37.352925100 -0700
@@ -197,11 +197,11 @@
            original type = 0x111F9D
   22547392 | S_UDT [size = 12] `HDC`
            original type = 0xC5A17
-  23582124 | S_GDATA32 [size = 28] `__newclmap`
+  23513760 | S_GDATA32 [size = 28] `__newclmap`
            type = 0x11445B (), addr = 0002:3648064
   23517196 | S_GDATA32 [size = 28] `__newclmap`
            type = 0x114534 (), addr = 0002:3648064
-  23513760 | S_GDATA32 [size = 28] `__newclmap`
+  23582124 | S_GDATA32 [size = 28] `__newclmap`
            type = 0x11445B (), addr = 0002:3648064
   11925196 | S_CONSTANT [size = 24] `SHRD32rri8`
            type = 0x24E8D (llvm::X86::<unnamed-tag>), value = 2704
@@ -1595,10 +1595,10 @@

The original code uses llvm::sort with gsiRecordLess, which is unstable, although deterministic. I modified the code to sort by symoffset to stabilize the output, but the new stable order doesn't match the old unstable order.

Other than that, all the streams and PDBs have the same size.

I checked the padding, and I believe the old publics are padded with zero bytes:
https://github.com/llvm/llvm-project/blob/01fc85dc9618394868b795c5087d9da03df9c58b/llvm/lib/DebugInfo/CodeView/SymbolRecordMapping.cpp#L42

llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
213

I put one in the file header comment like the other files in here.

rnk updated this revision to Diff 262510.May 6 2020, 5:35 PM
  • stabilize sort by symoffset, comments

Seems good to me, just a few things:

lld/COFF/PDB.cpp
1346

Maybe reserve in advance to avoid reallocations?

unsigned count{};
symTab->forEachSymbol([](Symbol *s) {
  auto *def = dyn_cast<Defined>(s);
  count += (def && def->isLive() && def->getChunk());
});
publics.reserve(count);

Do you think there would be a gain to do the creation in parallel? (and do .resize instead in that case)

llvm/include/llvm/DebugInfo/PDB/Native/GSIStreamBuilder.h
60

I know the call will be optimized anyway, but shouldn't we clarify the intention of transferring the ownership? std::vector<BulkPublic> &&Publics (just to prevent accidents in the future, if someone removes std::move at the caller site during a refactoring)

llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
72

Same thing here: std::vector<BulkPublic> &&

196

finalizeBuckets(RecordZeroOffset, std::move(Globals))

200

std::vector<BulkPublic> &&Globals

203

Nice!

I am wondering if we shouldn't do union { uint16_t Segment; uint16_t BucketIdx; }; to make things a bit more explicit. Up to you!

215

...and in that case, do we need these two assignements?

if (L.BucketIdx != R.BucketIdx)
  return L.BucketIdx < R.BucketIdx;
342

// parallelSort is unstable, so we have to do name

352

Do you think we can avoid .push_back altogether, and simplify to:

PubAddrMap.resize(Publics.size());
ulittle32_t *Next = &PubAddrMap.front();
for (const BulkPublic &Pub : Publics)
  *Next++ = Pub.SymOffset;
rnk updated this revision to Diff 262871.May 8 2020, 7:24 AM
rnk marked 10 inline comments as done.
  • address comments
lld/COFF/PDB.cpp
1346

This loop was relatively short compared to what comes next. I think that is because it doesn't touch the name strings. I think we'd lose more from checking all symbols twice than we spend copying and growing the vector.

There's more to be gained from parallelizing section contribution creation or the main object file processing, so I plan to look at that instead.

llvm/include/llvm/DebugInfo/PDB/Native/GSIStreamBuilder.h
60

Sure, that's a good idea.

llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
203

I'll do it. I went the other way while working on the code because the struct is declared in a header, so it affects the PDB.cpp code. We also can't use a nameless union field because I believe that is an extension. The zero initialization also becomes complicated.

352

I think this is better because it avoids the zero-initialization of resize. This loop doesn't show up in the profile, FWIW.

aganea accepted this revision.May 8 2020, 7:53 AM

Looks good, thanks!

This revision is now accepted and ready to land.May 8 2020, 7:53 AM
rnk added a comment.May 8 2020, 10:23 AM

This is the (poorly annotated in paint) profile that I'm looking at, BTW:

The big remaining costs are input reading and the PDB processing of each object file. I think both operations really depend on having a good, generic, parallel primitive for building a map.

To build up the symbol table, we maintain a map from symbol name to symbol definition. To process debug info, we need to build a map from type record (ghash, really) to dst type index, and then symbols can be processed in parallel. If you have any pending patches in this area, please remind me, I've forgotten the state of things.

Unfortunately, there isn't an obviously best algorithm for building a map in parallel. One way would be to do what I did for publics: put all external symbols in a vector, sort it in parallel, remove the duplicates, but this is more work overall in the single-threaded case (O(n log n), presumably with a high constant factor to move large objects around) than our simple sequential hash table insertion.

We could do a map-reduce style hash & bucket computation: hash keys in parallel, divide key space equally among workers, each worker builds a map, merge each map. We could *try* to share the table since we could arrange the workers to insert into distinct buckets, but this requires careful handling of re-hashing or estimating the table size up front.

This seems like it would be a well-studied problem, but my initial searches haven't gotten me anywhere that I like yet. Apologies if I'm retreading something you've already researched, I vaguely recall a super-parallel ghash implementation.

This revision was automatically updated to reflect the committed changes.
aganea added a comment.EditedMay 8 2020, 11:50 AM

Yes, I wanted to get back to that GHash parallelization at some point, but I'm swamped in deploying Clang on production and shipping some of our games. I was planning to get back to that eventually: D55585 -- However the plan was to first move all Type-related code from PDB.cpp to DebugTypes.cpp, to ease things a bit: D59226 -- this still needs to be completed (steps 5-7).

D55585 was only parallelizing the hashing, not the type merging. It would be possible however to do hashing in parallel, without dividing the keyspace, however I'm not sure yet what is the best strategy. Internally at Ubisoft we have a lock-free hashmap which is being used for the past 12 years, it's very stable and well tested. We will be happy to open source it or reimplement it in LLVM. However it is lock-free, not wait-free, so I wanted to try it first (on the type merging).

The other strategy I was pursuing was a lock-free and wait-free hashmap. But that requires atomic operations, with fixed 64-bit (or 128-bit) buckets (key+value). A 2x 64-bit is also doable I suppose, if the target architecture doesn't have 128-bit atomics. The big problem then is resizing the hashmap, but there again, there could be a lock-free solution, by re-hashing in a thread while the other threads are still inserting in the old hashmap. It's tricky, but there's prior art in this domain.
I think this solution would scale better that our current lock-free hashmap, which requires spinlocks when inserting nodes. That can potentially give back time slices to the kernel in the form of Sleep() or SuspendThread(), and that is a bad thing IMHO. I don't see that scaling well past hunderd-core mark; only a atomic hashmap could work in my sense (if we want to plan ahead for the future decade).

Another subject is how to scale this kind of algorithm across NUMA nodes. Any operation that crosses the CPU socket or NUMA boundary is very expensive. This maybe requires a latent strategy, where operation could be synchonized in bluk, not independelty, across NUMA boundaries. Again, this is a hot topic in my sense, and I don't know if there's active research there (aside of decades-long MPI knowledge in the super computing world, which could apply maybe). If we build today a parallel Type merging which won't scale well in two years on the future EPYC, that would be a pity. I don't know, maybe it's worth doing it until it doesn't scale anymore?

If you feel like modifying or landing any of the patches above, feel free! If not, I'll eventually get back to them.

rnk added a comment.May 8 2020, 2:51 PM

Right, I remember actually reviewing D59226, and then thinking that it had landed. I will look into rebasing it, thanks for the reminder.

@akhuang, I was thinking of the TpiSource class in that patch when I was saying that there should be an easy way to merge in types from dependent DLLs.

Speaking of which, @aganea, since you are using clang to compile now, you should try adding -Xclang -debug-info-kind=constructor if you haven't already. It greatly reduces the amount of duplicate type info that clang emits.

Regarding the lock free hash table, I guess it's the only solution in the long run. I think there is still low hanging fruit before we get there. For me, the bottleneck is no longer type merging, it's symbol merging, which can be trivially parallelized. So, if we do things like:

  • sequentially for all obj, run mergeDebugT
  • parallel for all obj, merge .debug$S using type index maps
aganea added a comment.May 8 2020, 3:17 PM
In D79467#2027736, @rnk wrote:

Speaking of which, @aganea, since you are using clang to compile now, you should try adding -Xclang -debug-info-kind=constructor if you haven't already. It greatly reduces the amount of duplicate type info that clang emits.

We were discussing about that today. Is there any drawback for doing so?

rnk added a comment.May 8 2020, 3:44 PM
In D79467#2027736, @rnk wrote:

Speaking of which, @aganea, since you are using clang to compile now, you should try adding -Xclang -debug-info-kind=constructor if you haven't already. It greatly reduces the amount of duplicate type info that clang emits.

We were discussing about that today. Is there any drawback for doing so?

There could be missing type info. If you use prebuilt third party libraries that were built without debug info, then it is likely that those types will be missing in the final PDB, even if all the type info is exposed in public headers that you compile.

However, this was already an issue with clang's default type info mode. The new flag moves the heuristic from the vtable to constructors, so it could make the situation worse.

You can work around this kind of issue by picking one object to compile with -fstandalone-debug, and then put static_assert(sizeof(DesiredType) >= 0, "require complete type"); in it somewhere. It might be nice to have a better way to say that without adjusting compiler flags, though.

abhinavgaba added inline comments.
llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
93

This is causing the following warnings:

ignoring packed attribute because of unpacked non-POD field ‘llvm::codeview::RecordPrefix {anonymous}::PublicSym32Layout::Prefix’
   RecordPrefix Prefix;
                ^
...
ignoring packed attribute because of unpacked non-POD field ‘llvm::codeview::PublicSym32Header {anonymous}::PublicSym32Layout::Pub’ 
   PublicSym32Header Pub;
                     ^
uabelho added inline comments.
llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
93

We see this too, we're using gcc 7.4.0 when compiling clang.

We did some small experiments and the struct does get different sizes when compiled with clang vs gcc which might be a problem?

With LLVM_PACKED_START/LLVM_PACKED_END around PublicSym32Layout instead of LLVM_PACKED the struct seems to get packed both with clang and gcc so perhaps that's a solution?

uabelho added inline comments.May 13 2020, 3:01 AM
llvm/lib/DebugInfo/PDB/Native/GSIStreamBuilder.cpp
93

Btw seems like someone already noticed that clang and gcc sometimes behaves differently regarding the packed attribute:
https://bugs.llvm.org/show_bug.cgi?id=28571#c3

rnk added a comment.May 13 2020, 2:27 PM

Hopefully rG409274274 addresses the GCC warnings. I didn't get a chance to locally repro and confirm that the fix works.

In D79467#2035176, @rnk wrote:

Hopefully rG409274274 addresses the GCC warnings. I didn't get a chance to locally repro and confirm that the fix works.

It's silent for me now at least. Thanks!