This is an archive of the discontinued LLVM Phabricator instance.

[PDB] Add symbol records in bulk
ClosedPublic

Authored by rnk on Nov 14 2018, 4:34 PM.

Details

Summary

This speeds up linking clang.exe/pdb with /DEBUG:GHASH by 31%, from
12.9s to 9.8s.

Symbol records are typically small (16.7 bytes on average), but we
processed them one at a time. CVSymbol is a relatively "large" type. It
wraps an ArrayRef<uint8_t> with a kind and an optional 32-bit hash, which we
don't need. Before this change, each DbiModuleDescriptorBuilder would
maintain an array of CVSymbols, and would write them individually with a
BinaryItemStream.

With this change, we now add runs of contiguous symbol records in bulk. For
each .debug$S section (roughly one per function), we allocate two copies:
one for relocation and one for realignment purposes. For runs of symbols
that go in the module stream, which is most of them, we now add a single
ArrayRef<uint8_t>, so the vector in DbiModuleDescriptorBuilder grows roughly
linearly with the number of .debug$S sections (O(# funcs)) instead of the
number of symbol records (very large).
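
As a simplified illustration of that bookkeeping (this is a sketch, not the actual DbiModuleDescriptorBuilder interface; the class and member names here are invented), the bulk path boils down to storing one chunk per contiguous run:

// Simplified sketch only: the real builder is DbiModuleDescriptorBuilder in
// llvm/DebugInfo/PDB/Native; names here are illustrative stand-ins.
#include "llvm/ADT/ArrayRef.h"
#include <cstdint>
#include <vector>

class BulkSymbolSinkSketch {
  // One entry per contiguous run of records, so the vector grows with the
  // number of .debug$S sections (O(# funcs)) rather than with the number of
  // individual symbol records.
  std::vector<llvm::ArrayRef<uint8_t>> SymbolChunks;

public:
  void addSymbolsInBulk(llvm::ArrayRef<uint8_t> Run) {
    if (Run.empty()) // early exit, as suggested in the review below
      return;
    SymbolChunks.push_back(Run);
  }
};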

Some stats on symbol sizes for the curious:

PDB size: 507M
sym bytes: 316,508,016
sym count:  18,954,971
sym byte avg: 16.7

As future work, we may be able to skip copying symbol records in the
linker for realignment purposes if we make LLVM write them aligned into
the object file. We need to double check that such symbol records are
still compatible with link.exe, but if so, it's definitely worth doing,
since my profile shows we spend 500ms in memcpy in the symbol merging
code. We could potentially cut that in half by saving a copy.
Alternatively, we could apply the relocations *after* we iterate the
symbols. This would require some careful re-engineering of the
relocation processing code, though.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

rnk created this revision.Nov 14 2018, 4:34 PM

Very nice! You're talking about /DEBUG:GHASH, but is your optimisation also beneficial when GHASH isn't used?

lld/COFF/PDB.cpp
1051 ↗(On Diff #174124)

What about:

unsigned Aligned = alignTo(Sym.length(), alignOf(CodeViewContainer::Pdb));
NeedsRealignment |= Aligned != Sym.length();
RealignedSize += Aligned;
1088 ↗(On Diff #174124)

Quick question here - if all symbols are already aligned in the source stream, aren't they already sequentially (linearly) arranged in memory? Couldn't we just set BulkSymbols once in that case and skip lines 1132-1138? This already applies to the NeedsRealignment case.

1133 ↗(On Diff #174124)

In the same way we have ArrayRef<>::drop_front(), an extend() or similar function would be nice:

BulkSymbols = BulkSymbols.extend(Sym.length());

However - this depends on the answer to comment at line 1088.
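
For reference, such a helper might look like the sketch below (hypothetical; ArrayRef has no extend() today). It silently assumes the bytes past the end of the ArrayRef belong to the same underlying buffer, which is part of why the idea gets pushback further down.

// Hypothetical helper, not an existing LLVM API: grow an ArrayRef forward by
// Extra bytes. Only safe if the underlying buffer really extends that far.
#include "llvm/ADT/ArrayRef.h"
#include <cstddef>
#include <cstdint>

static llvm::ArrayRef<uint8_t> extendRef(llvm::ArrayRef<uint8_t> Ref,
                                         size_t Extra) {
  return llvm::ArrayRef<uint8_t>(Ref.data(), Ref.size() + Extra);
}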

1144 ↗(On Diff #174124)

Remove this if if you're early exiting below in addSymbolsInBulk().

llvm/include/llvm/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.h
95 ↗(On Diff #174124)

In a subsequent change, you could count how many symbol subsections there are in a file and .reserve() this vector in advance.

llvm/lib/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.cpp
67 ↗(On Diff #174124)

This function should probably be removed in the short term. Other places using it would also benefit from adding symbols in bulk.

73 ↗(On Diff #174124)

Early exit if empty? It looks like the first time you're calling this function from mergeSymbolRecords() the collection is empty.

rnk added a comment.Nov 15 2018, 10:30 AM

Very nice! You're talking about /DEBUG:GHASH, but is your optimisation also beneficial when GHASH isn't used?

It should be. I'd expect to get the same absolute 3-second improvement for non-ghash links, which are currently at around 20s, so as a percentage the improvement is smaller. I haven't measured, so there could be other cache effects that make the improvement smaller or larger. I was mainly using ghash because it makes the symbol rewriting stand out more in the profile, and it means I can iterate and re-profile more quickly.

I think the improvement is probably less pronounced in debug builds, since they should have fewer small records, which I think mostly come from inlined call sites and their variables.

lld/COFF/PDB.cpp
1051 ↗(On Diff #174124)

I like it. I think, at the time, I was thinking, "NeedsRealignment is captured by reference, I wouldn't want to unconditionally store to memory every iteration", but that's irrelevant since after profiling I saw that this whole forEachCodeViewRecord loop helper template gets inlined.

1088 ↗(On Diff #174124)

Not quite. Not all symbols in the .debug$S section get copied to the module dbi stream. See symbolGoesInModuleStream for the list of records that don't get copied. In practice, the common ones are S_GDATA32 and S_UDT. I think S_UDT can appear locally scoped inside of a function, which means any .debug$S section can be discontiguous, but in practice, all S_UDT records are packed together at the beginning or end.

My plan was to first measure if aligning the symbols in LLVM, thereby avoiding the copy below, is a big win. If it is, then this up front traversal of all the symbols will often be unnecessary. I was going to try disabling it and setting NeedsRealignment to false, and then checking if that's a measurable performance win. If it is, then we really want to specialize for two cases: 1. object file has pre-aligned symbols, 2. object file needs realignment. I can imagine having one loop that does the work, validates that each record is aligned, fails if it isn't, the caller then copies and realigns the rest of the symbols, and restarts the loop from where it failed.
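
As a concrete (and purely hypothetical) sketch of the fast-path check in that plan, a helper like the one below could report where copy-and-realign would have to take over. It assumes the standard CodeView record prefix of a 2-byte length (which excludes the length field itself) followed by a 2-byte kind, and a little-endian host; the function name is made up, not from the patch.

// Hypothetical sketch: walk length-prefixed CodeView symbol records and
// return the offset of the first record whose total size is not a multiple
// of 4, i.e. where the slow copy-and-realign path would need to resume.
#include <cstddef>
#include <cstdint>
#include <cstring>

static size_t alignedPrefixBytes(const uint8_t *Data, size_t Size) {
  size_t Off = 0;
  while (Off + 2 * sizeof(uint16_t) <= Size) {
    uint16_t Len;
    std::memcpy(&Len, Data + Off, sizeof(Len)); // record length (little-endian)
    size_t RecordSize = size_t(Len) + sizeof(uint16_t); // prefix excludes itself
    if (RecordSize % 4 != 0)
      return Off; // restart the realignment loop from here
    Off += RecordSize;
  }
  return Off;
}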

1133 ↗(On Diff #174124)

I think joining two adjacent ArrayRefs is a pretty corner-case use case. Usually we slice big things up into small pieces. Going the other way around is scary. I'm not sure I want to add an API to encourage it. :)

llvm/include/llvm/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.h
95 ↗(On Diff #174124)

True. If we did that, it'd probably be best to have LLD manage the std::vector, and then pass ownership of it over to the DbiModuleDescriptorBuilder with std::move.

llvm/lib/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.cpp
67 ↗(On Diff #174124)

Maybe, but I think it's out of scope for this change. yaml2pdb uses this, and there are a handful of other callers that use this pattern:

ObjNameSym ONS(SymbolRecordKind::ObjNameSym);
Compile3Sym CS(SymbolRecordKind::Compile3Sym);
EnvBlockSym EBS(SymbolRecordKind::EnvBlockSym);
// ... fill them in ...
Mod.addSymbol(codeview::SymbolSerializer::writeOneSymbol(
    ONS, Allocator, CodeViewContainer::Pdb));
Mod.addSymbol(codeview::SymbolSerializer::writeOneSymbol(
    CS, Allocator, CodeViewContainer::Pdb));
Mod.addSymbol(codeview::SymbolSerializer::writeOneSymbol(
    EBS, Allocator, CodeViewContainer::Pdb));

Perhaps we should have some kind of symbol serializer that encapsulates the pattern I'm using. One of the other things we do a lot of is accumulate global S_PUB32, S_SECTION, and other global symbols. I guess I didn't do it that way because I liked the idea of estimating the amount of memory we need for symbols up front.

73 ↗(On Diff #174124)

Good idea.

zturner added inline comments.Nov 16 2018, 11:12 AM
lld/COFF/PDB.cpp
813 ↗(On Diff #174124)

Why did this change from 4 to 8?

1007–1008 ↗(On Diff #174124)

ModIndex should maybe be a uint16_t since that's what it is in the underlying file. We just cast it below anyway, so we might as well remove any chance of truncation by having the function take a uint16_t directly.

1070 ↗(On Diff #174124)

Is the 4 here supposed to be the alignment? How about alignOf(CodeViewContainer::Pdb)?

1078 ↗(On Diff #174124)

I feel like this is doable in fewer than two passes.

What if we kept a vector of offsets into the SymData buffer, where the offsets represent the positions of records that require an alignment adjustment? If you get through the loop and the vector is empty, that is the same as NeedsRealignment == false.

As you iterate the records, you remap the type indices in the original buffer, which may still require alignment, but you don't actually do any re-alignment yet. For each record that does require realignment, you push its offset onto the vector. Note that you can collapse adjacent alignments: for example, if Record 1 is at offset 3 mod 4 and Record 2 is also at offset 3 mod 4, then only one re-alignment is required, because as soon as you realign the output buffer to 0 mod 4, both records are re-aligned.

So you're basically pushing offsets onto the vector when offset % 4 changes.

At the end of this process, all remapping has been done, but no re-alignment has occurred.

If the offset vector is empty, you're done. Just return, and you only made 1 pass over the records.

If the offset vector is not empty, you walk it, and for each offset you add Offset[n] - Offset[n-1] bytes of padding, then copy in bulk up to the next offset.

Now, in the best case scenario, you only have 1 pass over the inputs, but in the worst case scenario, you don't have O(2n), you have O(n+m) where m < n.
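
A rough, self-contained sketch of that batching idea is below. It is illustrative only: it takes pre-parsed (offset, length) pairs, assumes the records are contiguous in the source buffer, and pads with zero bytes, whereas the real code would also have to account for the record length prefix when inserting padding. All names are invented.

// Illustrative sketch of the proposal above, not lld code. Runs of records
// that stay 4-byte aligned in the output share one bulk copy; only offsets
// where the misalignment changes force a flush plus padding.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

static void copyRealigned(const uint8_t *Src,
                          const std::vector<std::pair<size_t, size_t>> &Recs,
                          std::vector<uint8_t> &Out) {
  size_t RunBegin = Recs.empty() ? 0 : Recs.front().first;
  size_t RunLen = 0;
  auto Flush = [&] {
    Out.insert(Out.end(), Src + RunBegin, Src + RunBegin + RunLen);
    RunLen = 0;
  };
  for (auto [Off, Len] : Recs) {
    if ((Out.size() + RunLen) % 4 != 0) {
      // Misalignment point: one bulk copy for the run so far, then pad the
      // output back to 0 mod 4 (zero padding here for simplicity) and start
      // a new run at this record.
      Flush();
      Out.resize((Out.size() + 3) & ~size_t(3), 0);
      RunBegin = Off;
    }
    RunLen += Len; // extend the current run; records are contiguous in Src
  }
  Flush();
}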

rnk marked 2 inline comments as done.Nov 16 2018, 2:47 PM
rnk added inline comments.
lld/COFF/PDB.cpp
813 ↗(On Diff #174124)

I changed it to take the entire record including the 4 byte prefix, which made more sense to me. It simplifies the caller.

1070 ↗(On Diff #174124)

It is. Would you be open to coming up with a less noisy way of writing that, like (codeview::)SymbolAlignmentForPDB?

1078 ↗(On Diff #174124)

I like the idea of remapping in place in the first pass and doing the realignment as an optional second pass; that's a good improvement. This is also similar to what I described below, starting with "My plan". There are basically two cases:

  1. every record is aligned (not implemented yet)
  2. most records are unaligned

Your proposal requires memcpying every unaligned record in a second pass, but it uses a temporary vector on the side to hold the offsets. I don't think the vector is needed; we can just iterate again with forEachCodeViewRecord.

Basically, the plan is to combine the two loops that we have here, and do a second pass to realign. The drawback is that we have to "filter" records into global and module streams during the second pass if it runs, which means we're doing "wasted" work on the first pass if the object file isn't prealigned.

I don't want to implement this optimization in this patch. I'd rather combine it with an LLVM change to pre-align the records, do what we've described here, and benchmark that by itself to see if it's worth the complexity.

What we have here with the up-front pass is probably fastest for unaligned symbols, so IMO we should stick with it for now.

zturner added inline comments.Nov 16 2018, 2:59 PM
lld/COFF/PDB.cpp
1078 ↗(On Diff #174124)

Well, the whole point was to avoid calling forEachCodeViewRecord twice, because a) it's slow, since you can't know where record N+1 is until you've parsed record N, and b) there are more CodeView records than there are unaligned records. You could imagine a .debug$S section with, say, 100 records in it, only one of which is unaligned. So if you call forEachCodeViewRecord twice, that's 200 iterations (iterate all 100, then iterate all 100 again), whereas with my proposal it's at most 103 (iterate all 100 and do remappings, memcpy the first N records, memcpy the (N+1)'th record after alignment, then memcpy the last 100-N-1 records).

rnk added inline comments.Nov 16 2018, 3:18 PM
lld/COFF/PDB.cpp
1078 ↗(On Diff #174124)

a) it's slow since you can't know where record N+1 is until you've parsed record N, and

Unless we start doing things in parallel, I don't think that matters, except for the benefit you mentioned in 'b'.

b) There are more CodeView records than there are unaligned records.

Sure, but by some (probably low) constant factor. For example, variables are described by S_LOCAL/S_DEFRANGE pairs. I'd argue local variable name lengths are effectively random, so 75% of the time an S_LOCAL is going to need realignment, and S_LOCALs are probably the majority of symbol records in a debug build.

So the vector of unaligned-record offsets will still be O(# sym recs), and we might as well save the memory and iterate from the top.

And, to realign, we either have to copy every byte of symbol data to make it contiguous, or it becomes discontiguous, with lots of holes. What I noticed is that the average symbol record is really small, so it's better to do the copy and point to the entire .debug$S section in bulk.

Basically, I don't think there's anything inherently slow about forEachCodeViewRecord.

rnk added a comment.Nov 16 2018, 3:33 PM

The main reservation I have about landing this is whether to make it more generic and abstracted. I kind of feel like the main SymbolSerializer class should be powerful enough to do all these optimizations for us, or maybe some replacement for it.

zturner accepted this revision.Nov 16 2018, 3:58 PM
zturner added inline comments.
lld/COFF/PDB.cpp
1078 ↗(On Diff #174124)

It isn't so much about forEachCodeViewRecord being slow; it's more that doing one large memcpy is faster than doing many small memcpys. So if you can batch up the memcpy calls into the smallest set of calls that is both necessary and sufficient, I think it will give you a good performance gain, and that looking at it in terms of O(2n) being "equivalent" to O(n) is going to be misleading.

That said, I'm OK with deferring that optimization to a later patch, but I do think it's worth trying, and it probably doesn't add much (if any) complexity.

This revision is now accepted and ready to land.Nov 16 2018, 3:58 PM
rnk added inline comments.Nov 19 2018, 1:55 PM
lld/COFF/PDB.cpp
1078 ↗(On Diff #174124)

Fair enough; saying something is a "constant" factor is always pretty lame. 2x faster is only a "constant" of 2, but it means a lot in reality. We'll have to consider it in the next round of optimization.

rnk added a comment.Nov 27 2018, 10:57 AM

In D54802, I noticed that the linker is going to be responsible for synthesizing increasingly more symbol records. I think I'm going to land this as is to get the gains and then try to refactor things to make the general SymbolSerializer abstraction more efficient.

This revision was automatically updated to reflect the committed changes.
rnk added a comment.Dec 17 2018, 3:05 PM

Regarding the alignment optimization, I implemented it, and I only measured a 3% speedup when making clang.pdb (without ghash, I think, since I was lazy and didn't rebuild...). However, I put in stats to show that it saves 433MB of memory allocations, and that's a significant fraction of the overall process memory usage, so it's definitely worth it. I'll post a patch with harder numbers soon.