[pdb] Don't build the entire source file list up front.
ClosedPublic

Authored by zturner on May 4 2017, 10:04 AM.

Details

Summary

I decided to run llvm-pdbdump on a very large (~1.5GB) PDB to try and identify show-stopping performance problems. This is my attempt at fixing the first such problem.

When we load the Dbi Stream, before anyone even tries to access any data from it, we build up a set of information about every compiland. This includes, for each compiland, a list of all files. In this particular PDB file, there were about 85 million files. Note these aren't unique files, but since many two compilands can reference the same file (think headers), the combined list is very large.

There is no point doing this unless the user wants to look at the list of files to begin with, and even in that case there is no point building up a vector. If someone requests file 7 of module 6, we should do everything on the fly: Get the list of file offsets for module 6, take the 7th one, seek to that offset in the file names buffer, and return it.

In other words, for a fixed module with n source files, random access to that module's list of files is already O(1), so there is no point to spend O(n) time constructing a vector just to give you O(1) access, which you already had.

Moreover, precomputing the entire table (as was done before) is O(m*n) where m is the number of modules and n is the average number of source files / module.

This patch changes the up-front cost to O(m). It iterates the list of modules once, because they are variable length structures in a flat buffer. So by iterating it once we can get offsets to the beginning of each record in the buffer, thereby providing random access, thereby giving us future constant-time random access to both modules and files within modules.

There are some more performance improvements that need to be made in follow up patches, but I'll keep them separate.

Diff Detail

Repository
rL LLVM
zturner created this revision.May 4 2017, 10:04 AM
amccarth added inline comments.May 4 2017, 11:09 AM
llvm/include/llvm/DebugInfo/PDB/Native/DbiModuleList.h
89 ↗(On Diff #97840)

It seems ModeIndexArray is not needed? At least, we don't need to keep it around. Perhaps it should just be a local variable where you read it in order to skip past it.

llvm/lib/DebugInfo/PDB/Native/DbiModuleList.cpp
40 ↗(On Diff #97840)

s/They/Their/

118 ↗(On Diff #97840)

The naming here seems misleading. Once you add Filei, it's no longer the _first_ offset. How about:

uint32_t Offset = Modules->ModuleInitialFileIndex[Modi] + Filei;
135 ↗(On Diff #97840)

If I'm following, it seems >= is just defensive and that == should be sufficient. In that case, what if you assert(Filei <= Modules->getSourceFileCount(Modi)) to help catch bugs that the defensiveness would ordinarily hide?

151 ↗(On Diff #97840)

Huh? Couldn't they both be the end of the same module? I don't see where you've ruled that out at this point.

198 ↗(On Diff #97840)

I'm having trouble understanding this comment, for a few reasons:

  1. I'm having trouble figuring out exactly what some of the "it"s refer to.
  2. It says we're not going to use the first array, but, in fact, you save it into a member variable ModIndexArray. If we're not going to use it, why keep it around?
  3. Are you using "modules" to mean more than one thing here? We're talking about compilands here, right?
  4. I don't follow that logic that >64K source files means >64K object files. Does source files include header files? If so, isn't it likely that there are fewer object files than source files?
207 ↗(On Diff #97840)

But the comment above just said we don't use NumSourceFiles, so why sum them up. I assume that's a comment bug.

zturner updated this revision to Diff 97861.May 4 2017, 11:53 AM

Fixed issues pointed out by amccarth.

zturner marked 7 inline comments as done.May 4 2017, 11:55 AM
zturner added inline comments.
llvm/lib/DebugInfo/PDB/Native/DbiModuleList.cpp
151 ↗(On Diff #97840)

Good catch. I think this newer logic is much easier to follow.

198 ↗(On Diff #97840)

The comment was actually all kinds of wrong, hopefully it's more clear now.

207 ↗(On Diff #97840)

It was confusingly referring to NumSourceFiles inside of the header. Fixed.

216–219 ↗(On Diff #97861)

This comment was also wrong btw, so it's also fixed.

zturner updated this revision to Diff 97864.May 4 2017, 12:09 PM
zturner marked 3 inline comments as done.

*Still* got the isCompatible check wrong. I think this is right now.

amccarth accepted this revision.May 4 2017, 12:48 PM

Thanks for the changes. Those make it much easier to understand.

This revision is now accepted and ready to land.May 4 2017, 12:48 PM
This revision was automatically updated to reflect the committed changes.