I decided to run llvm-pdbdump on a very large (~1.5GB) PDB to try to identify show-stopping performance problems. This is my attempt at fixing the first such problem.
When we load the Dbi Stream, before anyone even tries to access any data from it, we build up a set of information about every compiland. This includes, for each compiland, a list of all of its files. In this particular PDB file, there were about 85 million files. Note these aren't unique files: since two compilands can reference the same file (think headers), the combined list is very large.
There is no point doing this unless the user wants to look at the list of files to begin with, and even in that case there is no point building up a vector. If someone requests file 7 of module 6, we should do everything on the fly: Get the list of file offsets for module 6, take the 7th one, seek to that offset in the file names buffer, and return it.
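To make the on-the-fly path concrete, here is a minimal sketch of that lookup. The types and names (ModuleFileOffsets, FileNamesBuffer, getFileName) are hypothetical, not the real llvm-pdbdump API; they only illustrate the shape of the data already present in the stream.

  // Hypothetical structures, not the real llvm-pdbdump types: just enough to
  // show the constant-time, on-the-fly lookup described above.
  #include <cstdint>
  #include <string>
  #include <vector>

  struct ModuleFileOffsets {
    // For one module: offsets of its file names within the shared names buffer.
    std::vector<uint32_t> NameOffsets;
  };

  struct FileNamesBuffer {
    // Flat buffer of null-terminated file names shared by all modules.
    std::string Data;
    std::string getNameAt(uint32_t Offset) const {
      if (Offset >= Data.size())
        return std::string(); // Malformed offset; return an empty name.
      return std::string(Data.c_str() + Offset); // Reads up to the next '\0'.
    }
  };

  // "File F of module M": two array indexes plus a seek into the names buffer.
  // Nothing here requires materializing a per-module vector of strings.
  std::string getFileName(const std::vector<ModuleFileOffsets> &Modules,
                          const FileNamesBuffer &Names, size_t ModuleIndex,
                          size_t FileIndex) {
    uint32_t Offset = Modules[ModuleIndex].NameOffsets[FileIndex];
    return Names.getNameAt(Offset);
  }

The lookup cost is independent of how many files the module has.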
In other words, for a fixed module with n source files, random access to that module's list of files is already O(1), so there is no point in spending O(n) time constructing a vector just to provide the O(1) access you already had.
Moreover, precomputing the entire table (as was done before) is O(m*n), where m is the number of modules and n is the average number of source files per module.
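To put a hypothetical number on it: if the ~85 million file references above came from, say, 10,000 modules averaging 8,500 references each, the old code would build a table of m*n = 10,000 * 8,500 = 85,000,000 entries up front, before anyone has asked for a single file name.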
This patch changes the up-front cost to O(m). We still iterate the list of modules once, because they are variable-length structures in a flat buffer, but that single pass only records the offset of each record in the buffer. Those offsets then give us constant-time random access to the modules and, through each module's file offset list, to the files within it.
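A rough sketch of that single pass, assuming a hypothetical ModuleRecordHeader whose RecordLength field gives the total size of each variable-length record (this is not the real DBI module-info layout):

  // Hypothetical record header, not the real DBI format: assume each
  // variable-length record starts with its total length in bytes.
  #include <cstdint>
  #include <cstring>
  #include <vector>

  struct ModuleRecordHeader {
    uint32_t RecordLength; // Total size of this record, including the header.
  };

  // Single O(m) pass over the flat buffer of variable-length module records,
  // recording where each record begins so later accesses are constant time.
  std::vector<uint32_t> buildModuleOffsetIndex(const uint8_t *Buffer,
                                               uint32_t BufferSize) {
    std::vector<uint32_t> Offsets;
    uint32_t Pos = 0;
    while (Pos + sizeof(ModuleRecordHeader) <= BufferSize) {
      ModuleRecordHeader Header;
      std::memcpy(&Header, Buffer + Pos, sizeof(Header));
      if (Header.RecordLength < sizeof(Header) ||
          Header.RecordLength > BufferSize - Pos)
        break; // Guard against malformed input.
      Offsets.push_back(Pos);
      Pos += Header.RecordLength;
    }
    return Offsets;
  }

With the offsets recorded, module i begins at Buffer + Offsets[i], so both the module header and its list of file offsets can be read lazily when someone actually asks for them.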
There are some more performance improvements that need to be made, but I'll keep them to separate follow-up patches.