This is an archive of the discontinued LLVM Phabricator instance.

[CodeView] Add an amortized O(1) random access type visitor
ClosedPublic

Authored by zturner on May 9 2017, 10:56 AM.

Details

Summary

Type streams are notoriously slow to access, because records are variable length, and so given a particular TypeIndex, finding the record with the given TypeIndex is O(n). But if we know the offsets of *some* of the elements, we can do better.

The type visitor here provides lazy, random access to an arbitrary type stream, only ever doing as much work as is necessary to find the requested type, and never doing a linear scan more than once for any record, even if it is visited more than once.

To do so, it makes use of a "type index offset array", which is essentially an array of pairs, where the first item is a TypeIndex, and the second item is a byte offset into the global type stream. Such an array is present as part of the serialized format of a PDB file, but in theory we could construct these on the fly for non-serialization purposes as well.

Using this visitor, When a given TypeIndex is requested, there are three possible outcomes.

  1. The TypeIndex has already been requested by the user before. In this case the absolute offset of this record will be cached. We just seek to the appropriate offset, deserialize one record, and invoke the user's callback immediately.
  2. The TypeIndex has not been requested by the user before, but it *has* visited internally while doing a linear scan for a different TypeIndex that the user did request. In this case, we also know the absolute offset of the record (see #3 below for how), and we proceed as in #1
  3. Neither 1 or 2 above apply. In this case we begin by finding the nearest item from the type index offset array that comes before the requested item. This is a log(M) operation where M is the number of known offsets, assuming the entries are roughly evenly spaced. There is guaranteed to be at least one, because TypeIndex 0 has an offset of 0.

Then, we also look for the nearest TypeIndex that meets the criteria in either 1 or 2 above. Whichever of these two is larger gives us the nearest known offset to the requested index. From there, we do a linear scan forward to the requested TypeIndex, caching offsets along the way as we deserialize each record. On subsequent attempts to fetch a record with a given TypeIndex, any of the records scanned during the execution of this step will have O(1) lookups.

While this process is going on, a TypeDatabase that the user supplies is being populated with the various individual records providing quick lookup to type names and data of known records without having to go through the visitor.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.May 9 2017, 10:56 AM
zturner updated this revision to Diff 98499.May 10 2017, 11:31 AM

A few minor cleanups. Mostly added a doxygen blurb at the beginning of the visitor class, and also changed the TypeDatabase from being stored by reference to being stored by value inside the visitor. It doesn't really make sense to have the user be responsible for creating the database, since the two things are so tightly coupled that neither can exist without the other.

krytarowski added inline comments.
llvm/unittests/DebugInfo/CodeView/ErrorChecking.h
10 ↗(On Diff #98499)

LLVM_UNITTESTS_DEBUGINFO_CODEVIEW_ERRORCHECKING_H ?

krytarowski added inline comments.May 10 2017, 12:11 PM
llvm/unittests/DebugInfo/CodeView/RandomAccessVisitorTest.cpp
1 ↗(On Diff #98499)

llvm/unittest/DebugInfo/CodeView/RandomAccessVisitorTest.cpp ?

zturner updated this revision to Diff 98541.May 10 2017, 3:13 PM

Completely deleted RandomAccessTypeDatabase. It made it really awkward not having the random and sequential databases being able to interoperate with each other. For example, you might run llvm-pdbdump with the option to dump everything, and include another option which will trigger a type lookup. But since they used different databases, the two operations couldn't be made to work together. The only real difference between the two was that one used a BitVector and the other didn't, so it wasn't too much work to combine them into a single TypeDatabase that supports both appending and random access.

rnk edited edge metadata.May 10 2017, 3:49 PM

I think you probably want to go to the simpler strategy we described at lunch, which looks more like:

  • Maintain the KnownOffsets map from TypeIndex to offset, where all offsets start off unknown. You could use a sentinel value (~0U) or a separate BitVector to track if the offset is known.
  • If an offset is unknown, binary search the IndexToOffset array to find the start and end offsets of that range of records.
  • Scan that range of records and update KnownOffsets for all records in that range.

This avoids bad worst case behavior that might come up if the user scans forwards or backwards through type indices. The difference between this algorithm and what you have is that you're trying to avoid scanning the complete range, but that means you have to do scans through the BitVector to try to resume scanning where you left off. That isn't strictly necessary to get the fast random access that we're looking for.

llvm/include/llvm/DebugInfo/CodeView/RandomAccessTypeVisitor.h
71 ↗(On Diff #98541)

You should try to use the Doxygen /// comments when it makes sense here and below. I don't consume the Doxygen, but there are people who do.

llvm/include/llvm/DebugInfo/CodeView/TypeDatabase.h
27 ↗(On Diff #98541)

No caller uses IL_FirstAvailable. What's the use case? Do we need this overload at all, or can we standardize on calling recordType with TypeIndex?

llvm/lib/DebugInfo/CodeView/TypeDatabase.cpp
156 ↗(On Diff #98541)

This will most likely be O(n^2) because it bypasses the growth multiplier in vector push_back.

165 ↗(On Diff #98541)

This is linear, it might scan to the beginning of the BitVector without stopping at the beginning of the range.

In D33009#751659, @rnk wrote:

I think you probably want to go to the simpler strategy we described at lunch, which looks more like:

  • Maintain the KnownOffsets map from TypeIndex to offset, where all offsets start off unknown. You could use a sentinel value (~0U) or a separate BitVector to track if the offset is known.
  • If an offset is unknown, binary search the IndexToOffset array to find the start and end offsets of that range of records.
  • Scan that range of records and update KnownOffsets for all records in that range.

This avoids bad worst case behavior that might come up if the user scans forwards or backwards through type indices. The difference between this algorithm and what you have is that you're trying to avoid scanning the complete range, but that means you have to do scans through the BitVector to try to resume scanning where you left off. That isn't strictly necessary to get the fast random access that we're looking for.

I'm still not entirely convinced. At lunch we discussed doing this as a way to avoid having a BitVector at all. But it turns out we need the BitVector regardless, because the TypeDatabase needs to know what records are valid, and it is decoupled from the visitor. So that aspect doesn't really buy us anything.

I think it comes down to access patterns. 90% or more of the types in a PDB are never going to be accessed. You're looking for a symbol, it says the type of the symbol is 329082, what are the odds you're going to later access symbol 329075? Pretty low by my estimation. Type accesses will be sparse, and so while it will increase our cache hit ratio to scan the whole block up front, it doesn't necessarily seem like a win to me. Most type accesses are going to come from un-examined chunks.

zturner added inline comments.May 10 2017, 4:04 PM
llvm/lib/DebugInfo/CodeView/TypeDatabase.cpp
165 ↗(On Diff #98541)

I could update BitVector to have an overload that stops at a certain point?

zturner updated this revision to Diff 98661.May 11 2017, 11:30 AM

Fix the linear scan to the beginning of the BitVector as well as the poorly performing resize operation.

rnk added a comment.May 11 2017, 1:10 PM

I'm still not entirely convinced. At lunch we discussed doing this as a way to avoid having a BitVector at all. But it turns out we need the BitVector regardless, because the TypeDatabase needs to know what records are valid, and it is decoupled from the visitor. So that aspect doesn't really buy us anything.

I think it comes down to access patterns. 90% or more of the types in a PDB are never going to be accessed. You're looking for a symbol, it says the type of the symbol is 329082, what are the odds you're going to later access symbol 329075? Pretty low by my estimation. Type accesses will be sparse, and so while it will increase our cache hit ratio to scan the whole block up front, it doesn't necessarily seem like a win to me. Most type accesses are going to come from un-examined chunks.

I think it avoids extra complexity that isn't necessary to get the same algorithmic guarantees. It lets us reduce the API surface area of TypeDatabase to avoid findPrevoius / find_last_in. I feel like data-structurey code needs to be as simple as humanly possible to make it easy to reason about the algorithmic properties. If we can get by with less ifs, the better.

llvm/include/llvm/DebugInfo/CodeView/TypeDatabase.h
27 ↗(On Diff #98541)

What's the plan for IL_FirstAvailable?

zturner updated this revision to Diff 98674.May 11 2017, 2:11 PM

Updated with fixes.

rnk added a comment.May 12 2017, 10:54 AM

Thanks, one more round of comments and I think we're done. I'm just staring at data structure code trying to make it simple.

llvm/lib/DebugInfo/CodeView/RandomAccessTypeVisitor.cpp
34 ↗(On Diff #98674)

I think extracting this if block into a private method would improve readability. Using upper_bound is always tricky.

35 ↗(On Diff #98674)

Shouldn't this be .begin() not .end()? If it's empty, they'll be equivalent anyway.

48 ↗(On Diff #98674)

I think you can simplify this to three cases:

  1. PartialOffsets.empty(): Scan all type indices. You can probably handle this step first and return early, so that Begin is always valid.
  2. std::next(Begin) == PartialOffsets.end(): In this case, TIE is based on capacity like you have it.
  3. Otherwise, TIE is std::next(Begin)->Type
50 ↗(On Diff #98674)

Maybe std::next?

llvm/lib/DebugInfo/CodeView/TypeDatabase.cpp
159 ↗(On Diff #98674)

Isn't this a linear scan from the end of the bitvector? Does Count work instead?

zturner added inline comments.May 12 2017, 11:10 AM
llvm/lib/DebugInfo/CodeView/TypeDatabase.cpp
159 ↗(On Diff #98674)

This is something I've been kinda struggling with. Count works as long as you assume the user is *only* append or *only* inserting randomly. The API doesn't enforce that, and there's no good way to make it enforce that. In practice that's how it's used though.

zturner updated this revision to Diff 98805.May 12 2017, 11:12 AM
rnk accepted this revision.May 12 2017, 11:14 AM

lgtm

llvm/include/llvm/DebugInfo/CodeView/RandomAccessTypeVisitor.h
71 ↗(On Diff #98541)

Thoughts?

This revision is now accepted and ready to land.May 12 2017, 11:14 AM
This revision was automatically updated to reflect the committed changes.