This is an archive of the discontinued LLVM Phabricator instance.

[NativePDB] Improve support for reconstructing a clang AST from PDB debug info
ClosedPublic

Authored by zturner on Nov 7 2018, 11:21 AM.

Details

Summary

This is an alternative to D54053 which uses a different approach. The goal of both is the same - to be able to improve the quality of the AST that we reconstruct when parsing the debug info. D54053 attempts to address this by demangling the unique name of each type, and using the structure of the demangler's AST to try to reconstruct a clang AST.

However, there are some complications with this approach. The two biggest ones are:

a) The mangling does not always provide enough information to disambiguate between two types, depending on where it occurs in the mangling.
b) The mangling provides no way to differentiate outer classes from outer namespaces, so in A::B::C, we don't know if A and B are (class, class), (namespace, namespace), or (namespace, class).

Point b) sounds like it could be an unimportant distinction, but LLDB works by gradually building up an AST that grows as more and more debug info is parsed, so you can very quickly end up with ambiguities in your AST. For example, you may decide that B is probably a namespace and create a NamespaceDecl for it in the AST; later someone instantiates a variable of type A::B and you have precise debug info telling you it's a class. This creates two decls with the same name at the same scope in the AST hierarchy. Such ambiguities slowly build up over time, leading to instability.

The approach here is based on the observation that the PDB contains information about nested classes in the parent -> child direction, just not the other way around. That is to say, if you have code such as: struct A { struct B {}; }; then the debug info record for A will tell you that it contains a nested type called B, along with an index for the full definition of B in the debug info. The problem we have been facing all along is that if someone declares a variable of type A::B, they need the reverse mapping, and PDB doesn't offer that.

So the solution employed here is simply to pre-process all types up front and build the reverse mapping. This gives us perfect information about the class hierarchy, and allows us to precisely determine whether a part of a scope is a namespace (specifically, it will have no parent in the reverse mapping).
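To make the idea concrete, here is a minimal sketch of inverting the parent -> child nesting information into a child -> parent map. The types ClassRecord and NestedTypeRecord and the functions BuildParentMap, IsNestedInClass and EnclosingClasses are hypothetical stand-ins invented for illustration; they are not LLVM's or LLDB's real types or APIs, and the actual patch may be structured quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a type index in the debug info.
using TypeIndex = std::uint32_t;

// Hypothetical model of one nested-type member (what LF_NESTTYPE describes).
struct NestedTypeRecord {
  std::string name;      // e.g. "B"
  TypeIndex nested_type; // index of the full definition of the nested type
};

// Hypothetical model of a class record together with its nested-type members.
struct ClassRecord {
  TypeIndex index;
  std::string name;
  std::vector<NestedTypeRecord> nested;
};

// One pass over all records: invert parent -> child into child -> parent.
std::unordered_map<TypeIndex, TypeIndex>
BuildParentMap(const std::vector<ClassRecord> &records) {
  std::unordered_map<TypeIndex, TypeIndex> parent_of;
  for (const ClassRecord &rec : records)
    for (const NestedTypeRecord &nt : rec.nested)
      parent_of[nt.nested_type] = rec.index;
  return parent_of;
}

// A type with no entry in the reverse map is not nested in any class, so
// every component of its enclosing scope can only be a namespace.
bool IsNestedInClass(const std::unordered_map<TypeIndex, TypeIndex> &parent_of,
                     TypeIndex ti) {
  return parent_of.count(ti) != 0;
}

// Walk upwards and collect the enclosing classes, innermost first.
std::vector<TypeIndex>
EnclosingClasses(const std::unordered_map<TypeIndex, TypeIndex> &parent_of,
                 TypeIndex ti) {
  std::vector<TypeIndex> chain;
  for (auto it = parent_of.find(ti); it != parent_of.end();
       it = parent_of.find(it->second)) {
    chain.push_back(it->second);
    // Well-formed nesting has no cycles, so the chain cannot be longer
    // than the number of entries in the map.
    assert(chain.size() <= parent_of.size() && "cycle in nesting data");
  }
  return chain;
}
```

With records for A, A::B and A::B::C, EnclosingClasses for C would yield B and then A, while a type that appears in no nested-type list yields an empty chain, meaning its scope consists of namespaces only.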

But we can even re-purpose this pre-processing step for other things down the line. For example, we may wish to find all types named Foo, but maybe Foo is a template and the instantiation is Foo<int>. We could use this pre-processing step to build this kind of hash table. And many other things as well.
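As a rough illustration of what such a hash table might look like, the sketch below indexes types by their name with any trailing template argument list removed. BaseName and BuildBaseNameIndex are hypothetical helpers invented here, not part of this patch or of LLDB, and the bracket matching is deliberately naive: it does not handle template arguments that themselves contain a comparison operator, nor bracketed names that appear after a scope qualifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a type index in the debug info.
using TypeIndex = std::uint32_t;

// Strip a trailing template argument list: "Foo<int>" -> "Foo".
// Scans backwards from the final '>' to find the '<' that matches it.
std::string BaseName(const std::string &name) {
  if (name.empty() || name.back() != '>')
    return name;
  int depth = 0;
  for (std::size_t i = name.size(); i-- > 0;) {
    if (name[i] == '>') {
      ++depth;
    } else if (name[i] == '<') {
      // We started on a '>' and return as soon as depth reaches zero,
      // so depth is always positive here.
      assert(depth > 0 && "saw '<' before its closing '>'");
      if (--depth == 0) {
        // A name that is entirely bracketed (e.g. "<unnamed-tag>") has
        // nothing in front of the brackets, so keep it unchanged.
        return i == 0 ? name : name.substr(0, i);
      }
    }
  }
  return name; // unbalanced brackets: leave the name alone
}

// Map each base name to every type index whose name reduces to it.
std::unordered_map<std::string, std::vector<TypeIndex>> BuildBaseNameIndex(
    const std::vector<std::pair<TypeIndex, std::string>> &types) {
  std::unordered_map<std::string, std::vector<TypeIndex>> index;
  for (const auto &entry : types)
    index[BaseName(entry.second)].push_back(entry.first);
  return index;
}
```

A lookup for Foo in such an index would return both Foo<int> and Foo<char>, which a plain exact-name table cannot do.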

Note that the idea of demangling a type and using the structured demangler AST is not totally abandoned. It may still be useful for templates: if you have a template instantiation named Foo<int>, the patch here will simply create a class with the name Foo<int>. In other words, we make no attempt to parse template parameters and create the appropriate instantiations in the AST.

We also do not yet handle scoped classes (i.e. classes that are defined inside the body of a function). But we can handle those later.

Note that I started adding a new kind of test, an ast test. I even retrofitted existing tests with ast testing functionality. I think this is a useful testing strategy to ensure we are generating correct ASTs from debug info.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision. Nov 7 2018, 11:21 AM
aleksandr.urakov accepted this revision. Nov 8 2018, 2:43 AM

Looks good, thank you!

The only question is performance: have you checked how long the preprocessing takes on huge PDBs? Intuitively it seems that it shouldn't take too much time (n*log(n), where n is the count of LF_NESTTYPE records), but maybe you have already measured this?

This revision is now accepted and ready to land. Nov 8 2018, 2:43 AM

I checked on clang.pdb. For my local build of LLVM this is about 780MB. It's quite slow in a debug build (14 seconds for ParseSectionContribs and 60 seconds for PreprocessTpiStream), but in a release build the combined total for both function calls is less than 2 seconds.

I think that less than 2 seconds for a 780MB PDB in release is very good! Thank you!

This revision was automatically updated to reflect the committed changes.