Page MenuHomePhabricator

Initial pass at API design for DebugInfo/PDB

Authored by zturner on Feb 2 2015, 1:30 PM.



*IN PROGRESS* This is in progress and is not supposed to be final. Many interfaces are incomplete / unimplemented. I'm submitting this early to solicit comments on the overall design, so that anyone interested can help shape the design of the library.

I approached this design with the following assumptions in mind:

  1. The interface to the PDB reading code should expose no Windows-specific types.
  2. LLVMDebugInfoPDB should compile on non-Windows
  3. LLVMDebugInfoPDB should support multiple simultaneous implementations of a PDB reader, and you should be able to choose which one you create at runtime.

#1 means that, essentially, I need to "shim" or clone the interface to DIA. A full description of DIA and its available interfaces can be found here: Since DIA is COM-based, there is really no way to hide the Windows nature of these methods and interfaces without making an entirely new interface that looks about the same, and has a DIA version as one possible implementation.

#3 is assumed in order to leave open the possibility that someone in the future may create a different implementation of this PDB reader. Even if that happens though, I believe we may still want a DIA-based version of the itnerface in the code. The reason for this is to make it possible to understand how PDBs evolve with new versions of Visual Studio and new releases of the DIA SDK which expose new functionality which may not be recognized by a non-DIA based implementation.

The DIA SDK has a somewhat awkward object model, so before thinking about this patch it helps to understand at a fundamental level how PDBs are queried in DIA. The full set of interfaces is at the aforementioned URL, but one of those interfaces is of particular importance. The IDiaSymbol interface []. This is the *only* access to symbols in the PDB. This means functions, executables, global variables, and more are all queried through this interface (A complete list of symbol types is enumerated in the PDB_SymType enum included in this patch). The value of this enum for a particular symbol determines which subset of methods on the IDiaSymbol are valid to call. For example, the methods which are valid to call on an IDiaSymbol whose symbol type is PDB_SymType::Function are documented here [].

I have attempted to model this in my patch as follows:

  1. An abstract interface named IPDBSymbolBase includes the same set of methods defined by IDiaSymbol. Implementers of a particular PDB reading strategy (e.g. DIA) are expected to implement this interface.
  2. For each concrete symbol type, a non-abstract class is provided. In this initial patch, this includes PDBFunction and PDBExecutable, which expose the exact set of methods that are valid for that symbol type. This is essentially a wrapper about the base type, whose purpose is to limit the set of methods which the user can call.
  3. A user can obtain one of the concrete wrapper types by either manually constructing it with an IPDBSymbolBase pointer, or by using a dyn_cast<> like template called pdb_symbol_cast<>.

Note that DIA provides access to other information from a PDB that is not just symbols. For example, line numbers, source files, and various other things are simply independent interfaces and do not follow this tag-based / casting model. For those, I have simply cloned the interface and will expect the implementor to provide an implementation, and the user to use it through a base pointer.

Sorry for the long winded introduction, but since some people on this list have 0 experience with DIA, I hope this helps. I'm open to completely gutting this / re-working it if people think a different design would be more appropriate.

Diff Detail


Event Timeline

zturner updated this revision to Diff 19185.Feb 2 2015, 1:30 PM
zturner retitled this revision from to Initial pass at API design for DebugInfo/PDB.
zturner updated this object.
zturner edited the test plan for this revision. (Show Details)
zturner added a subscriber: Unknown Object (MLST).

The value of this enum for a particular symbol determines which subset of methods on the IDiaSymbol are valid to call.

One additional point about this line. The subset of methods which are valid to call for a particular IDiaSymbol is not easy to organize into a useful object hierarchy. For some values of the tag field, the subsets will be disjoint, while for others they will overlap, and for others they might even be the same. If you take the intersection across all tag values, for example, the only common method will be the function which actually returns the tag value.

So it's difficult to organize this into a meaningful object hierarchy.

dblaikie edited edge metadata.Feb 2 2015, 2:36 PM

Few random comments provided.

23 ↗(On Diff #19185)

In theory the LLVM coding convention requires that types with vtables have an anchor function (one non-inline virtual function).

I'm not sure if the type /only/ has pure virtual functions whether that matters (but it probably does - even a vtable of only pure virtual functions still has to exist, I think)

25 ↗(On Diff #19185)

This is the total number of things this stream will visit? (or the number remaining? or the current index?)

26 ↗(On Diff #19185)

What's this the name of?

27 ↗(On Diff #19185)

I take it this returns the current item but doesn't move the cursor? Maybe it should be const?

28 ↗(On Diff #19185)

return Optional<RecordType> rather than an out parameter?

(I'd probably call this "next" but that might be a bit vague (& collides with std::next))

32 ↗(On Diff #19185)

missing newline at end of file

25 ↗(On Diff #19185)

Wouldn't this return value be passing ownership (via unique_ptr or similar)? and similarly on the next two functions.

27 ↗(On Diff #19185)

Not sure if these typedefs help readability - it feels like they hinder it by hiding a common construct (unique_ptr) in something opaque when reading the code. But I tend to be a bit verbose...

zturner added inline comments.Feb 2 2015, 2:43 PM
25 ↗(On Diff #19185)

The former. It's the total number of times it's valid to call getNext() if you're at the beginning of the stream (when it's newly constructed, or you've just called reset).

26 ↗(On Diff #19185)

The name of the data stream (which in turn identifies what kind of data it has). The stream named "FPO" for example, contains fpo data.

27 ↗(On Diff #19185)

In theory it probably could be, but the underlying DIA interface doesn't have any const methods. I could probably use const on almost every single method of IPDBSymbol if we're ok const_casting. I didn't make anything const though since the underlying DIA interface has no const methods.

Since you poitned out this method though, I also noticed that I left off the index.

25 ↗(On Diff #19185)

Yes, thanks. I forgot to update these to return unique_ptrs.

27 ↗(On Diff #19185)

Does llvm tend to write out unique_ptr<T> everywhere? It leads to a lot of line wrapping.

dblaikie added inline comments.Feb 2 2015, 2:57 PM
25 ↗(On Diff #19185)

Not sure if there might be a better name for this. (size()?)

26 ↗(On Diff #19185)

Is that useful to clients? Would there be a better API like just having different types (templated, perhaps) of streams?

27 ↗(On Diff #19185)

Chances are you won't need to const_cast much since you'll have indirection to the underlying API anyway, I assume (the concrete implementation of this interface will have a member pointer to some DIA structure, right? And you call functions on the thing pointed to - since const isn't "deep" that'll just work without const_cast)

(open question about the threading guarantees of the underlying DIA APIs, though - if those operations that are conceptually const but not marked const in the DIA API are not thread safe, that might surprise users of this API to call a non-thread-safe const function, but if that's the case, a comment that this whole API is thread incompatible would be fine I think)

27 ↗(On Diff #19185)

I think so, at least in my experience - but I don't claim to have lots of data.

(I tend to be much more conservative about using typedefs (rarely use them) than most people, so feel free to treat this feedback with a grain of salt)

zturner added inline comments.Feb 2 2015, 3:10 PM
26 ↗(On Diff #19185)

I actually pondered not including access to the underlying streams at all. The reason is that the streams simply provide access to the raw bytes in the PDB file. All of the information present in the streams is available through the typed methods as well. But I figured it was useful still for the dumper. From what I can tell, the dumper will be the only real client of the stream data, so just returning the name sems kind of reasonable. Another thing is that it's not documented what all of the different streams that might be returned are. So far I've observed a few different ones through some dumps I've run on my own, but they're not always the same, so there could be some we don't know about.

As an aside, one thing I actually thought about with regards to the object model was only including 1 method in the base class. the one that gets the symbol tag. Then having each concrete type derive from the base tag and then provide pure virtual methods for the methods that it would support. So instead of having this PDBFunction, for example, which is ultimately just a facade that forwards methods to the underlying monolithic interface, instead have an IPDBFunction whose vtable contains the exact set of methods relevant for a function.

The advantages of this approach is that

  1. you can no longer call invalid methods through the IPDBSymbolBase interface.
  2. pdb_symbol_cast<> has more natural behavior now. Instead of casting a pointer type to a value type, it actually behaves more like a traditional dynamic cast.

And the disadvantages are:

  1. Implementor has to do more work now. It has to implement the same function many times, for example when valid methods overlap.
zturner updated this revision to Diff 19247.Feb 3 2015, 10:42 AM
zturner edited edge metadata.

Addressed all (I think?) of David's comments.

dblaikie added inline comments.Feb 3 2015, 11:25 AM
28 ↗(On Diff #19247)

This cast machinery seems a bit confusing owing to the mix of pointers and value types. (& you mention in the summary that you can also use the traditional llvm::*cast machinery too? That would be odd as well - to cast to concrete types from pointer types)

One other way that PDBFunction et, al could be implemented that might be more natural (& require less of this machinery) - make PDBSymbol opaque (virtual dtor, no other functions) and have various PDBFunction/etc derive from that. Use LLVM's standard cast machinery here. Implement them in terms of a PDBRawSymbol/PDBSymbolImpl, something like that (which would be what your current PDBSymbol is) - maybe the base opaque PDBSymbol could have a non-virtual "getRaw/getImpl/whatever" to access the raw thing when desired (in dumpers, etc).

I just hesitate to add more 'odd' casting machinery - we already stretch that idiom a bit further than I think is helpful with some of our hierarchies.

21 ↗(On Diff #19247)

I don't think you need an out-of-line ctor /and/ dtor, just the dtor would be fine - and maybe a comment about these dtors being anchors.

zturner updated this revision to Diff 19275.Feb 3 2015, 2:50 PM

Reworked the class hierarchy per David's suggestions. Library now should work with llvm::isa<> and llvm::dyn_cast<> instead of relying on custom casting machinery. This was done by injecting a new base class below all the concrete PDBFunction / etc types.

dblaikie added inline comments.Feb 3 2015, 3:05 PM
24 ↗(On Diff #19275)

Name's OK, but "count" still seems like it might be a bit ambiguous with index, or number remaining. Another option (which seems more idiomatic in LLVM) might be "getNumChildren" (while I would like "size()" it's annoying that the count is attached to the iterator and not to some separate structure that produces iterators and I don't know how to properly communicate the separation of these concepts, unfortunately)

193 ↗(On Diff #19275)

How're these functions going to communicate failure?

33 ↗(On Diff #19275)

return by reference rather than pointer here, perhaps?

46 ↗(On Diff #19275)

could probably make this private and the derived classes can just call "getRawSymbol" - the derived classes don't need the ability to mutate (null, repoint, etc) this variable. (optional, not a big deal either way, just a little more stringent encapsulation)

The member could actually be const, too/instead.

20 ↗(On Diff #19275)

"return nullptr" will work here

zturner added inline comments.Feb 3 2015, 3:11 PM
193 ↗(On Diff #19275)

Undefined in theory, probably false in practice. You shouldn't call methods on the raw interface unless you know it's a valid method.

For the purposes of a dumper who wanted to detect unknown / unexpected fields, that knowledge could all be encapsulated in the implementation of the raw interface. For example, you could have PDBSymbol::dump() which calls RawSymbol->dump(), and that particular implementation can go to the native API instead of calling the friendly accessors.

I was thinking it might still be useful for people who knew what they were
doing and who wanted to write some implementation-specific code to be able
to get at the native native interface. Like in this case a raw COM
pointer. In theory this shouldn't ever be necessary, but it's hard to say
without knowing anything about implementation strategies other than DIA how
much compatibility someone would be able to achieve, and if there ever
might be things that are only possible to implement in terms of one API but
not the other. So I didn't want to prematurely remove the possibility to
get at the native interface, for example by writing

zturner updated this revision to Diff 19361.Feb 4 2015, 3:51 PM

Ok this CL just got way bigger, but it should be fairly complete in terms of stuff that will live in the generic API. There is now one concrete implementation of PDBSymbol for every tag value, and this uses some macro C++11 magic to forward the calls through to the underlying raw interface.

Additionally, this provides a unit test which uses a mock implementation of the underlying raw symbol to verify that the dyncast semantics are as expected. Confirmed that the unit tests are hooked up correctly, as I changed one line in the test to something that would obviously fail, and my test failed.

ping. Don't expect you to read through this whole patch again thoroughly since it's gotten long, but I think all issues have been addressed. And many of the files are fairly boilerplate. LMK if you're ok with this going in.

dblaikie accepted this revision.Feb 6 2015, 10:10 AM
dblaikie edited edge metadata.

Few optional bits of feedback but otherwise seems fine.

If you want some more help/details on the .def stuff, let me know - happy to go through it.

64 ↗(On Diff #19361)

This still means the vtable is emitted weakly, though it will only be emitted in one place. So this doesn't quite meet the style guide's requirement & I'd rather use the virtual dtor or an explicit anchor function, but I'm not so fussed. I do think the rule is a bit silly, but I'm not sure this particular approach is worthwhile (I suppose my thought is that the rule is silly enough that it's not worth partly/tangentially/sort-of adhering to, might as well just not bother adhering to it at all, perhaps - I dunno)

56 ↗(On Diff #19361)

Could possibly move the #define closer to the use (even inside the function) and maybe undef it afterwards.

63 ↗(On Diff #19361)

Would it be useful to do this more as a .def file style - Take a look at Duncan's recent use of include/llvm/IR/Metadata.def for an example (essentially have the big long list of all the PDBSymbols written once, in one file - then #define a macro for what you want to do with those names, then include the file - thus stamping out all sorts of things you want, for all PDBSymbols)

53 ↗(On Diff #19361)

Probably prefer the LHS return type. The only reason to use the -> return type formulation is if you want to refer to function parameters, generally.

Also, rather than casting nullptr, the usual way to "give me a thing of some particular kind in an unevaluated context" is a utility called declval:

decltype(declval<IPDBRawSymbol>().Func()) Func() const override {
62 ↗(On Diff #19361)

Use override rather than virtual (for all of these).

This revision is now accepted and ready to land.Feb 6 2015, 10:10 AM
This revision was automatically updated to reflect the committed changes.