This is an archive of the discontinued LLVM Phabricator instance.

[llvm-pdbdump] Add the beginning of PDB diffing support
ClosedPublic

Authored by zturner on Mar 13 2017, 2:11 PM.

Details

Summary

For now we only support diffing the MSF Super Block and the Stream Directory. Subsequent patches will drill down into individual streams.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.Mar 13 2017, 2:11 PM
ruiu edited edge metadata.Mar 13 2017, 2:18 PM

Compared to dumping two files using pdb2yaml and diffing them, what is the advantage of this tool?

llvm/tools/llvm-pdbdump/Diff.cpp
103 ↗(On Diff #91613)

No newline at end?

In D30908#699783, @ruiu wrote:

Compared to dumping two files using pdb2yaml and diffing them, what is the advantage of this tool?

Well, as an extreme (but obviously hypothetical) example, consider the case where pdb2yaml outputs a constant yaml file no matter what input it gets. The diff would always be the same, but the underlying pdbs could be entirely unrelated.

This is obviously contrived, but the point is that if there are bugs in tool X, it doesn't make sense to use the output of tool X to verify that tool X is working correctly.

Beyond that though, pdb2yaml is only intended to give you a textual representation of the contents. It's useful to be able to understand how 2 PDB files differ at the byte level. It's entirely possible that two PDBs which differ in subtle ways could produce the same YAML. Also, even with a 50MB pdb, pdb2yaml takes an extremely long time to run. If we wanted to do it on something the order of chrome.dll.pdb which is maybe 1GB, it would be impractical.

zturner updated this revision to Diff 91622.Mar 13 2017, 2:40 PM

Minor updates.

As one more point about "why not just diff the YAML", I just ran pdb2yaml and yaml2pdb on a random PDB on my drive. Then I ran llvm-pdbdump diff on it. With just the code in this patch, I get the following output:

Found 8 common streams.  Searching for intra-stream differences.
  DBI Stream (Left: 283128 bytes, Right: 160570 bytes)
  Named Stream "/names" (Left: 58244 bytes, Right: 55870 bytes)
  Old MSF Directory (Left: 40 bytes, Right: 0 bytes)
  PDB Stream (Left: 118 bytes, Right: 114 bytes)
  TPI Stream (Left: 382784 bytes, Right: 382632 bytes)

Note that the TPI stream on the left (the original file) is 152 bytes larger. If I use pdb2yaml again on the new file and diff their contents, nothing appears different about the TPI stream. So we need other tools to figure this kind of thing out.

rnk edited edge metadata.Mar 13 2017, 2:42 PM

It's also possible to have two PDBs with equivalent type streams but different type index numberings. This diffing tool would make it possible to look past those superficial differences.

rnk accepted this revision.Mar 13 2017, 2:43 PM

lgtm

This revision is now accepted and ready to land.Mar 13 2017, 2:43 PM
amccarth added inline comments.Mar 13 2017, 3:08 PM
llvm/tools/llvm-pdbdump/Diff.cpp
117 ↗(On Diff #91613)

Should you report if P.size() != Q.size()?

122 ↗(On Diff #91613)

Should that be !=? (Next line as well.)

150 ↗(On Diff #91613)

Is the Comparator necessary? Isn't the default comparator std::less, which should do exactly the same thing?

166 ↗(On Diff #91613)

In practice, I suspect .data() will work, but .begin() would seem more principled, since these algorithms work with pairs of iterators. The fact that SmallVector iterators are pointers (as returned by .data()) is an implementation detail I wouldn't rely on.

llvm/tools/llvm-pdbdump/StreamUtil.h
20 ↗(On Diff #91613)

Could you add a comment clarifying what this method does?

zturner added inline comments.Mar 13 2017, 3:37 PM
llvm/tools/llvm-pdbdump/Diff.cpp
117 ↗(On Diff #91613)

That should get reported indirectly (because we will just report that one file has streams not present in the other file).

122 ↗(On Diff #91613)

Yes, fixed in latest diff.

150 ↗(On Diff #91613)

PI is the value type of the enumerator range, which in this case is something like result_pair<SmallString<32>>. Even if that weren't the case and it were just std::pair<size_t, std::string> though, this comparator is specifically designed to only look at the second item of the pair (the name) and not the index.

166 ↗(On Diff #91613)

begin() would actually be wrong here. I've only called reserve, not resize, so size() == 0. SmallVector<> has this optimization so that you can copy objects into it without default constructing them first, and then calling set_size() (which is different than resize()) after you write into its internal buffer.

amccarth added inline comments.Mar 13 2017, 4:05 PM
llvm/tools/llvm-pdbdump/Diff.cpp
166 ↗(On Diff #91613)

I understand you've only reserved space using this SmallVector feature. I would expect begin on an empty SmallVector with a reservation to return an iterator at the beginning of the buffer and for end to return the same thing.

In fact, the implementation of SmallVector::data() is to return begin() coerced to the container's pointer type (which happens to be identical to its iterator type).

/// Return a pointer to the vector's buffer, even if empty().
pointer data() { return pointer(begin()); }

I normally wouldn't go diving into a container's implementation details, but it's the only place I can find that documents this feature of using the reserved buffer of an empty container.

Mixing pointers and iterators of a SmallVector works because they're interchangeable, but it seems to make this code fragile if someone were ever to come along and select a different container type, where the same effect could be achieved with a reserve and a back_insert_iterator.

This revision was automatically updated to reflect the committed changes.