This is an archive of the discontinued LLVM Phabricator instance.

[pdbdump] Verify LF_{CLASS,ENUM,INTERFACE,STRUCTURE,UNION} records with different hash function.
ClosedPublic

Authored by ruiu on Jun 15 2016, 11:42 AM.

Details

Summary

This patch adds a new hash function hashBufferV8 and use it to verify TPI records.

Diff Detail

Event Timeline

ruiu updated this revision to Diff 60870.Jun 15 2016, 11:42 AM
ruiu retitled this revision from to [pdbdump] Verify LF_{CLASS,ENUM,INTERFACE,STRUCTURE,UNION} records with different hash function..
ruiu updated this object.
ruiu added a reviewer: zturner.
ruiu added a subscriber: llvm-commits.
zturner edited edge metadata.Jun 15 2016, 11:47 AM

I think we should have some tests confirming that these functions produce the same values that are in MS-generated PDBs for the same record contents.

ruiu added a comment.Jun 15 2016, 11:48 AM

Do we already have unit tests for PDB?

ruiu added a comment.Jun 15 2016, 11:49 AM

Ah, we have tests under unittests/DebugInfo/PDB. I'll add a new file there.

ruiu added a comment.Jun 15 2016, 11:52 AM

Actually, empty.pdb has type records that uses this hash function, so if it's broken the test would break. You can confirm that by returning some random value from hashBufferV8. Do we need more tests?

Well but empty.pdb is not generated by us. Since now we're generating the hash, rather than just reading the hash that someone else generated I think we should write a test that verifies that a) we're using the correct hash function, and b) we're hashing the right buffers. So I was thinking of a test that would work like the following:

  1. Read type records, and the corresponding hashes out of empty.pdb
  2. Use our functions to hash those records and generate hash values.
  3. verify that the hash we generate is the same one that was found in the PDB file.

Similar to how you ran that test locally a week or so ago to check whether the hashes match so we know we're using it correctly, but in a test.

ruiu added a comment.Jun 15 2016, 12:01 PM

I think that's exactly what we are doing -- in this code, we read type records, compute hash values using our hash function, and compare them with hash values on the PDB file. If they are different, pdbdump reports that the file is corrupted.

Ahh, I think I get it. I'm a little concerned about the performance impact
of this though. The whole reason the hash is stored in the file is so that
it doesn't have to recompute the hash up front. If we're re-computing it
every time we load the type streams, it seems like it will be a big
performance problem. I think we should only compute the hash when we
modify a record and/or write the file.

ruiu added a comment.Jun 15 2016, 12:43 PM

Yeah, I don't think this code will live here forever. I'm adding code here to verify that my understanding on how the hash values are computed is correct. Once everything is set, I'll probably remove it from here and move this to a writer.

In D21393#459087, @ruiu wrote:

Yeah, I don't think this code will live here forever. I'm adding code here to verify that my understanding on how the hash values are computed is correct. Once everything is set, I'll probably remove it from here and move this to a writer.

Ok, for the time being can you add a command line option --verify-tpi-hash-integrity and only do this check if it is set? Sometimes I dump really large files, and I'm afraid this would double the amount of time it takes which is already >= 1 minute in some cases.

ruiu updated this revision to Diff 60978.Jun 16 2016, 8:00 AM
ruiu edited edge metadata.
  • Rebased
zturner added inline comments.Jun 16 2016, 11:12 AM
lib/DebugInfo/PDB/Raw/TpiStream.cpp
73

I was imagining that we wouldn't even have this TpiHashVerifier class, and that everything would just go in TypeDumper. Does this not work? For example, imagine you delete all of this code, and delete the TpiHashVerifier class, and delete lines 205-209. Then, in TypeDumper.cpp, you change CVTypeDumper::visitClass to look like this:

Error CVTypeDumper::visitClass(ClassRecord &Class) {
  uint16_t Props = static_cast<uint16_t>(Class.getOptions());
  W->printNumber("MemberCount", Class.getMemberCount());
  W->printFlags("Properties", Props, makeArrayRef(ClassOptionNames));
  printTypeIndex("FieldList", Class.getFieldList());
  printTypeIndex("DerivedFrom", Class.getDerivationList());
  printTypeIndex("VShape", Class.getVTableShape());
  W->printNumber("SizeOf", Class.getSize());
  W->printString("Name", Class.getName());
  if (Props & uint16_t(ClassOptions::HasUniqueName))
    W->printString("LinkageName", Class.getUniqueName());
  Name = Class.getName();

  /* NEW CODE HERE */
  if (auto EC = verifyHash(Class))
    return EC;
  return Error::success();
}

With the current implementation, we walk the entire type array twice. Once to verify hashes, and once to dump them. So if we could do it all in one pass, it seems better.

ruiu added a comment.Jun 16 2016, 11:33 AM

It may work (though I didn't try), but I think the current structure is better since we are going to remove this code from here and add code to actually generate hash values to the writer. In the current structure, the code to generate hash values are separated and concentrated. It's better than scatter it into multiple methods.

zturner accepted this revision.Jun 16 2016, 11:38 AM
zturner edited edge metadata.
This revision is now accepted and ready to land.Jun 16 2016, 11:38 AM
This revision was automatically updated to reflect the committed changes.