This is an archive of the discontinued LLVM Phabricator instance.

[CodeView] Allow Debug Subsections to be read/written in any order
ClosedPublic

Authored by zturner on Jun 1 2017, 3:43 PM.

Details

Summary

Debug Subsections (Checksums, Line Info, Frame Data, etc) present an interesting challenge when it comes to reading and writing, especially to and from YAML. Some of these challenges are:

  1. Depending on whether the CodeView container is a PDB or an Object File, these show up in slightly different ways. For example, in an Object File, string tables appear in the .debug$S section as a "debug subsection", while in a PDB file the string table is global to the entire file and exists
  2. They can appear in any order, even though some subsections reference other subsections. For example, you might have a Lines subsection followed by a Checksums subsection followed by a String Table subsection. But the Lines subsection references the Checksums subsection, and the Checksum subsection references the String Table subsection. So you can't interpret the data until you've read everything.

When going from Obj to Yaml, you might read (for example) a File Checksums subsection, which references a String Table subsection which you haven't actually read yet. But in Pdb to Yaml, you'll already have the string table section because it's global to the file.

Prior to this patch, we handled in this in sort of a crude way. When going from PDB to YAML, we would find the checksums subsection and write it to YAML first, then write subsequent sections later. In a sense mandating a topological sorting similar to type record streams. But this has a drawback. It means that we can't easily test our ability to work with arbitrary files, because the only files we can produce are laid out a specific way. Furthermore, it means that round-tripping will end up producing a different PDB than what you started with, because some of the subsections will be in a different order.

Finally, it makes things even more complicated when we start trying to do Obj -> Yaml -> Obj, because now these have a String Table subsection and various other subsections with cross subsection dependencies, and maintaining the topological sorting becomes cumbersome. So I consider this work a precursor to getting CodeView Obj <-> Yaml support working.

In short, if it is a valid object (or PDB) file, we would like to be able to produce it.

This patch allows subsections to be specified in any order. While this complicates some of the internal Yaml <-> Native data structure conversion logic, it greatly simplifies the portion of the library which actually writes subsections out, because you no longer have to have complicated sanitization checks to ensure things are written out in a specified order. You can just write whatever you want and it will just work.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.Jun 1 2017, 3:43 PM
inglorion requested changes to this revision.Jun 1 2017, 4:01 PM

Happy you're doing this, but see inline comments for a few requested changes.

llvm/lib/DebugInfo/CodeView/DebugStringTableSubsection.cpp
79 ↗(On Diff #101127)

This will call StringToId.find(S) twice when asserts are enabled. Since that returns an iterator, can we instead do

auto It = StringToId.find(S);
assert(It != StringToId.end() && "getStringId called with string that does not exist in the string table");
return It->second;

and save the second lookup?

llvm/lib/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.cpp
42 ↗(On Diff #101127)

Do we need this? I thought we made it so that SymbolByteSize is always a multiple of 4 when writing PDBs. Also, is this related to the other changes here or a holdover from your earlier alignment fixes?

llvm/lib/DebugInfo/PDB/Native/ModuleDebugStream.cpp
93 ↗(On Diff #101127)

If I'm reading the code in ModuleDebugStreamRef::reload() correctly, we read the subsections before it makes sense to get to this code. Given that, can we store which one is the FileChecksums subsection at reload() time so that we don't have to iterate over all the subsections to find it?

This revision now requires changes to proceed.Jun 1 2017, 4:01 PM
zturner added inline comments.Jun 1 2017, 4:18 PM
llvm/lib/DebugInfo/PDB/Native/DbiModuleDescriptorBuilder.cpp
42 ↗(On Diff #101127)

I originally started investigating the alignment issue on the same branch as I was doing this patch. Then when I committed the other one and rebased on top of trunk, these changes were still here. I need to find all these remnants and remove them.

zturner added inline comments.Jun 1 2017, 5:25 PM
llvm/lib/DebugInfo/CodeView/DebugStringTableSubsection.cpp
79 ↗(On Diff #101127)

For a debug build I think readability should take priority over performance unless there's a good reason not to. Which is why I used exists(). That said, I also think it's not a big deal and it's a minor preference, so I don't mind changing it.

llvm/lib/DebugInfo/PDB/Native/ModuleDebugStream.cpp
93 ↗(On Diff #101127)

This is a ModuleDebugStreamRef, which literally contains no information about the subsections other than a big flat buffer consisting of their data. In order to do what you suggest, we'd have to iterate them up front during the reload, and then you would suffer that performance hit even in situations where you never accessed the module debug streams. I fixed a similar performance problem in pdbdump a few weeks ago where we were iterating all types up front to compute their hashes during the reload operatino, and then nobody ended up looking at the hash values. Usually this isn't noticeable, but when I tried it on a 1.5GB PDB file, it took several minutes.

I could see doing something like caching the checksums subsection in the ModuleDebugStreamRef class and only doing the full scan the first time this function is called, but short of that I don't think we should change it.

zturner updated this revision to Diff 101147.Jun 1 2017, 5:41 PM
zturner edited edge metadata.

Updated based on previous feedback.

zturner updated this revision to Diff 101149.Jun 1 2017, 5:45 PM

Minor update. Removed dead function, renamed variable back to its original name, and changed assertion as suggested previously.

zturner updated this revision to Diff 101153.Jun 1 2017, 5:53 PM

More minor cleanup. Deleted an unused class and file

This revision is now accepted and ready to land.Jun 2 2017, 12:29 PM
This revision was automatically updated to reflect the committed changes.