This is an archive of the discontinued LLVM Phabricator instance.

[llvm-pdbutil] Add -type-ref-stats to help find unused type info
ClosedPublic

Authored by rnk on Mar 20 2019, 4:22 PM.

Details

Summary

This considers module symbol streams and the global symbol stream to be
roots. Most types that this considers "unreferenced" are referenced by
LF_UDT_MOD_SRC_LINE id records, which VC seems to always include.
Essentially, they are types that the user can only find in the debugger
if they call them by name, they cannot be found by traversing a symbol.

In practice, around 80% of type information in a PDB is referenced by a
symbol. That seems like a reasonable number.

I don't really plan to do anything with this tool. It mostly just exists
for informational purposes, and to confirm that we probably don't need
to implement type reference tracking in LLD. We can continue to merge
all types as we do today without wasting space.

Event Timeline

rnk created this revision.Mar 20 2019, 4:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2019, 4:22 PM
zturner accepted this revision.Mar 20 2019, 4:26 PM

If we ever wanted more aggressive pruning of dead type information, it probably makes sense to do that as a post-processing step where we re-write the PDB, something like an llvm-pdbutil gc subcommand.

This revision is now accepted and ready to land.Mar 20 2019, 4:26 PM
aganea added a comment.EditedMar 21 2019, 5:45 AM

If we ever wanted more aggressive pruning of dead type information, it probably makes sense to do that as a post-processing step where we re-write the PDB, something like an llvm-pdbutil gc subcommand.

@zturner You mean in terms of execution time, or in terms of code complexity in LLD?

@rnk At a later stage, would it be possible to add an option to only print the referenced/unreferenced types? ie. llvm-pdbutil dump -types=[all(default),ref,unref]

Just wanted to mention that we've just hit the 2 GB bar for PDBs in some of our production builds. So even if this patch provides a 20% reduction for PDBs, it would make a difference for us.
The other related issue is the symbol servers, which use CAB as a compressed file format. Unfortunately, CAB has a uncompressed file size limit of 2 GB, so we can't store them compressed anymore, which is a bit annoying to say the least.
I'll run your patch on some PDBs to see what numbers I come with.

Another question - do you think we could replace forward references by the the concrete ones?
ie.replace references to 0x100E with 0x1016, and remove 0x100E:

CHECK:   0x100E | LF_STRUCTURE [size = 108, unreferenced] `__vc_attributes::event_sourceAttribute`
CHECK:            unique name: `.?AUevent_sourceAttribute@__vc_attributes@@`
CHECK:            vtable: <no type>, base list: <no type>, field list: <no type>
CHECK:            options: forward ref (-> 0x1016) | has unique name, sizeof 0
CHECK:   0x1016 | LF_STRUCTURE [size = 108, unreferenced] `__vc_attributes::event_sourceAttribute`
CHECK:            unique name: `.?AUevent_sourceAttribute@__vc_attributes@@`
CHECK:            vtable: <no type>, base list: <no type>, field list: 0x1015
CHECK:            options: has ctor / dtor | contains nested class | has unique name, sizeof 12

Another question - do you think we could replace forward references by the the concrete ones?
ie.replace references to 0x100E with 0x1016, and remove 0x100E:

CHECK:   0x100E | LF_STRUCTURE [size = 108, unreferenced] `__vc_attributes::event_sourceAttribute`
CHECK:            unique name: `.?AUevent_sourceAttribute@__vc_attributes@@`
CHECK:            vtable: <no type>, base list: <no type>, field list: <no type>
CHECK:            options: forward ref (-> 0x1016) | has unique name, sizeof 0
CHECK:   0x1016 | LF_STRUCTURE [size = 108, unreferenced] `__vc_attributes::event_sourceAttribute`
CHECK:            unique name: `.?AUevent_sourceAttribute@__vc_attributes@@`
CHECK:            vtable: <no type>, base list: <no type>, field list: 0x1015
CHECK:            options: has ctor / dtor | contains nested class | has unique name, sizeof 12

I don't think we could replace forward references with the full declarations. Forward references exist to deal with cycles, as well as to make sure the entire type stream can be topologically sorted. Without forward references you wouldn't be able to do this. For example, consider this struct:

struct Node {
  Node *n;
};
rnk added a comment.Mar 21 2019, 10:14 AM

Another question - do you think we could replace forward references by the the concrete ones?

I don't think we could replace forward references with the full declarations. Forward references exist to deal with cycles, as well as to make sure the entire type stream can be topologically sorted.

While the linker certainly doesn't want to read PDBs with cycles in the type graph, it's possible that debuggers could handle it just fine. In fact, it might even speed things up. There are cases (unnamed types in C) where the compiler can't use forward references because there is no unique mangled name for the type, and the debuggers handle that.

I think it would be reasonable to add an option to LLD to write PDBs that replace all forward references with the appropriate complete type. Then we can experiment with debuggers and see if it works, what the size and perf effect is, etc.

zturner added a comment.EditedMar 21 2019, 10:20 AM
In D59620#1438186, @rnk wrote:

Another question - do you think we could replace forward references by the the concrete ones?

I don't think we could replace forward references with the full declarations. Forward references exist to deal with cycles, as well as to make sure the entire type stream can be topologically sorted.

While the linker certainly doesn't want to read PDBs with cycles in the type graph, it's possible that debuggers could handle it just fine. In fact, it might even speed things up. There are cases (unnamed types in C) where the compiler can't use forward references because there is no unique mangled name for the type, and the debuggers handle that.

I think it would be reasonable to add an option to LLD to write PDBs that replace all forward references with the appropriate complete type. Then we can experiment with debuggers and see if it works, what the size and perf effect is, etc.

I don't think you can have a cycle to an unnamed type though because it's impossible to reference it.

Or maybe you can, by doing something like:

struct {
  auto foo() { return this; }
  
  decltype(foo()) node;
};

I'd be curious to see how well the debugger handles that.

rnk added a comment.Mar 21 2019, 10:28 AM

I don't think you can have a cycle to an unnamed type though because it's impossible to reference it.

In C, yes, so far as I can tell it's impossible to reference an unnamed struct or union from inside it.

Or maybe you can, by doing something like:

struct {
  auto foo() { return this; }
  
  decltype(foo()) node;
};

I'd be curious to see how well the debugger handles that.

The decltype bit isn't necessary, just having a method is enough to make a cycle between the method type and the class. So, since pretty much all types in C++ have methods, MSVC gives all C++ types "unique" names. Of course, if they are at global scope, they may not in fact be unique across the whole program, but that just means there might type confusion later on during a debugging session.

For this example, MSVC creates a fwd decl with the unique name .?AU<unnamed-type-node>@@:

struct {
  auto foo() { return this; }
  int x;
} node;
int bar() { return node.foo()->x; }
In D59620#1438210, @rnk wrote:

I don't think you can have a cycle to an unnamed type though because it's impossible to reference it.

In C, yes, so far as I can tell it's impossible to reference an unnamed struct or union from inside it.

Or maybe you can, by doing something like:

struct {
  auto foo() { return this; }
  
  decltype(foo()) node;
};

I'd be curious to see how well the debugger handles that.

The decltype bit isn't necessary, just having a method is enough to make a cycle between the method type and the class. So, since pretty much all types in C++ have methods, MSVC gives all C++ types "unique" names. Of course, if they are at global scope, they may not in fact be unique across the whole program, but that just means there might type confusion later on during a debugging session.

For this example, MSVC creates a fwd decl with the unique name .?AU<unnamed-type-node>@@:

struct {
  auto foo() { return this; }
  int x;
} node;
int bar() { return node.foo()->x; }

The reason I put the variable with the decltype is because it's the easiest way to force the debugger to deal with the cycle. Without that, sure you can generate a cycle in the debug info, but it's not clear under what conditions the debugger would actually require that record.

This way, it's easy to add a watch in the debugger for foo.node and the debugger is forced to resolve the cycle.

rnk added a comment.Mar 21 2019, 10:51 AM

The reason I put the variable with the decltype is because it's the easiest way to force the debugger to deal with the cycle. Without that, sure you can generate a cycle in the debug info, but it's not clear under what conditions the debugger would actually require that record.

This way, it's easy to add a watch in the debugger for foo.node and the debugger is forced to resolve the cycle.

Well, MSVC rejected it, so I had to remove it: https://gcc.godbolt.org/z/0V3lLY ¯\_(ツ)_/¯
If it did work, they'd still use the unique name.

This revision was automatically updated to reflect the committed changes.
rnk added a comment.Mar 21 2019, 11:03 AM

@rnk At a later stage, would it be possible to add an option to only print the referenced/unreferenced types? ie. llvm-pdbutil dump -types=[all(default),ref,unref]

(forgot to answer this) Definitely, we already have a -dependents -type-index=0x1005,0x1009 flag combo that lets users selectively dump some types, so adding another mode should be simple.

I ran llvmpdbutil dump -type-ref-stats for one of our game projects:

Build typePDB sizeType records referencedType bytes referencedPotential savings (bytes)
game, Static, Production, Win64, Gameplay Debug2 GB5,914,383 / 7,050,382 83.89%377,342,432 / 536,798,320 70.30%~159 MB
game, Static, Production, Win64, Release1.9 GB5,888,309 / 7,025,672 83.81%376,582,688 / 535,410,704 70.34%~159 MB
game DLL A, Production, Win64, Debug972 MB3,046,539 / 3,493,601 87.20%187,676,328 / 287,416,200 65.30%~100 MB
game DLL A, Production, Win64, Gameplay Debug953 MB3,041,243 / 3,481,091 87.36%187,480,768 / 286,838,516 65.36%~99 MB
game DLL A, Production, Win64, Release928 MB3,031,469 / 3,468,271 87.41%187,134,900 / 286,046,296 65.42%~99 MB
game DLL B, Production, Win64, Debug928 MB2,877,786 / 3,478,483 82.73%187,420,152 / 253,259,260 74.00%~66 MB
game DLL B, Production, Win64, Gameplay Debug842 MB2,859,184 / 3,406,422 83.94%186,752,780 / 246,835,552 75.66%~60 MB
game DLL C, Production, Win64, Release246 MB1,052,608 / 1,207,096 87.20%75,336,256 / 95,823,496 78.62%~20 MB
game DLL D, Production, Win64, Release75 MB406,147 / 524,558 77.43%26,030,952 / 34,267,400 75.96%~8 MB
game, Engine, Win64, Release1.5 GB4,715,892 / 5,718,803 82.46%309,142,248 / 407,502,104 75.86%~98 MB
game, Engine, Win64, Profile1.2 GB4,244,060 / 5,177,703 81.97%275,567,308 / 367,489,660 74.99%~92 MB
game, Engine, Win64, Retail981 MB3,936,867 / 4,848,175 81.20%257,537,964 / 347,997,212 74.01%~90 MB

The savings seem quite interesting if we had this in LLD or as a post-processing step as Zachary suggested.

thakis added a subscriber: thakis.Mar 22 2019, 11:48 AM

Zach, are you saying this should be a postprocessing step since not writing dead information in the first place would be very slow?

If we ever wanted more aggressive pruning of dead type information, it probably makes sense to do that as a post-processing step where we re-write the PDB, something like an llvm-pdbutil gc subcommand.

@zturner You mean in terms of execution time, or in terms of code complexity in LLD?

@rnk At a later stage, would it be possible to add an option to only print the referenced/unreferenced types? ie. llvm-pdbutil dump -types=[all(default),ref,unref]

Just wanted to mention that we've just hit the 2 GB bar for PDBs in some of our production builds. So even if this patch provides a 20% reduction for PDBs, it would make a difference for us.
The other related issue is the symbol servers, which use CAB as a compressed file format. Unfortunately, CAB has a uncompressed file size limit of 2 GB, so we can't store them compressed anymore, which is a bit annoying to say the least.
I'll run your patch on some PDBs to see what numbers I come with.

We're at 1.5GB PDB size for chrome too (with split main dlls, something we might want to undo), so for Chromium PDB size reductions are useful too.