Page MenuHomePhabricator

[ThinLTO] Add summary entries for index-based WPD
ClosedPublic

Authored by tejohnson on Nov 21 2018, 2:54 PM.

Details

Summary

If LTOUnit splitting is disabled, the module summary analysis computes
the summary information necessary to perform single implementation
devirtualization during the thin link with the index and no IR. The
information collected from the regular LTO IR in the current hybrid WPD
algorithm is summarized, including:

  1. For vtable definitions, record the function pointers and their offset

within the vtable initializer (subsumes the information collected from
IR by tryFindVirtualCallTargets).

  1. A record for each type metadata summarizing the vtable definitions

decorated with that metadata (subsumes the TypeIdentiferMap collected
from IR).

Also added are the necessary bitcode records, and the corresponding
assembly support.

The index-based WPD will be sent as a follow-on.

Depends on D53890.

Diff Detail

Event Timeline

tejohnson created this revision.Nov 21 2018, 2:54 PM
tejohnson updated this revision to Diff 176222.Nov 30 2018, 4:25 PM

Enable promotion of type ids in the non-split ThinLTO case and test it.

ormris added a subscriber: ormris.Jan 4 2019, 10:09 AM
ormris removed a subscriber: ormris.
ormris added a subscriber: ormris.

Update patch to reflect changes at head.

A very high level meta question: can thinLink phase read symbol table from each module? If yes, why duplicate the symtab information into summary?

A very high level meta question: can thinLink phase read symbol table from each module? If yes, why duplicate the symtab information into summary?

We do in fact require reading the IRSymtab along with the summary, e.g. the symbol strings are shared. What duplication are you concerned about? I don't think anything added in this patch would be available in the symbol table.

Vtable funcs info should be in the initializers of vtable global variables right?

Vtable funcs info should be in the initializers of vtable global variables right?

The IRSymtab doesn't hold the initializers of global variable symbols. It only holds the information necessary for LTO to provide information to the linkers to drive linker based symbol resolution.

As we discussed offline, one could imagine moving this information in the IRSymtab (it is currently in the IR MODULE_BLOCK), and then sharing with the summary, but this is a much bigger restructuring that would need to be considered separately. E.g. in the MODULE_BLOCK the global var prototype is in a different record than its definition initializer, and other than the string table there isn't any sharing between the info in the MODULE_BLOCK and the IRSymtab AFAIK. Before moving the initializer into the IRSymtab it would make sense to start by deduplicating other symbol information between the MODULE_BLOCK and the IRSymtab (I recall discussing this with @pcc at some point but don't recall what conclusion we came to). And then this would also require integration between the summary block and the IRSymtab which, again other than the string table, are separate entities without shared information or cross references.

One thing to note about this patch is that we already hold in the GLOBALVAR summary entry all the references to other symbols in its initializer (which are maintained as an array of essentially pointers to other summary index entries), however with the new FS_PERMODULE_VTABLE_GLOBALVAR_INIT_REFS record, and as well in the in memory summary representation, we end up essentially duplicating those references for vtable defs. I could presumably come up with a way of consolidating those. I had convinced myself that there weren't enough vtable defs to warrant too much restructuring, but let me see if I can come up with a clean way to do this that doesn't bloat non-vtable globalvar summaries.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 17 2019, 7:53 AM
This revision was automatically updated to reflect the committed changes.

This was a mistaken commit, reverting!

tejohnson reopened this revision.Jan 17 2019, 8:09 AM

Reverted in r351455.

Ping on review plus some numbers and a small improvement that I will upload shortly.

I did some measurements to see how much overhead we add to the bitcode files for an important internal app. The overhead from adding the new records is pretty minor. Most of the increase over current non-WPD ThinLTO bitcode sizes comes from the addition of the type test instructions and the type test metadata - this just comes from enabling the options to get those in the IR (independent of which type of WPD we do we'll need those). Comparing the current hybrid WPD bitcode with the new summarized WPD bitcode, the bitcode files are actually smaller with the index summarized type info compared to that with the current hybrid WPD. I believe this is due to the overhead of splitting the LTO unit into two modules. And the size of the minimized index-only bitcode we create just for the distributed thin link process in our build system is significantly smaller with the new approach - this goes back to my finding that we aren't able to split the LTO unit in many cases so we end up having to put all the IR for some of these files into the regular LTO partition.

All sizes below are in bytes. "minimized bitcode" means the additional index-only bitcode file for a distributed thin link emitted by clang with -fthin-link-bitcode (i.e. just contains symtab, summaries, strtab).

ThinLTO baseline (no LTO unit, no WPD):
aggregate bitcode size: 21464737248
aggregate minimized bitcode: 3251583939

ThinLTO hybrid WPD (current WPD implementation):
aggregate bitcode size: 22013358720 (2.6% bigger than baseline)
aggregate minimized bitcode: 7521978208 (1039% bigger than baseline)

ThinLTO index-only WPD (this patch):
aggregate bitcode size: 7521978208 (2.3% bigger than baseline, smaller than hybrid WPD)
aggregate minimized bitcode: 850885040 (28.6% bigger than baseline, much much smaller than current hybrid WPD)

Breaking down the increase in the "minimized bitcode" from this patch:

  • 3% of increase is due to the additional VTable definition record fields (the vptr ref/offset pairs)
  • 13% is due to the new Type ID metadata records (was about 14% before my latest tweak to make those more efficient that I will upload shortly)
  • The remainder of the increase as I noted earlier is simply due to the additional IR from the type info (which is going to be there regardless of WPD scheme).

So I don't think it is necessary to gate this on removing any redundancies of the vtable summaries with the IR bitcode, the increase from that is pretty small. And

tejohnson updated this revision to Diff 183660.Jan 25 2019, 4:38 PM

Use an abbrev id in the bitcode for the new TYPE_ID_METADATA records.
This makes them a bit more compact.

Two clarifications below:

Ping on review plus some numbers and a small improvement that I will upload shortly.

I did some measurements to see how much overhead we add to the bitcode files for an important internal app. The overhead from adding the new records is pretty minor. Most of the increase over current non-WPD ThinLTO bitcode sizes comes from the addition of the type test instructions and the type test metadata - this just comes from enabling the options to get those in the IR (independent of which type of WPD we do we'll need those). Comparing the current hybrid WPD bitcode with the new summarized WPD bitcode, the bitcode files are actually smaller with the index summarized type info compared to that with the current hybrid WPD. I believe this is due to the overhead of splitting the LTO unit into two modules. And the size of the minimized index-only bitcode we create just for the distributed thin link process in our build system is significantly smaller with the new approach - this goes back to my finding that we aren't able to split the LTO unit in many cases so we end up having to put all the IR for some of these files into the regular LTO partition.

All sizes below are in bytes. "minimized bitcode" means the additional index-only bitcode file for a distributed thin link emitted by clang with -fthin-link-bitcode (i.e. just contains symtab, summaries, strtab).

The minimized bitcode also contains anything that needs to be compiled in the regular LTO partition. Which for the hybrid WPD means all the vtables, plus all the IR in the case when we cannot split it but have vtable defs (which is why there is such a huge increase in that for the hybrid WPD below).

ThinLTO baseline (no LTO unit, no WPD):
aggregate bitcode size: 21464737248
aggregate minimized bitcode: 3251583939

ThinLTO hybrid WPD (current WPD implementation):
aggregate bitcode size: 22013358720 (2.6% bigger than baseline)
aggregate minimized bitcode: 7521978208 (1039% bigger than baseline)

ThinLTO index-only WPD (this patch):
aggregate bitcode size: 7521978208 (2.3% bigger than baseline, smaller than hybrid WPD)
aggregate minimized bitcode: 850885040 (28.6% bigger than baseline, much much smaller than current hybrid WPD)

Breaking down the increase in the "minimized bitcode" from this patch:

  • 3% of increase is due to the additional VTable definition record fields (the vptr ref/offset pairs)
  • 13% is due to the new Type ID metadata records (was about 14% before my latest tweak to make those more efficient that I will upload shortly)
  • The remainder of the increase as I noted earlier is simply due to the additional IR from the type info (which is going to be there regardless of WPD scheme).

For the minimized bitcode the increased IR shows up because the symtab and strtab are much larger.

So I don't think it is necessary to gate this on removing any redundancies of the vtable summaries with the IR bitcode, the increase from that is pretty small. And

Thanks for the data. It is certainly a great improvement over the hybrid WPD.

tejohnson updated this revision to Diff 184804.Feb 1 2019, 11:47 AM

Restore a couple of new tests that were lost when I updated the patch last.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 1 2019, 11:47 AM
davidxl added inline comments.Mar 12 2019, 9:18 AM
include/llvm/Bitcode/LLVMBitCodes.h
271 ↗(On Diff #184804)

It is a little difficult to parse the comment here. Is it just mapping from type id to global value id of the vtable? Also what is 'vtable definition decorated with a type metadata?'

include/llvm/IR/ModuleSummaryIndex.h
818 ↗(On Diff #184804)

Perhaps add that this is a mapping from type (id) to vtable info?

821 ↗(On Diff #184804)

Why is it a one-to-many mapping?

841 ↗(On Diff #184804)

TypeIdMetadataMap -- the name sounds too generic. Is it possible to rename it to carry more semantics?

1214 ↗(On Diff #184804)

Returns an ArrayRef?

tejohnson marked 5 inline comments as done.Mar 15 2019, 10:36 AM
tejohnson added inline comments.
include/llvm/Bitcode/LLVMBitCodes.h
271 ↗(On Diff #184804)

This is summarizing the type metadata described here:
https://llvm.org/docs/TypeMetadata.html

Each vtable def can be decorated with multiple type metadata that it is compatible with (due to inheritance). See the example below in test/Assembler/thinlto-vtable-summary.ll. E.g. B and C each derived from A so they each have 2 type metadata attachments. Therefore the typeIdMetadata summary for A includes both vtables.

I will try to add a more intuitive explanation, and point to the doc.

include/llvm/IR/ModuleSummaryIndex.h
818 ↗(On Diff #184804)

This one does not include the type id, it is the vtable def's ValueInfo and the byte offset from one of its attached type metadata (the offset is the offset into the vtable of the address point for that type as defined in the Itanium ABI: https://itanium-cxx-abi.github.io/cxx-abi/abi.html#vtable-general). The attached type metadata for a given vtable def are all the the types it is compatible with. It's a little hard to explain, the example in https://llvm.org/docs/TypeMetadata.html helps illustrate this. I can try to clarify and point to that doc.

821 ↗(On Diff #184804)

Because each type id may be compatible with multiple vtables, due to inheritance.

841 ↗(On Diff #184804)

Maybe TypeIdCompatibleVtableMap?

1214 ↗(On Diff #184804)

Yeah I can change this.

tejohnson updated this revision to Diff 196711.Apr 25 2019, 1:06 PM

Address review comments

@pcc - do you have any comments on this patch or are you ok with @davidxl completing the review?

I've done some renaming to try to make the types and variable names more intuitive. I've also added in explanations of the type metadata with pointers to the documentation in several places.

pcc added inline comments.
include/llvm/IR/ModuleSummaryIndex.h
683 ↗(On Diff #196711)

Instead of declaring this and TypeIdOffsetVtablePair as a pair, would it be better to declare them as structs with appropriately named fields? Seems like it would make the usage code easier to understand.

lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
447 ↗(On Diff #196711)

No else after return

tejohnson marked 2 inline comments as done.Apr 29 2019, 11:59 AM

Implement review suggestions

ping - @pcc any more comments?

pcc accepted this revision.Jun 28 2019, 7:06 PM

LGTM

lib/AsmParser/LLParser.cpp
834 ↗(On Diff #197155)

Redundant break

This revision is now accepted and ready to land.Jun 28 2019, 7:06 PM
tejohnson marked an inline comment as done.Jul 2 2019, 12:32 PM
tejohnson updated this revision to Diff 207604.Jul 2 2019, 12:37 PM

Address review comment, and update test for recent auto hide summary changes

This revision was automatically updated to reflect the committed changes.
Prazek added inline comments.Jul 2 2019, 12:50 PM
llvm/include/llvm/IR/ModuleSummaryIndex.h
896

Do we care about the alphabetical order in this mapping?
If not, the map can be changed to std::unordered_map (or other llvm-specific string map), which should be faster and more memory efficient.

llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
3691

Isn't this line redundant?

3857

Isn't this line redundant?

tejohnson marked 3 inline comments as done.Jul 2 2019, 1:20 PM
tejohnson added inline comments.
llvm/include/llvm/IR/ModuleSummaryIndex.h
896

We care about consistent ordering, as the iteration order will affect the order of the summary entries emitted into the bitcode. So unordered_map is not appropriate here. Ditto for StringMap.

llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
3691

yep, will fix.

3857

yep will fix.

Prazek marked an inline comment as done.Jul 2 2019, 1:33 PM
Prazek added inline comments.
llvm/include/llvm/IR/ModuleSummaryIndex.h
896

That's what I thought, sorry for late review and thanks for fixing other nits :)

tejohnson marked an inline comment as done.Jul 2 2019, 2:08 PM
tejohnson added inline comments.
llvm/include/llvm/IR/ModuleSummaryIndex.h
896

No prob, thanks for the comments. Committed fixes in r364968