This is an archive of the discontinued LLVM Phabricator instance.

ThinLTOBitcodeWriter: Write available_externally copies of VCP eligible functions to merged module.
ClosedPublic

Authored by pcc on Feb 7 2017, 7:52 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc created this revision. Feb 7 2017, 7:52 PM
pcc added a comment. Feb 7 2017, 7:56 PM

Note: I patched in https://github.com/pcc/llvm-project/commit/5a5904d6721f895eafdd2fc476872b98806c36e6 to measure perf impact. Total wall time spent in addRegularLTO when linking chrome was 8.2743s, as compared to about 6 seconds before (see D27324).

pcc updated this revision to Diff 87896. Feb 9 2017, 3:37 PM
  • Use AARGetter to simplify and make this pass easier to use from the new PM
tejohnson edited edge metadata. Feb 10 2017, 9:18 PM

I assume these get dropped in the usual place where we eliminate available externally. Does VCP benefit from any of the other optimizations we would typically apply before that point (e.g. inlining)?

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
284 (On Diff #87896)

Add comment about what these are testing (I think I have correctly reasoned them out, but would be good to be explicit).

297 (On Diff #87896)

There are now comments about the eligible virtual functions being added to the regular LTO module, but I see no comments about the original condition here for putting a GVar in the regular LTO module -- maybe add that, or point to someplace that describes what we're doing.

llvm/test/Transforms/ThinLTOBitcodeWriter/split-vfunc-internal.ll
9 (On Diff #87896)

Where do the long hex strings come from in the names in this test?

mehdi_amini edited edge metadata. Feb 10 2017, 9:23 PM
In D29701#670366, @pcc wrote:

> Note: I patched in https://github.com/pcc/llvm-project/commit/5a5904d6721f895eafdd2fc476872b98806c36e6 to measure perf impact. Total wall time spent in addRegularLTO when linking chrome was 8.2743s, as compared to about 6 seconds before (see D27324).

So >30% overhead on addRegularLTO, but what is the overhead for runRegularLTO?

mehdi_amini added inline comments. Feb 10 2017, 9:30 PM
llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
311 (On Diff #87896)

(One-line comment)

pcc marked an inline comment as done. Feb 13 2017, 6:08 PM
> In D29701#670366, @pcc wrote:
>> Note: I patched in https://github.com/pcc/llvm-project/commit/5a5904d6721f895eafdd2fc476872b98806c36e6 to measure perf impact. Total wall time spent in addRegularLTO when linking chrome was 8.2743s, as compared to about 6 seconds before (see D27324).
>
> So >30% overhead on addRegularLTO, but what is the overhead for runRegularLTO?

Just did a fresh set of runs on chrome with https://github.com/pcc/llvm-project/commits/lto-timers patched in, and took median of 5 for each measurement:

                before   after
 addRegularLTO  8.0407s  9.7432s  (+21%)
 runRegularLTO  9.6064s  10.4748s (+9%)

Note that we haven't done everything we can -- a couple more things I can think of are to exclude functions that take non-integer arguments (other than "this"), and to set optnone on functions in the merged module. Taking a look at chrome's merged module, there are 6727 available_externally functions, of which 1340 take a pointer argument other than "this".

pcc added a comment.Feb 13 2017, 6:21 PM

> I assume these get dropped in the usual place where we eliminate available externally.

Yes (i.e. at the end of the regular LTO pipeline).

> Does VCP benefit from any of the other optimizations we would typically apply before that point (e.g. inlining)?

I suppose that is possible in principle, but I wouldn't typically expect it to help; I've found that VCP-eligible functions tend to be relatively trivial functions that are fully inlined into leaf functions at compile time.

pcc updated this revision to Diff 88293. Feb 13 2017, 6:37 PM
pcc marked 2 inline comments as done.
  • Address review comments
pcc added inline comments. Feb 13 2017, 6:37 PM
llvm/test/Transforms/ThinLTOBitcodeWriter/split-vfunc-internal.ll
9 (On Diff #87896)

They are module IDs. See getModuleId in ThinLTOBitcodeWriter.

pcc added a comment. Feb 13 2017, 6:46 PM
In D29701#675809, @pcc wrote:

> Just did a fresh set of runs on chrome with https://github.com/pcc/llvm-project/commits/lto-timers patched in, and took median of 5 for each measurement:
>
>                 before   after
>  addRegularLTO  8.0407s  9.7432s  (+21%)
>  runRegularLTO  9.6064s  10.4748s (+9%)
>
> Note that we haven't done everything we can -- a couple more things I can think of are to exclude functions that take non-integer arguments (other than "this"), and to set optnone on functions in the merged module. Taking a look at chrome's merged module, there are 6727 available_externally functions, of which 1340 take a pointer argument other than "this".

Another data point for chrome is that this change causes us to merge 53487 struct type definitions, most of which are likely unnecessary (I think they are either associated with "this", which we make sure is unused, or one of the other pointer arguments, which means we don't need the function). So something else we could do that would be a little more tricky is to try to drop the type information for "this". (Of course, this would happen for free if we had typeless pointers.)

pcc updated this revision to Diff 88295. Feb 13 2017, 7:01 PM
  • Check function arguments
mehdi_amini accepted this revision. Feb 13 2017, 7:30 PM
In D29701#675840, @pcc wrote:

> In D29701#675809, @pcc wrote:
>> Just did a fresh set of runs on chrome with https://github.com/pcc/llvm-project/commits/lto-timers patched in, and took median of 5 for each measurement:
>>
>>                 before   after
>>  addRegularLTO  8.0407s  9.7432s  (+21%)
>>  runRegularLTO  9.6064s  10.4748s (+9%)

Quite interesting that we take more time to build and merge the IR than to transform it and emit code :)

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
283–284 (On Diff #88295)

great comments!

This revision is now accepted and ready to land. Feb 13 2017, 7:30 PM
This revision was automatically updated to reflect the committed changes.