This is an archive of the discontinued LLVM Phabricator instance.

IPO: Introduce ThinLTOBitcodeWriter pass.
ClosedPublic

Authored by pcc on Dec 1 2016, 4:59 PM.

Details

Summary

This pass prepares a module containing type metadata for ThinLTO by splitting
it into regular and thin LTO parts if possible, and writing both parts to
a multi-module bitcode file. Modules that do not contain type metadata are
written unmodified as a single module.

All globals with type metadata are added to the regular LTO module, and
the rest are added to the thin LTO module.
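
For illustration, a minimal sketch of the check the summary implies, written against the LLVM C++ API; the helper name is hypothetical and, per the discussion below, the pass at this point only looks for type metadata on global variables:

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Metadata.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Does any global variable in M carry !type metadata, i.e. is it a vtable
  // that CFI or whole-program devirtualization will want to reason about?
  // If not, the module can be written unmodified; if so, the module is split
  // and the type-metadata-carrying globals go to the regular LTO half.
  static bool moduleHasTypeMetadata(Module &M) {
    SmallVector<MDNode *, 2> Types;
    for (GlobalVariable &GV : M.globals()) {
      Types.clear();
      GV.getMetadata(LLVMContext::MD_type, Types);
      if (!Types.empty())
        return true;
    }
    return false;
  }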

Event Timeline

pcc updated this revision to Diff 79999.Dec 1 2016, 4:59 PM
pcc retitled this revision from to IPO: Introduce ThinLTOBitcodeWriter pass..
pcc updated this object.
pcc added reviewers: mehdi_amini, tejohnson.
pcc added a subscriber: llvm-commits.
tejohnson edited edge metadata.Dec 2 2016, 10:19 AM

Just skimmed the code so far, and have a high-level question. Why a new pass instead of building this into the existing WriteBitcodePass as an option? It's confusing because that pass also has the ability to write bitcode for ThinLTO. Is the intention to use this new pass instead of WriteBitcodePass when we are building with -flto=thin (i.e. from clang)? Seems cleaner to use a single pass and not have clang worry about which to invoke. Also avoids redundancy in the ThinLTO handling.

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
201

The WriteBitcodePass instead uses the ModuleSummaryIndexAnalysis pass to construct the index. This pass should be consistent with that.

pcc added a comment.Dec 2 2016, 11:01 AM

Just skimmed the code so far, and have a high-level question. Why a new pass instead of building this into the existing WriteBitcodePass as an option?

I see the WriteBitcodePass as a "pure" serialization pass, whereas this pass will do whatever is required to write out the module for ThinLTO. I also see this pass as potentially diverging a lot from WriteBitcodePass, e.g. it could be doing code generation for non-importable functions (as I think we discussed a long time ago), which is way outside the bounds of what WriteBitcodePass is or should be doing.

It's confusing because that pass also has the ability to write bitcode for ThinLTO.

To a certain extent that's incidental to the pass's ability to write a summary. I suspect that we'll just want to use that capability in "opt" for testing purposes and use this new pass in production pipelines.

Is the intention to use this new pass instead of WriteBitcodePass when we are building with -flto=thin (i.e. from clang)?

Yes.

Seems cleaner to use a single pass and not have clang worry about which to invoke.

In the end some code does need to make some choice about which code path to take. The question is whether we want separate passes or one pass that does everything and takes a ton of flags. It may come down to personal preference but at least I think that avoiding flags where possible makes the code more readable.

Also avoids redundancy in the ThinLTO handling.

To be honest it's not a lot of code; most of the implementation is in BitcodeWriter, which is already shared.

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
201

Note that this is a separate module from the "main" module. We can't just use the main module summary from the pass manager because we may have renamed some of the globals to export them. Running a separate pass manager on the newly created module to build the summary also seems like the wrong approach; as I mentioned in the FIXME, I think we'll eventually want to pull profiling information out of the main module analysis and match it up with the globals in ThinM. That would require going outside the bounds of what the pass manager supports (or needs to support, I'd argue).

mehdi_amini edited edge metadata.Dec 2 2016, 11:03 AM

Another high-level comment: technically I don't think we need to split anything for type-based LTO devirtualization.

pcc added a comment.Dec 2 2016, 11:15 AM

Another high-level comment: technically I don't think we need to split anything for type-based LTO devirtualization.

It may be possible to implement vcall opt without splitting, but it certainly seems a lot more complicated. For example virtual const prop works by "evaluating" each virtual function; we'd somehow need to figure out which virtual functions can be evaluated and store the values in the summary.

Not reading any IR during ThinLTO is important; I'd consider reading it only if we really can't do otherwise. But I'd look at every option/tradeoff first and make sure the design does not lock us in.

pcc added a comment.Dec 2 2016, 11:33 AM

Not reading any IR during ThinLTO is important; I'd consider reading it only if we really can't do otherwise. But I'd look at every option/tradeoff first and make sure the design does not lock us in.

I intend the amount of IR we read during ThinLTO to be minimal: basically just the vtables and a small number of virtual functions (for vcall opt). The fact that we're reading IR at all seems less important than the fact that we're reading very little of it; I don't want to be in a situation where we keep adding things to the summary only to discover that we've reinvented parts of the IR and are simply reading it via a "summary" code path.

pcc added a comment.Dec 2 2016, 11:35 AM
In D27324#612053, @pcc wrote:

Not reading any IR during ThinLTO is important; I'd consider reading it only if we really can't do otherwise. But I'd look at every option/tradeoff first and make sure the design does not lock us in.

I intend the amount of IR we read during ThinLTO to be minimal: basically just the vtables and a small number of virtual functions (for vcall opt). The fact that we're reading IR at all seems less important than the fact that we're reading very little of it; I don't want to be in a situation where we keep adding things to the summary only to discover that we've reinvented parts of the IR and are simply reading it via a "summary" code path.

(and incidentally, this is part of why I've been pushing on the typeless pointer work lately; it would allow us to strip a lot of useless data from the IR in these cases)

In D27324#611987, @pcc wrote:

Just skimmed the code so far, and have a high-level question. Why a new pass instead of building this into the existing WriteBitcodePass as an option?

I see the WriteBitcodePass as a "pure" serialization pass, whereas this pass will do whatever is required to write out the module for ThinLTO. I also see this pass as potentially diverging a lot from WriteBitcodePass, e.g. it could be doing code generation for non-importable functions (as I think we discussed a long time ago), which is way outside the bounds of what WriteBitcodePass is or should be doing.

It's confusing because that pass also has the ability to write bitcode for ThinLTO.

To a certain extent that's incidental to the pass's ability to write a summary. I suspect that we'll just want to use that capability in "opt" for testing purposes and use this new pass in production pipelines.

Why would we want to write ThinLTO bitcode one way in opt and another way in the production pipelines? I think we should keep it simple and have one way to write the ThinLTO bitcode.

Is the intention to use this new pass instead of WriteBitcodePass when we are building with -flto=thin (i.e. from clang)?

Yes.

Seems cleaner to use a single pass and not have clang worry about which to invoke.

In the end some code does need to make some choice about which code path to take. The question is whether we want separate passes or one pass that does everything and takes a ton of flags. It may come down to personal preference but at least I think that avoiding flags where possible makes the code more readable.

But I assume the intention is to always do this under -flto=thin, in which case there's just one parameter, analogous to the existing BitcodeWriterPass's EmitSummaryIndex. I have less of a strong feeling about having this as a separate pass though if we can (eventually) remove the existing ThinLTO handling from BitcodeWriterPass. But I would imagine the splitting should stay optional for the time being.

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
201

Ok, that makes sense.

In D27324#612037, @pcc wrote:

Another high-level comment: technically I don't think we need to split anything for type-based LTO devirtualization.

It may be possible to implement vcall opt without splitting, but it certainly seems a lot more complicated. For example virtual const prop works by "evaluating" each virtual function; we'd somehow need to figure out which virtual functions can be evaluated and store the values in the summary.

I haven't thought through this, but is it possible to evaluate the virtual functions at compile time (without WPA)? I.e. is the issue deciding which virtual functions can be evaluated, doing the evaluation, or how to store that in the summary?

pcc added a comment.Dec 2 2016, 12:22 PM
In D27324#611987, @pcc wrote:

Just skimmed the code so far, and have a high-level question. Why a new pass instead of building this into the existing WriteBitcodePass as an option?

I see the WriteBitcodePass as a "pure" serialization pass, whereas this pass will do whatever is required to write out the module for ThinLTO. I also see this pass as potentially diverging a lot from WriteBitcodePass, e.g. it could be doing code generation for non-importable functions (as I think we discussed a long time ago), which is way outside the bounds of what WriteBitcodePass is or should be doing.

It's confusing because that pass also has the ability to write bitcode for ThinLTO.

To a certain extent that's incidental to the pass's ability to write a summary. I suspect that we'll just want to use that capability in "opt" for testing purposes and use this new pass in production pipelines.

Why would we want to write ThinLTO bitcode one way in opt and another way in the production pipelines? I think we should keep it simple and have one way to write the ThinLTO bitcode.

The idea was to allow testing something about the summary writer in isolation without worrying about the rest of the pipeline.

Is the intention to use this new pass instead of WriteBitcodePass when we are building with -flto=thin (i.e. from clang)?

Yes.

Seems cleaner to use a single pass and not have clang worry about which to invoke.

In the end some code does need to make some choice about which code path to take. The question is whether we want separate passes or one pass that does everything and takes a ton of flags. It may come down to personal preference but at least I think that avoiding flags where possible makes the code more readable.

But I assume the intention is to always do this under -flto=thin, in which case there's just one parameter, analogous to the existing BitcodeWriterPass's EmitSummaryIndex. I have less of a strong feeling about having this as a separate pass though if we can (eventually) remove the existing ThinLTO handling from BitcodeWriterPass. But I would imagine the splitting should stay optional for the time being.

The idea is that splitting would only happen if you have enabled a feature that needs it (that's what requiresSplit is testing for), so there should be no functional change for existing users once we switch clang to using ThinLTOBitcodeWriterPass.

I'm not sure if this is a good idea yet, but maybe there should just be two bitcode writer passes: LTOBitcodeWriterPass and ThinLTOBitcodeWriterPass. Neither would take flags, and basically they would be the only supported "production" passes for writing LTO bitcode. If something like "opt" wants to go under the hood and test a specific feature of the bitcode writer (e.g. the summary writer), it can run a pass pipeline without a bitcode writer and call the writer itself.

pcc added a comment.Dec 2 2016, 12:30 PM
In D27324#612037, @pcc wrote:

Another high-level comment: technically I don't think we need to split anything for type-based LTO devirtualization.

It may be possible to implement vcall opt without splitting, but it certainly seems a lot more complicated. For example virtual const prop works by "evaluating" each virtual function; we'd somehow need to figure out which virtual functions can be evaluated and store the values in the summary.

I haven't thought through this, but is it possible to evaluate the virtual functions at compile time (without WPA)? I.e. is the issue deciding which virtual functions can be evaluated, doing the evaluation, or how to store that in the summary?

I think it's more about doing the evaluation. Basically we support virtual const prop on functions which take arguments that are constant at the call site. Currently this works by scanning the module for calls and evaluating the function for each set of constant arguments that are actually passed. To make this work on a whole-program basis we'd either need to evaluate functions with every possible set of arguments (probably too expensive in non-trivial cases) or somehow encode the function body in the "summary" (which at that point wouldn't really be a summary at all).
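
For concreteness, a small source-level example of the kind of call virtual const prop targets (illustrative only, not taken from the patch):

  // Every override of sides() returns a constant for the (empty) argument
  // list seen at the call sites, so with whole-program visibility the
  // optimizer can evaluate each implementation once, store the results as
  // data laid out alongside the vtables, and replace the virtual call with
  // a load.
  struct Shape {
    virtual int sides() const = 0;
  };
  struct Triangle : Shape {
    int sides() const override { return 3; }
  };
  struct Square : Shape {
    int sides() const override { return 4; }
  };

  int countSides(const Shape &S) {
    return S.sides(); // after virtual const prop: a load, not a call
  }

Evaluating the overrides requires their bodies and the vtables to be visible together, which is what moving them into the regular LTO half provides.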

(At this point I agree with Teresa on the structure of the writer; it is not clear to me why we would change it.)

In D27324#612167, @pcc wrote:
In D27324#612037, @pcc wrote:

Another high-level comment: technically I don't think we need to split anything for type-based LTO devirtualization.

It may be possible to implement vcall opt without splitting, but it certainly seems a lot more complicated. For example virtual const prop works by "evaluating" each virtual function; we'd somehow need to figure out which virtual functions can be evaluated and store the values in the summary.

I haven't thought through this, but is it possible to evaluate the virtual functions at compile time (without WPA)? I.e. is the issue deciding which virtual functions can be evaluated, doing the evaluation, or how to store that in the summary?

I think it's more about doing the evaluation. Basically we support virtual const prop on functions which take arguments that are constant at the call site. Currently this works by scanning the module for calls and evaluating the function for each set of constant arguments that are actually passed. To make this work on a whole-program basis we'd either need to evaluate functions with every possible set of arguments (probably too expensive in non-trivial cases) or somehow encode the function body in the "summary" (which at that point wouldn't really be a summary at all).

Or encode it as a function of to-be-determined constant parameter values, and encode the call sites' constant parameter values in the summary... ok, I agree that this is more complex than needed, at least for now. It seems simpler to do the splitting for now.

pcc added a comment.Dec 5 2016, 3:38 PM

Mehdi asked me to look at the cost of the split module design. To do that, I took my prototype [0] and made a change [1] to shrink the size of the regular LTO module. With that the total time spent in LTO::addRegularLTO and LTO::addThinLTO on my machine during a ThinLTO link of Chromium is:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 6.1434 ( 58.6%)   0.2158 ( 36.2%)   6.3591 ( 57.4%)   6.1709 ( 55.6%)  LTO::addRegularLTO
 4.3320 ( 41.4%)   0.3805 ( 63.8%)   4.7125 ( 42.6%)   4.9269 ( 44.4%)  LTO::addThinLTO
10.4754 (100.0%)   0.5963 (100.0%)  11.0717 (100.0%)  11.0978 (100.0%)  Total

So that's about 6 seconds spent loading and linking all of Chromium's vtables, as compared to 5 seconds spent building the combined summary. If the figures were an order of magnitude off I'd be concerned, but these figures seem reasonable enough to me.

[0] https://github.com/pcc/llvm-project/tree/cfi-thinlto
[1] https://github.com/pcc/llvm-project/commit/de5bd8fe4c05c6e9aecf1e5384ef5553bf7332f0

Does this include all the codegen as well?

Also, what is the story for incremental build with respect to ThinLTO devirtualization?

pcc added a comment.Dec 5 2016, 4:06 PM

Does this include all the codegen as well?

No, I'll measure that as well.

Also, what is the story for incremental build with respect to ThinLTO devirtualization?

Basically we want to cache the combined module and the part of the combined summary that contains the resolutions for each of the type tests. The cache would be keyed on the regular LTO module hashes. We'd only copy the needed resolutions into the individual summaries like we do for the import/export lists.
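
A rough sketch of that keying, assuming the per-module hash is the five 32-bit words of each input's MODULE_CODE_HASH record (the function name and plumbing here are hypothetical):

  #include "llvm/ADT/Hashing.h"
  #include <array>
  #include <cstdint>
  #include <vector>

  // LLVM records a hash of each module's bitcode as five 32-bit words
  // (MODULE_CODE_HASH). Combining the hashes of all regular LTO inputs gives
  // a key for the cached combined-module output: if none of the
  // vtable-carrying halves changed, the cached output and the type test
  // resolutions can be reused.
  using ModuleHash = std::array<uint32_t, 5>;

  llvm::hash_code regularLTOCacheKey(const std::vector<ModuleHash> &Inputs) {
    llvm::hash_code Key = llvm::hash_value(Inputs.size());
    for (const ModuleHash &H : Inputs)
      Key = llvm::hash_combine(Key, H[0], H[1], H[2], H[3], H[4]);
    return Key;
  }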

pcc added a comment.Dec 5 2016, 4:08 PM

Basically we want to cache the combined module

By which I mean the object file representing the combined module, sorry.

In D27324#613959, @pcc wrote:

Basically we want to cache the combined module

By which I mean the object file representing the combined module, sorry.

I'm not totally sure about this (the time to hash the IR may be almost as long as the codegen for this module if it contains "only" the Vtables). But that's a detail that can be tuned later.

I'm more interested in understanding the flow of devirtualization in particular. How do we know which type resolutions impact which ThinLTO backend?
I don't have all the pieces of the flow yet.

pcc added a comment.Dec 5 2016, 4:32 PM
In D27324#613959, @pcc wrote:

Basically we want to cache the combined module

By which I mean the object file representing the combined module, sorry.

I'm not totally sure about this (the time to hash the IR may be almost as long as the codegen for this module if it contains "only" the Vtables). But that's a detail that can be tuned later.

At least the LowerTypeTests pass needed for CFI can get expensive, so I'd like to be able to cache its output. But it may be possible to optimize that pass further (the pass hasn't historically shown up in profiles simply because it was being run along with the other regular LTO passes), so ok, let's think about what to do about it later.

FWIW for hashing I was thinking that we'd just read the MODULE_CODE_HASH.

I'm more interested in understanding the flow of devirtualization in particular. How do we know which type resolutions impact which ThinLTO backend?
I don't have all the pieces of the flow yet.

This was discussed on the original RFC thread, and the conclusion was at the end of this message:
http://lists.llvm.org/pipermail/llvm-dev/2016-October/106628.html
Basically the information is stored in the individual summaries.

pcc added a comment.Dec 5 2016, 4:36 PM

Here are the results after applying https://github.com/pcc/llvm-project/commit/07bc98867d232add3bd823a7e6e0b1257ff836ca to collect more timing information:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
190.1413 ( 94.1%)   1.2036 ( 59.0%)  191.3449 ( 93.8%)  191.2808 ( 93.8%)  lto::backend opt
 6.3190 (  3.1%)   0.2526 ( 12.4%)   6.5716 (  3.2%)   6.2862 (  3.1%)  LTO::addRegularLTO
 4.5287 (  2.2%)   0.3791 ( 18.6%)   4.9079 (  2.4%)   5.1791 (  2.5%)  LTO::addThinLTO
 1.0622 (  0.5%)   0.2041 ( 10.0%)   1.2663 (  0.6%)   1.2656 (  0.6%)  lto::backend codegen
202.0513 (100.0%)   2.0394 (100.0%)  204.0907 (100.0%)  204.0117 (100.0%)  Total

In other words: as expected, most of the regular LTO time is being spent in the optimizer, and the time spent in the backend is quite small.

What does the 190s in the optimizer correspond to? (i.e. is it only building the combined "vtables module"?)

pcc added a comment.Dec 5 2016, 4:44 PM

What does the 190s in the optimizer correspond to? (i.e. is it only building the combined "vtables module"?)

Yes, that's the "vtables module". Note that it's probably just the LowerTypeTests pass being slow.

Ouch, that seems *huge* to me.

pcc added a comment.Dec 5 2016, 5:15 PM

Ouch, that seems *huge* to me.

As mentioned on IRC: the time spent in the midend is CFI-specific (and necessarily sequential), and I'm going to be looking at optimizing it anyway. The parts that would have similar timings regardless of CFI/devirtualization settings are "LTO::addRegularLTO" and "lto::backend codegen", and I'm satisfied with the timings for those parts.

pcc added a comment.Dec 6 2016, 2:22 PM

After D27484 (actually a slightly different version of that patch rebased onto my prototype branch) the timings look like this:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 5.8192 ( 40.7%)   0.1061 ( 17.5%)   5.9253 ( 39.8%)   5.6562 ( 37.7%)  LTO::addRegularLTO
 3.9290 ( 27.5%)   0.2029 ( 33.5%)   4.1319 ( 27.7%)   4.4836 ( 29.9%)  LTO::addThinLTO
 3.6604 ( 25.6%)   0.1720 ( 28.4%)   3.8325 ( 25.7%)   3.8303 ( 25.6%)  lto::backend opt
 0.8909 (  6.2%)   0.1242 ( 20.5%)   1.0151 (  6.8%)   1.0144 (  6.8%)  lto::backend codegen
14.2996 (100.0%)   0.6052 (100.0%)  14.9048 (100.0%)  14.9844 (100.0%)  Total

It seems like > 100% overhead for an incremental build, right?

Mmm, no, it does not include the thin-link. So hard to evaluate.

pcc added a comment.Dec 6 2016, 2:50 PM

With a timer for the thin-link phase (https://github.com/pcc/llvm-project/commit/9f389f9d1c16c34d32d040cfb55aa69e97d9ad9f) the timings are:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 6.2637 ( 28.1%)   0.2148 ( 10.6%)   6.4785 ( 26.7%)   6.2113 ( 25.5%)  LTO::addRegularLTO
 5.1410 ( 23.1%)   0.6442 ( 31.7%)   5.7852 ( 23.8%)   5.7871 ( 23.8%)  LTO::runThinLTO thin-link
 5.1463 ( 23.1%)   0.5565 ( 27.4%)   5.7028 ( 23.5%)   5.7000 ( 23.4%)  lto::backend opt
 4.4082 ( 19.8%)   0.3652 ( 18.0%)   4.7735 ( 19.7%)   5.0972 ( 20.9%)  LTO::addThinLTO
 1.2965 (  5.8%)   0.2523 ( 12.4%)   1.5487 (  6.4%)   1.5487 (  6.4%)  lto::backend codegen
22.2557 (100.0%)   2.0330 (100.0%)  24.2887 (100.0%)  24.3442 (100.0%)  Total

So that's 5.0972 + 5.7871 = 10.8843s for the sequential parts of ThinLTO and 6.2113 + 5.7000 + 1.5487 = 13.4600s for regular LTO.

pcc updated this revision to Diff 81649.Dec 15 2016, 1:06 PM
pcc edited edge metadata.
  • Drop unused globals, and drop type information from function declarations. This helps improve the performance of the regular LTO link phase.
mehdi_amini added inline comments.Dec 15 2016, 2:02 PM
llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
74

I believe we usually call this export "promotion" in ThinLTO.
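
(For readers unfamiliar with the term: "promotion" means making a module-local symbol visible across the module boundary, roughly as in this sketch; the helper name and renaming scheme are illustrative, not the patch's exact code.)

  #include "llvm/ADT/Twine.h"
  #include "llvm/IR/GlobalValue.h"
  using namespace llvm;

  // An internal global referenced from the other half of the split module
  // must be renamed to something unique and given non-local linkage so that
  // the two halves still link together; hidden visibility keeps it out of
  // the dynamic symbol table.
  static void promoteInternal(GlobalValue &GV, StringRef ModuleId) {
    if (!GV.hasLocalLinkage())
      return;
    GV.setName(GV.getName() + "$" + ModuleId); // e.g. a module-hash suffix
    GV.setLinkage(GlobalValue::ExternalLinkage);
    GV.setVisibility(GlobalValue::HiddenVisibility);
  }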

202

This seems like dead code to me; the function returns ("$" + Str).str(), which can't be empty.

211

"Regular" was misleading to me, because I see the ThinLTO module as the regular one, the other that only contains the VTable and the types is the "special" one.

212

Should this be a check on GlobalObject instead? Technically any GlobalObject can have MDs attached. I guess you assume that we'll never see type MD on anything other than a GlobalVariable?

218

Looks like we could have a method bool GlobalObject::hasMD(unsigned KindID);

226

I'm concerned about the cost: cloning a module isn't free.
Why do you need to clone the Thin module? Can't you just strip out of it what was moved to the other one?
Could we have an option on CloneModule to actually move globals directly, so that it does exactly what's needed?

296

} // anonymous namespace

pcc marked 4 inline comments as done.Dec 15 2016, 3:38 PM
pcc added inline comments.
llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
202

See line 63, also the test case unsplittable.ll.

211

Renamed to MergedM.

212

Yes, at this point we only support type metadata on GlobalVariables. We can also have type metadata attached to Functions, which is used to support CFI on indirect calls, but that will be implemented separately (and much differently).

218

It doesn't seem worth it to me; we're basically only doing this in this pass.

226

I added a filterModule function that operates on an existing module and used it here.
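
A sketch of what a filterModule-style helper can look like (the patch's real function handles more cases, e.g. aliases; this shows only the core idea, under the assumption that rejected locals have already been promoted, since declarations cannot have local linkage):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Filter a module in place rather than cloning it: globals the predicate
  // rejects keep their name and type but lose their definition, becoming
  // external declarations.
  static void filterModule(Module *M,
                           function_ref<bool(const GlobalValue *)> ShouldKeep) {
    for (GlobalVariable &GV : M->globals())
      if (!ShouldKeep(&GV)) {
        GV.setInitializer(nullptr);
        GV.setLinkage(GlobalValue::ExternalLinkage);
        GV.setComdat(nullptr);
      }
    for (Function &F : *M)
      if (!ShouldKeep(&F)) {
        F.deleteBody();
        F.setComdat(nullptr);
      }
  }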

pcc updated this revision to Diff 81679.Dec 15 2016, 3:38 PM
pcc marked 2 inline comments as done.
  • Address review comments
mehdi_amini accepted this revision.Dec 15 2016, 4:23 PM
mehdi_amini edited edge metadata.

LGTM.

llvm/lib/Transforms/IPO/ThinLTOBitcodeWriter.cpp
202

Oh right!

This revision is now accepted and ready to land.Dec 15 2016, 4:23 PM
This revision was automatically updated to reflect the committed changes.