Page MenuHomePhabricator

WholeProgramDevirt: introduce.
ClosedPublic

Authored by pcc on Feb 1 2016, 8:12 PM.

Details

Summary

This pass implements whole program optimization of virtual calls in cases
where we know (via bitset information) that the list of callee is fixed. This
includes the following:

  • Single implementation devirtualization: if a virtual call has a single possible callee, replace all calls with a direct call to that callee.
  • Virtual constant propagation: if the virtual function's return type is an integer <=64 bits and all possible callees are readnone, for each class and each list of constant arguments: evaluate the function, store the return value alongside the virtual table, and rewrite each virtual call as a load from the virtual table.
  • Uniform return value optimization: if the conditions for virtual constant propagation hold and each function returns the same constant value, replace each virtual call with that constant.
  • Unique return value optimization for i1 return values: if the conditions for virtual constant propagation hold and a single vtable's function returns 0, or a single vtable's function returns 1, replace each virtual call with a comparison of the vptr against that vtable's address.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc updated this revision to Diff 46612.Feb 1 2016, 8:12 PM
pcc retitled this revision from to WholeProgramDevirt: introduce..
pcc updated this object.
pcc added reviewers: mehdi_amini, pete, chandlerc.
pcc added subscribers: llvm-commits, kcc, hans.
pete edited edge metadata.Feb 2 2016, 3:20 PM

Hi Peter

This is great. Although also quite large, so apologies in advance if it takes a few rounds of review.

Firstly, can you do the evaluator stuff in another patch. I'm ok saying LGTM on that here assuming that you only extracted the code from GlobalOpt. That'll greatly reduce the scope of what we have here.

Also, I know Chandler has been passionate about data structure choices in the past. Luckily he's already CCed here, but i'd double check with him whether things like std::set and std::map are reasonable choices for what you're doing here.

Thanks,
Pete

lib/Transforms/IPO/LowerBitSets.cpp
923 ↗(On Diff #46612)

Is this needed for this patch? I know its just an optimization, but can it be committed separately?

lib/Transforms/IPO/PassManagerBuilder.cpp
640 ↗(On Diff #46612)

Before this is enabled in tree by default (I can see from the clang patch that you have an option to control this), we'll need perf testing and sign off from LTO folks. Mehdi is CCed which is good, but probably also need some other interested folks and some data to go over to justify the GlobalDCE pass as well as the new code in terms of compile/runtime.

lib/Transforms/IPO/WholeProgramDevirt.cpp
10–26 ↗(On Diff #46612)

The tests you've included are very good in terms of ensuring all of this is covered. However, can you please add negative tests for everything too. At least one test for each of the main points would be good, for example, a test returning an i128 doesn't get virtual constant propagation.

112 ↗(On Diff #46612)

Nitpicking: space between the ;;

115 ↗(On Diff #46612)

Please use () here around the + and / operators to make it clear what precedence you want.

174 ↗(On Diff #46612)

You're comparing pointers here. Is that safe/intended?

281 ↗(On Diff #46612)

This method is pretty huge. Can you break it out in to one method for each type of optimization you are doing? That would make it easier to review each chunk individually.

304 ↗(On Diff #46612)

This is highly unlikely, but there's no guarantee that the user is a CallInst here (or in any of the other similar casts in your patch).

Unfortunately, you could have something like 'call ... (bitcast i1 @llvm.bitset.test to ...)

307–310 ↗(On Diff #46612)

As you are using an intrinsic here, seems fair to just use a cast<MetadataAsValue> instead of report_fatal_error as either way you'll know if anyone changes the intrinsic types.

333 ↗(On Diff #46612)

You can use RecursivelyDeleteTriviallyDeadInstructions instead to also delete any dead arguments to the call.

pete added a subscriber: Gerolf.Feb 2 2016, 3:21 PM

Added Gerolf who is always interested in new LTO optimizations.

pcc updated this revision to Diff 46862.Feb 3 2016, 7:17 PM
pcc marked 4 inline comments as done.
pcc edited edge metadata.
  • Remove llvm:: in a few places
  • Parens
  • We can't use RecursivelyDeleteTriviallyDeadInstructions
  • Better casting
  • Move code into functions; fix some non-determinism
  • Use DenseSet
  • Simplify a little
  • Add negative tests
pcc added a comment.Feb 3 2016, 7:18 PM

Firstly, can you do the evaluator stuff in another patch. I'm ok saying LGTM on that here assuming that you only extracted the code from GlobalOpt.

Done in r259621.

lib/Transforms/IPO/LowerBitSets.cpp
923 ↗(On Diff #46612)

It isn't needed for this patch specifically. I've committed it separately in r259625.

lib/Transforms/IPO/PassManagerBuilder.cpp
640 ↗(On Diff #46612)

As one data point: I timed the GlobalDCE pass running over a full LTO module of Chromium at about 3 seconds on my machine. That's less than 1% of total runtime (about 20-30 mins), so I think it's tolerable. I'll also get numbers for this new pass, but I don't think it should affect code that doesn't use the whole-program-vtables feature.

lib/Transforms/IPO/WholeProgramDevirt.cpp
112 ↗(On Diff #46612)

That isn't what clang-format does, and I'd prefer to stick with that formatting where reasonable.

174 ↗(On Diff #46612)

This was safe before because I was iterating over an associated std::vector. But I realized that the code can just use a MapVector instead, so I've done that.

333 ↗(On Diff #46612)

It turns out that we can't do this too early because CI may be the only user of the vtable. It's probably simplest to let another pass clean up the other arguments later.

pete accepted this revision.Feb 5 2016, 7:21 PM
pete edited edge metadata.
In D16795#343746, @pcc wrote:
  • Remove llvm:: in a few places
  • Parens
  • Better casting
  • Move code into functions; fix some non-determinism
  • Use DenseSet
  • Simplify a little
  • Add negative tests

Thanks for all of those. They look good.

Regarding this one:

  • We can't use RecursivelyDeleteTriviallyDeadInstructions

I hadn't realized you were caching the found calls, so makes sense you can't delete them with this. Thanks for the comment in the code saying so.

Also, thanks for the extensive tests.

I've now gone through the whole patch. I was able to follow along very well so kudos for the good comments and sensible naming on everything.

I don't think there's anything i've commented on here that needs further review. Please fix any comments I had which you feel are reasonable, ignore those which should be ignored, and LGTM :)

lib/Transforms/IPO/PassManagerBuilder.cpp
640 ↗(On Diff #46862)

Thanks for the data. 3s isn't bad at all considering how big that Module is likely to be.

I'm going to take a look at what this does to LLVM itself when we link only a single backend, but my curiously there shouldn't block further progress on this.

lib/Transforms/IPO/WholeProgramDevirt.cpp
113 ↗(On Diff #46862)

Didn't realize that was the clang-format behavior. No problem keeping it the way it is then :)

208–210 ↗(On Diff #46862)

I don't tend to work with InvokeInst much. I trust that what you're doing here is correct, but please add a test to cover it just to make sure it keeps being ok in future.

295 ↗(On Diff #46862)

Nitpicking: Please swap the order here as checking 'VPtr == GEP->getPointerOperand()' is faster than GEP->hasAllConstantIndices()

296–298 ↗(On Diff #46862)

This is pretty gross (not your fault! I looked and most other places do exactly the same thing)

But I did see one that was much nicer. Its in ConstantFolding.cpp. Can you make yours similar? I might go back and fix the others in tree.

Note, it was getting a range from a vector of indices, but I think it might be possible to do the equivalent with op_begin() + 1 and op_end().

makeArrayRef((Value * const *)Ops.data() + 1, Ops.size() - 1)));

586 ↗(On Diff #46862)

Can you move this in to the 'if (!Assumes.empty()) {' below as I think its only used in there.

615 ↗(On Diff #46862)

Can we test for this being non-null as the very first thing in runModule? Or do you want to erase all the assume's as in the previous loop, even if you can't find the metadata?

If you do keep the conditional here, looks like you could move BitSets inside the scope, as well as moving the "// For each (bitset, offset) pair:" loop to the end of the conditional block

630–631 ↗(On Diff #46862)

In addition to these checks, it looks like BitSets is only used by tryFindVirtualCallTargets, and the first thing we do in there is check the following:

if (!BS.Bits->GV->isConstant())
    return false;

  auto Init = dyn_cast<ConstantArray>(BS.Bits->GV->getInitializer());
  if (!Init)
    return false;

Can we move those checks here so that we don't even populate BitSets[] entries unless these properties hold?

662–669 ↗(On Diff #46862)

I think you could have a 'bool Changed' to track if tryVirtualConstProp ever did anything and only run this loop if thats the case. Is that right?

677–679 ↗(On Diff #46862)

Please do something like 'I = 0, Size = B.Before.Bytes.Size()' so that you can avoid reevaluating the size on each iteration and inside the loop body

702 ↗(On Diff #46862)

No idea if this is possible, so feel free to ignore, but if the original VT had an explicit section on it, would either NewGV or Alias here need to inherit that section? I have no idea what the correct answer would be.

For safety sake, we could always exclude VTables with explicit sections from this optimization until we have a good answer.

Either way, please add a test with an explicit section and make sure we either propagate it as required, or don't run this optimization on it.

This revision is now accepted and ready to land.Feb 5 2016, 7:21 PM
mehdi_amini edited edge metadata.Feb 5 2016, 11:19 PM

Looking what's involved, I am not convinced that the current "bitset"/assume intrinsics structure is the most straightforward/efficient representation to communicate the information from the front-end to the analysis. It seems to complicate quite noticeably the implementation here.

lib/Transforms/IPO/PassManagerBuilder.cpp
644 ↗(On Diff #46862)

This is the beginning of the pipeline right? When do we have unused virtual tables emitted by the front-end? I don't expect this to happen.

lib/Transforms/IPO/WholeProgramDevirt.cpp
216 ↗(On Diff #46862)

That's not friendly to the new pass manager, can you refactor with a very thin wrapper class inheriting from ModulePass and a class for the Devirtualizer itself?

258 ↗(On Diff #46862)

There is a INITIALIZE_PASS macro that does the job if you don't need dependencies.

558 ↗(On Diff #46862)

This function is pretty long and could probably be split.

602 ↗(On Diff #46862)

It seems to me that you have to reconstruct information that could have been generated by the front-end, using a representation like the intrinsic I sent by email.
This would make completely straightforward since you'd have all the uses off-hand.

616 ↗(On Diff #46862)

Could be documented (and renamed maybe?)

pcc updated this revision to Diff 47261.Feb 8 2016, 3:02 PM
pcc marked 7 inline comments as done.
pcc edited edge metadata.
  • Add invoke test; fix invoke replacement bug
  • Use INITIALIZE_PASS
  • Move most of the code into a separate struct
  • Move buildBitSets to function
  • Move rebuildGlobal to function
  • Address misc code review comments, add support for explicit sections
lib/Transforms/IPO/PassManagerBuilder.cpp
644 ↗(On Diff #46862)

This happens after the internalizer pass. If a vtable is linked into the program but turns out to be unused from a whole-program perspective, this pass will remove it.

lib/Transforms/IPO/WholeProgramDevirt.cpp
208–210 ↗(On Diff #46862)

Added a test. Good thing I did, because this code contained a bug, which is now fixed.

296–298 ↗(On Diff #46862)

I don't think this will work because op_begin()..op_end() is an iterator of Uses, not Values.

602 ↗(On Diff #46862)

While this is true, it isn't clear that the lowered complexity here would not be outweighed by added complexity in the rest of the compiler. For example, if we adopt a design similar to the one you proposed,other parts of the compiler (e.g. constant folding) would need to be taught about the new intrinsic, and the information from the frontend would need to be kept in sync with the program semantics. The advantage of reconstructing information from program semantics is that we know that the semantics are most likely correct, or the program wouldn't work.

Before we consider changing the design, I would like to see a concrete design proposal that takes into account existing features such as CFI. I don't think any potential new design should block this from landing though.

630–631 ↗(On Diff #46862)

If we encounter one of these, we need to disable the optimization for that bitset only. I suppose we could either keep track of which bitsets we need to disable optimization for, or disable optimization for all bitsets if we see one of these (by clearing BitSets and returning), but at that point we're making the code more complicated and/or losing a feature.

702 ↗(On Diff #46862)

I don't think this comes up with real vtables, but NewGV would need to inherit the section (aliases can't have explicit sections). Added test.

pete added inline comments.Feb 8 2016, 3:22 PM
lib/Transforms/IPO/WholeProgramDevirt.cpp
331–347 ↗(On Diff #47261)

Sorry, don't know how to reply to this point correctly in Phab, but here goes:

So this is the current code to add an entry to BitSets[]. This is where I think we could move the following checks, as that way we know that all elements in BitSets have valid GlobalValue's according to these checks. I don't think thats too bad so long as we can move the checks here. Obviously if for some reason we had to duplicate the checks then your current code is better.

These are the checks I'm thinking of:

if (!BS.Bits->GV->isConstant())

  return false;

auto Init = dyn_cast<ConstantArray>(BS.Bits->GV->getInitializer());
if (!Init)
  return false;
pcc added inline comments.Feb 8 2016, 3:37 PM
lib/Transforms/IPO/WholeProgramDevirt.cpp
331–347 ↗(On Diff #47261)

If an element has an invalid GlobalValue according to those checks, we cannot apply the optimization for that bitset, because it would be an invalid optimization. Consider:

@vt1 = global [1 x i8*] [i8* bitcast void()* @vf1 to i8*]
@vt2 = constant [1 x i8*] [i8* bitcast void()* @vf2 to i8*]


void @f(i8* %this) {
  %vtable = load %this
  llvm.assume(llvm.bitset.test(%vtable, !"bitset"))
  ; make a virtual call through %vtable
}

!0 = {!"bitset", @vt1, 0}
!1 = {!"bitset", @vt2, 0}
!llvm.bitsets = !{!0, !1}

If we do those checks here, the pass would add only @vt2 to BitSets["bitset"], and it would later apply single-implementation devirtualization to turn the virtual call in @f into a direct call to @vf2. But that would not be correct, because it could potentially be a call to @vf1 (or potentially any other function, if some other code overwrites the pointer in @vt1).

Because we expect this case to never arise in practice, the simplest thing to do is to handle it in the loop later.

mehdi_amini accepted this revision.Feb 9 2016, 1:15 PM
mehdi_amini edited edge metadata.

LGTM, unless Pete has more changes to ask pre-commit?

lib/Transforms/IPO/PassManagerBuilder.cpp
644 ↗(On Diff #47261)

I didn't expect internalize alone to expose such opportunities, but yes thinking more about it I can foresee that.

lib/Transforms/IPO/WholeProgramDevirt.cpp
603 ↗(On Diff #47261)

I still regret that the design of the IR representation haven't been considered any further. In many other cases such a feature would be block on such consideration (Technical debt). As much as this is not widespread, it still involves changes in the frontend, so it is not completly isolated. It seems to me that CFI acts as a Trojan horse here. It is not clear to me if the actual "bitset" has any other justification for its existence that being the clever and efficient *runtime* system that CFI is using. Now I can be totally wrong and it is possible that when CFI was integrated, the IR representation was carefully designed independently of the actual CFI runtime, I just don't have time to dig in the history.

pcc added inline comments.Feb 9 2016, 1:38 PM
lib/Transforms/IPO/WholeProgramDevirt.cpp
603 ↗(On Diff #47261)

I feel in general that it's best not to do too much design up-front in cases where it can be avoided, as that can lead to a sub-optimal design. In this case, we control the frontend, so it's certainly possible to reverse out of any bad decisions we make here. Instead, I feel it's best to revisit the design in response to feature requirements. For example, as I implement llvm.bitset.check and some of the ABI breaking features discussed on the proposal thread I expect I'll end up tweaking the design in response to them, and the ultimate design will be better for it.

This revision was automatically updated to reflect the committed changes.