[PGO] Promote indirect calls to conditional direct calls with value-profile
ClosedPublic

Authored by xur on Mar 3 2016, 12:30 PM.

Details

Summary

This patch implements the transformation that promotes indirect calls to conditional direct calls when the indirect-call value profile meta-data is available.

It is invoked in two places: (1) after the IR-level profile-use pass. (2) in LTO optimizations. (1) performs limited promotion as many of the targets are in other modules. (2) is capable of promoting all the targets.

The Transformation performs legality check and profitability check. Legality check mainly for corrupted profiles where the ic-targets can points to wrong targets. Profitability check currently only uses the counts value and the ratio b/w this target and total indirect-call count.

We stress-tested the transformation with SPEC2006 to promote all the indirect-targets in the value profiles (in lto compilation).

We haven yet to do a thorough performance tuning. Here is the initial performance numbers with some SPEC programming running (10-run average on an ibis_sandybridge machine).

`` (1) (2) (3) (4)
spec/2006/fp/C++/453.povray 42.05 -8.17% +3.72% +2.32%
spec/2006/int/C++/471.omnetpp 23.69 -3.41% +3.37% +1.27%
spec/2006/int/C++/473.astar 23.04 -1.49% +1.91% +1.68%
spec/2006/int/C++/483.xalancbmk 38.83 -1.91% +6.01% +4.99%
spec/2000/int/C++/252.eon 4999 -14.47% +4.84% +2.36%``

(1) PGO + LTO
(2) PGO
(3) PGO + LTO + ICP(Indirect Call Promotion)
(4) PGO + LTO + ICP(only promote 1 target)

One known issue to be fixed later:
code duplication (like unroll) b/w value profile annotation and value profile transformation blindly duplicates the counts in the VP metadata. Need to update the metadata copying.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
xur added a subscriber: xur.Apr 7 2016, 1:25 PM

We intentionally drop this +1 in IR level PGO instrumentation. I thing
branch weight can handle 0 weight now.

Thanks,

-Rong

xur updated this revision to Diff 53078.Apr 8 2016, 1:26 PM

Refactor the code to break a large function into smaller ones.
This new patch supports the covariant return for invoke instructions too.

xur updated this revision to Diff 53081.Apr 8 2016, 1:33 PM

fix comments.

I benchmarked this patch on our ARM64. I've done two runs.

In the first, I used the patch with -enable-value-profiling, LTO and PGO of course. I was using FE-based instrumentation. This is the first column in the table below.

I've also done a second run where I tried to fix up the missed promotions due to the incorrect internalization for static functions (the FE variant of D17895). This helps povray which has a few critical indirect calls to functions that are statically defined. Unfortunately, the promoted calls are still not inlined in povray which would further improve performance.

This the second column in the table and the diff is relative to the previous column (i.e. povray in the second run recovers *half* the regression from the first run). The numbers are directly from LNT so a negative percentage is a performance *improvement*.

|                        |    patch  | no-localize |
|------------------------+-----------+-------------+
| CINT2006/464.h264ref   |    -3.35% |             |
| CINT2006/483.xalancbmk |    -1.11% |             |
|------------------------+-----------+-------------+
| CFP2006/453.povray     |    +0.92% | -0.48%      |
|------------------------+-----------+-------------+
| CINT2000/252.eon       |    -1.06% |             |

So to summarize, this looks great on ARM64 assuming we have a solution for the internalization problem for FE-based profiling as well. xur, is there already a patch for that?

xur added a comment.Apr 15 2016, 10:26 AM

Adam, thanks for testing on ARM64 and sharing the number.

For front-end's VP annotation, I've sent out a patch for review a few weeks
early:
http://reviews.llvm.org/D18624

Could you try that?

Thanks,

-Rong

Sure, I'll try it.

First batch of comments from browsing the patch.

lib/IR/Instructions.cpp
3016

This change can be carved out as a NFC refactoring.

lib/Transforms/IPO/PassManagerBuilder.cpp
588

that left -- that are left

589

intra-module targets

590

can you explain the 'save compile time' part here?

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
74

once the total number of promotions equals the cutoff value.

77

Why all upper case, ICPCutOff looks better.

80

-->

If ICPCSSkip is non zero, the first ICPCSSkip callsites will be skipped.

92

If the option is set to true, only call instructions will be considered for transformation -- invoke instructions will be ignored.

149

maps indirect call profile values to function names.

169

records

170

Perhaps PromotionCandidate?

185

Document this method -- what is passed in and what is returned.

189

make it static member function

221

just
if (count < ICPCountThreshold)

223

When can this happen? Should it be corrected in caller?

234

Is this check needed? NotAvailableInModule is essentially body not available.

278

This local variable is not used.

362

make the name more meaningful such as "if.true.direct_targ"

537

It always returns true -- change it to void

In D17864#402676, @xur wrote:

Adam, thanks for testing on ARM64 and sharing the number.

For front-end's VP annotation, I've sent out a patch for review a few weeks
early:
http://reviews.llvm.org/D18624

Could you try that?

I am replying under D18624.

davidxl added inline comments.Apr 20 2016, 3:44 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
378

Do you need merge BB here for invoke at all (which will be empty and unreachable)?

439

if you remove MergeBB, there is no need for this step -- or this is needed to avoid fixing up phis in the original normal dest?

445

add comment -- clear the value profile data.

555

Can this be folded into create4DirectCallInst call?

xur marked 15 inline comments as done.Apr 20 2016, 4:56 PM
xur added inline comments.
lib/IR/Instructions.cpp
3016

OK. Will split this code out to a separated patch.

lib/Transforms/IPO/PassManagerBuilder.cpp
590

As I mentioned in earlier reply. the optimization performed in LTOOptimizationPasses cannot be palatalized. The Intra-module promotion can be trivially paralleled. LTO part of promotion takes pretty long for moderately large programs like 483.xalancbmk in spec2006. Splitting into two part saves the time for parallel builds.

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
185

will do.

223

will have this checked in the caller.

234

decls might be available.
just for debug.
I'll merge this two cases.

378

I assume you are referring this to the invoke instr.
I need this BB to insert a potential bit-cast inst.
If there is not bit-cast, it will be just a branch to the original destBB.
I assume this gonna be clean by late CFG optimization.

439

as I mentioned early, this is for bit-cast, not for phis.

555

I remember there is a transformation ordering issue. but I can try. will comment this more in the new patch.

xur updated this revision to Diff 54784.Apr 23 2016, 10:36 AM

Integrated David's comments.
(1) it turned out we don't need new interface for castIsValid. I now use the existing isBitCastable()
(2) I was able to fold insertCallRetCast() to createDirectCallInst() which make code more compact.
I have to add a reference parameter in createDirCallInst() as I need to use both the direct-call instruction and the bitcast instruction.

Thanks,

-Rong

davidxl added inline comments.Apr 24 2016, 4:30 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
409

Fix the comments.

This method fixes up PHI nodes in block BB. In BB, there may be PHIs with incoming block being OrigBB (the MergeBB after block splitting). After if-then-else splitting, OrigBB is no longer the predecessor block of BB. Instead two new predecessors are added: IndirectCallBB and DirectCallBB, so the PHI nodes's incoming BBs need to be fixed up accordingly.

413

This method is correct only for unwind/eh dest BB for an invoke instruction. It is not right for the normalDest.

For normalDest BB, the PHI node's predecessor BB remain unchanged -- OrigBB (i.e. MergeBB), but the value of the PHI incoming value may change into new PHI or cast inst inserted into the MergeBB.

429

Add assert that V is defined in a BB that dominates IndirectCallBB

479

Clean --> Clear

498

This comment needs lots of cleanup.

503

Since this function body computes lots of values already available in the caller context, suggest removing it and inline it into the caller.

512

nullptr ? It does not look correct. It shold be the DirectcallInst.

514

see comment in fixupPHINode method.

527

For Invoke, since the cast is inserted in the original BB (mergeBB) -- i.e., the new NormalDest, it post-dominates the call return values from both predecessors so there seems to be no need for phi insertion -- unless you split the another new BB between DirectBB and MergeBB.

xur marked 3 inline comments as done.Apr 25 2016, 10:35 AM
xur added inline comments.
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
413

Note the condition "if (DirectCallBB)"
I only change the phi operand points to DirectcallBB when it's not a nullptr.

For normalDest, the DirectcallBB is set to nullptr in the caller.
This also answers your question about the call's nullptr argument.

512

see the reply in fixupPHINode.

514

see the reply in the fixupPHINode.

527

The phi is not inserted in the mergeBB. It's in the BB the joins mergeBB and original NormalDest. We now have one return value from the original indirect invoke and one from new direct invoke. A phi is needed in the join point.

xur added a comment.Apr 25 2016, 11:16 AM

another comments

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
429

do you mean V is defined in IndirectCallBB and dominates BB?

we are in the middle of CFG transformation and dominate tree might not be valid.
Is there a cheap way to check?

why do you think we need to check this? this seems to be obvious (unless the phis are corrupted already).

xur updated this revision to Diff 54915.Apr 25 2016, 2:58 PM

Addressed some of the David's most recently comments:
(1) comments change
(2) inline fixupInvokeInst into promote(). After this change, we no longer need the direct-call instruction. So I changed createDirectCallInst() to return the new instruction (either is direct-call or the result-cast).
(3) simplify the test case for the covariant return for call.
(4) add a test case for covariant return for invoke.

davidxl added inline comments.Apr 26 2016, 12:02 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
388

Document that this is needed for the newly created direct call (invoke) -- as its normal Dst will be fixed up to MergeBB

409

.. in BB (where BB can either be the original NormalDst BB or the unwind BB of the indirect call/invoke instruction).

411

After if-then-else splitting, if BB is the unwindBB, OrigBB will no longer be the pre...

If BB is the Original NormalDstBB, a new incoming edge will be added to the original NormalDstBB from the IndirectCallBB.

414

I suggest split this method into two fixPHINodesForNormDest and fixPHINodesForUnwind -- for better readability.

459

Document parameters. For instance, for mergeBB, document that it is the mergeBB (bottom BB of the if-then-else-diamond) after indirect call promotion. For Invoke instruction, the edges from directBB and IndirectBB to mergedBB are removed before this call (during createIfThenElse).

Also document the CFG effect of this method: after the call, the direct invokes normal destination will be redirected to MergeBB.

463

NormalDest --> MergeBB

474

Add a comment lie:

direct Invoke's normal destinaion is fixed up to mergeBB. MergeBB is the place where return cast is inserted. Also since indirectCallBB does not have an edge to MergeBB, there is no need to insert new PHIs into MergeBB.

xur updated this revision to Diff 55106.Apr 26 2016, 3:25 PM
xur marked 7 inline comments as done.

Integrated David's most recently review comments.

Thanks,

-Rong

davidxl added inline comments.Apr 27 2016, 10:07 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
416

Document 'Inst'. Also looks like Inst is not used in the body.

425

Is this condition possible?

429

What is the phi arg value for this new incoming block?

434

invoke --> invoke (indirect)

437

Add this description: the OrigBB is still an incoming block for NormalDstBB, but the phi incoming value from this block becomes the new value NewInst

448

Is IX == -1 possible?

xur marked 2 inline comments as done.Apr 27 2016, 10:59 AM
xur added inline comments.
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
416

you are right. Inst parameter is only used to check the invoke return value. I assume the UnwindDest will not have a phi using the return value.

I'll remove Inst from the parameter list.

425

why not? The operand can come from a basic block other than OrigBB:
like bb1 (def v_1) --> bb2(invoke) --> bb

bb2(def v_2) --> bb

bb can have a phi(v_1, v_2) in bb.
The incoming block for the phi with ix 0 is bb1, rather bb2, right?

429

it's the same value as the one from OriginBB: in the two statements, the first only dup the value, and second one rest the incoming bb to DirectCallBB.

448

see earlier reply.

davidxl added inline comments.Apr 27 2016, 11:27 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
425

Sorry -- I misread it :) WHat I meant is that the loop should have an assert that OrigBB is found in one of the iterations

429

IF the value was the value from the original indirect call, should the new value become the result of the direct call?

xur added inline comments.Apr 27 2016, 11:38 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
425

It might not have a PHI that have OrigBB has the incoming BB, right?

429

We have the special handling of the return value PHI. This only applies to the NormalDest.

I assume UnwindBB won't use the return value. Let me know my assumption is incorrect.

This code handle the phi other than the return value PHI.

davidxl accepted this revision.Apr 27 2016, 11:45 AM

LGTM

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
429

This assumption is probably correct -- in case of exception, the return value is not properly defined.

This revision is now accepted and ready to land.Apr 27 2016, 11:45 AM
This revision was automatically updated to reflect the committed changes.