This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Promote indirect calls to conditional direct calls with value-profile
ClosedPublic

Authored by xur on Mar 3 2016, 12:30 PM.

Details

Summary

This patch implements the transformation that promotes indirect calls to conditional direct calls when the indirect-call value profile meta-data is available.

It is invoked in two places: (1) after the IR-level profile-use pass. (2) in LTO optimizations. (1) performs limited promotion as many of the targets are in other modules. (2) is capable of promoting all the targets.

The Transformation performs legality check and profitability check. Legality check mainly for corrupted profiles where the ic-targets can points to wrong targets. Profitability check currently only uses the counts value and the ratio b/w this target and total indirect-call count.

We stress-tested the transformation with SPEC2006 to promote all the indirect-targets in the value profiles (in lto compilation).

We haven yet to do a thorough performance tuning. Here is the initial performance numbers with some SPEC programming running (10-run average on an ibis_sandybridge machine).

`` (1) (2) (3) (4)
spec/2006/fp/C++/453.povray 42.05 -8.17% +3.72% +2.32%
spec/2006/int/C++/471.omnetpp 23.69 -3.41% +3.37% +1.27%
spec/2006/int/C++/473.astar 23.04 -1.49% +1.91% +1.68%
spec/2006/int/C++/483.xalancbmk 38.83 -1.91% +6.01% +4.99%
spec/2000/int/C++/252.eon 4999 -14.47% +4.84% +2.36%``

(1) PGO + LTO
(2) PGO
(3) PGO + LTO + ICP(Indirect Call Promotion)
(4) PGO + LTO + ICP(only promote 1 target)

One known issue to be fixed later:
code duplication (like unroll) b/w value profile annotation and value profile transformation blindly duplicates the counts in the VP metadata. Need to update the metadata copying.

Diff Detail

Event Timeline

xur updated this revision to Diff 49766.Mar 3 2016, 12:30 PM
xur retitled this revision from to [PGO] Promote indirect calls to conditional direct calls with value-profile.
xur updated this object.
xur added reviewers: davidxl, ivanbaev.
xur added a subscriber: llvm-commits.
silvas added a subscriber: silvas.Mar 3 2016, 2:32 PM

Can you split the pass out of PGOInstrumentation.cpp?

Also, there does not appear to be enough testing for the amount of functionality introduced.

include/llvm/Transforms/Instrumentation.h
86

Please name the parameter after what it does. What does it control?

Also, whatever it does control can probably be done in a separate patch.

lib/ProfileData/InstrProf.cpp
91 ↗(On Diff #49766)

This function is confusingly named. How about "get name ignoring per-file mangling" or something? A comment can explain why this is valid for LTO.

lib/Transforms/Instrumentation/PGOInstrumentation.cpp
302 ↗(On Diff #49766)

Please clarify why this is needed in a comment (and possibly a separate patch).

973 ↗(On Diff #49766)

No floating point.

1000 ↗(On Diff #49766)

ParamNum

1025 ↗(On Diff #49766)

NumVals is way too generic of a name. Please be more specific. If it is just the number of entries in ValueData then use ArrayRef.

1059 ↗(On Diff #49766)

NumPromotions is always just equal to WorkList.size().

Actually, I would remove the terminology "worklist" and just say "promotions".

E.g. add struct CandidatePromotion with two members TargetFunction and Count. Then have this function just return a std::vector<CandidatePromotion>.

1130 ↗(On Diff #49766)

Redundant parentheses.

1140 ↗(On Diff #49766)

std::next (or llvm::next, if we can't use std::next yet).

I.e.

if (FuncRetType != CallRetType)
  NewInst = new BitCastInst(NewInst, CallRetType, "", std::next(NewInst));

Also, I see you are using IRBuilder above. Any reason why IRBuilder is not preferred for the IR building later in the function?

1184 ↗(On Diff #49766)

The naming here is quite poor. We have 3 functions:

performPromotion
doPromotion
doVersion

I think a better naming is:

processFunction
tryToPromote
promote

respectively.

1188 ↗(On Diff #49766)

In a separate patch, please add a helper function findIndirectCallSites(F) so that you can do for auto &I : findIndirectCallSites(F)) instead of having this boilerplate manually calling visit.

1198 ↗(On Diff #49766)

getCandidatePromotionsForCallSite is a better name.

auto CandidatePromotions = getCandidatePromotionsForCallSite(...);
NumPromoted += tryToPromote(ICall, CandidatePromotions, TotalCount)
1213 ↗(On Diff #49766)

If you had ValueDataArray in an ArrayRef (and annotateValueSite took an ArrayRef), then this would be annotateValueSite(*M, *I, ValueDataArray.drop_front(NumPromoted, TotalCount, IPVK_IndirectCallTarget, MaxNumPromotions) which would be a lot nicer.

test/Transforms/PGOProfile/indirect_call_annotation.ll
2 ↗(On Diff #49766)

Do not piggy-back on existing tests. You are adding completely new functionality which should have its own independent tests.

xur added a reviewer: silvas.Mar 4 2016, 9:20 AM
xur marked 10 inline comments as done.Mar 4 2016, 12:25 PM

Sean: Thanks for the detailed suggestions. I'll split the new feature into a separated file.

I will also put some of changes into a separated patch.

I do have some other tests locally -- I just not yet to convert them into the unit tests. I'll add them alone with the review goes.

lib/ProfileData/InstrProf.cpp
91 ↗(On Diff #49766)

This is to deal with the interaction with LTO's internalization. I'll split this part of change into a separated patch.

lib/Transforms/Instrumentation/PGOInstrumentation.cpp
302 ↗(On Diff #49766)

This will go to a separated patch.

1025 ↗(On Diff #49766)

will use a ArrayRef

1140 ↗(On Diff #49766)

No particular reason that I don't use IRbuilder. The inserted instructions are not in different BBs (not with the original call inst). I just thought it's convenient to change the code stream directly.

1188 ↗(On Diff #49766)

ok. will do a separated patch.

1213 ↗(On Diff #49766)

yes. this is cleaner. but this requires a annotateValueSite() change. I prefer to do this in a post-commit cleanup patch.

test/Transforms/PGOProfile/indirect_call_annotation.ll
2 ↗(On Diff #49766)

undo the change.

silvas added inline comments.Mar 4 2016, 12:33 PM
lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1213 ↗(On Diff #49766)

A pre-commit cleanup patch is also possible.

mehdi_amini added inline comments.Mar 4 2016, 12:39 PM
include/llvm/ProfileData/InstrProf.h
160 ↗(On Diff #49766)

I expect lengthy description of the naming, why the difference between LTO and non-LTO. If it described somewhere else, please insert a reference to it.

lib/Transforms/Instrumentation/PGOInstrumentation.cpp
126 ↗(On Diff #49766)

Remove

1140 ↗(On Diff #49766)

The IRBuilder has the nice property that you can supply a name for the instruction and it'll show up only in debug build.

xur marked 3 inline comments as done.Mar 4 2016, 2:10 PM

split some of the chages to http://reviews.llvm.org/D17895

include/llvm/ProfileData/InstrProf.h
160 ↗(On Diff #49766)
xur updated this revision to Diff 49858.Mar 4 2016, 2:48 PM

Integrated with Sean's review comments.
more unit tests are still to be added. Will include them later.

This patch depends on
http://reviews.llvm.org/D17895

Thanks,

-Rong

davidxl added inline comments.Mar 4 2016, 3:09 PM
lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1223 ↗(On Diff #49766)

See comment in 17895.

There is no need for this map. InstrProfSymtab should return the real function name from the PGOName's MD5 sum (icall target value). Using the real function name, you can directly use:

M.getFunction(RealName) to return the function.

anemet added a subscriber: anemet.Mar 22 2016, 10:05 PM
mehdi_amini added inline comments.Mar 22 2016, 10:41 PM
lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1140 ↗(On Diff #49766)

Just a note: forget the comment above on the IRBuilder, this has changed recently and names are handled dynamically in the Context now.

xur updated this revision to Diff 51689.Mar 25 2016, 2:49 PM

This patch uses the updated InstrProf interface in patch http://reviews.llvm.org/D17895.

A few more unit tests also added.

This also fixed a regression in last patch: it seems the "std::next()" is not working I thought here:
NewInst = new BitCastInst(NewInst, CallRetType, "", std::next(NewInst));
It would not return the next instruction. This is fixed in this patch.

Again, this patch depends on http://reviews.llvm.org/D17895.

xur updated this revision to Diff 52762.Apr 5 2016, 8:46 PM

Update the patch with recently refactoring patches.
Add handling of invoke instructions (covariant return type for invoke is to be supported later).
Also add some internal options for debugging.
Stressed tested with SPEC2006.

johanengelen added inline comments.
lib/Transforms/IPO/PassManagerBuilder.cpp
229

Placing add(createPGOIndirectCallPromotionPass()) here means that IndirectCallPromotionPass will be enabled at -O0, which I think is undesirable.

I expect that somewhere you look at the "opt_none" attribute on the function and bail.

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
559

Why isn't is a function pass?

mehdi_amini added inline comments.Apr 6 2016, 10:38 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
75

s/transforamtion/transformation/

xur marked 2 inline comments as done.Apr 6 2016, 12:15 PM

I will check Attribute::OptimizeNone per @joker.eph's suggestion.

lib/Transforms/IPO/PassManagerBuilder.cpp
229

Good point. We should not do this in O0. I'll move this call out of this function.

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
559

mainly I need to create InstrProfSymtab which is module level. I don't want to create this symtab for each function.

xur updated this revision to Diff 52833.Apr 6 2016, 12:19 PM
xur marked an inline comment as done.

Integrated the review comments from Johan and Mehdi.

Thanks!

mehdi_amini added inline comments.Apr 6 2016, 12:46 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
383

Why is it running twice in LTO mode?

588

What is the meaning of the "xur" here?

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
534

Can you implement this in a utility function? I expect the "pass wrapper" to have a dummy runOnModule that would only call the public API.

xur updated this revision to Diff 52852.Apr 6 2016, 2:40 PM

Medhi, here is the patch that uses a static function. Does it look OK?

As for your question about the two-step promotion: Yes, this produces the same result if we only do the promotion in LTO optimization. The intra-module promotion will stop at the first target that it cannot promote and write the remaining VP targets to the metadata. The cross-module promotion will pick it up and do the rest.

Thanks!
-Rong

mehdi_amini added inline comments.Apr 6 2016, 3:06 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
536

Nice, but should be static considering the name. Alternatively I don't mind having it non static if it is called something like promoteIndirectCalls()

xur updated this revision to Diff 52857.Apr 6 2016, 3:26 PM

Fix the function name.

(thanks for your work on this btw! ;)

test/Transforms/PGOProfile/icp_vararg.ll
33

I think the branch_weights are calculated slightly wrong.
branch_weights usually have an offset of 1, where 1 means there is no profile measurement that it was taken. See the documentation of scaleBranchWeight in CodeGenPGO.cpp of Clang.

xur added a subscriber: xur.Apr 7 2016, 1:25 PM

We intentionally drop this +1 in IR level PGO instrumentation. I thing
branch weight can handle 0 weight now.

Thanks,

-Rong

xur updated this revision to Diff 53078.Apr 8 2016, 1:26 PM

Refactor the code to break a large function into smaller ones.
This new patch supports the covariant return for invoke instructions too.

xur updated this revision to Diff 53081.Apr 8 2016, 1:33 PM

fix comments.

I benchmarked this patch on our ARM64. I've done two runs.

In the first, I used the patch with -enable-value-profiling, LTO and PGO of course. I was using FE-based instrumentation. This is the first column in the table below.

I've also done a second run where I tried to fix up the missed promotions due to the incorrect internalization for static functions (the FE variant of D17895). This helps povray which has a few critical indirect calls to functions that are statically defined. Unfortunately, the promoted calls are still not inlined in povray which would further improve performance.

This the second column in the table and the diff is relative to the previous column (i.e. povray in the second run recovers *half* the regression from the first run). The numbers are directly from LNT so a negative percentage is a performance *improvement*.

|                        |    patch  | no-localize |
|------------------------+-----------+-------------+
| CINT2006/464.h264ref   |    -3.35% |             |
| CINT2006/483.xalancbmk |    -1.11% |             |
|------------------------+-----------+-------------+
| CFP2006/453.povray     |    +0.92% | -0.48%      |
|------------------------+-----------+-------------+
| CINT2000/252.eon       |    -1.06% |             |

So to summarize, this looks great on ARM64 assuming we have a solution for the internalization problem for FE-based profiling as well. xur, is there already a patch for that?

xur added a comment.Apr 15 2016, 10:26 AM

Adam, thanks for testing on ARM64 and sharing the number.

For front-end's VP annotation, I've sent out a patch for review a few weeks
early:
http://reviews.llvm.org/D18624

Could you try that?

Thanks,

-Rong

Sure, I'll try it.

davidxl edited edge metadata.Apr 16 2016, 6:16 PM

First batch of comments from browsing the patch.

lib/IR/Instructions.cpp
3016

This change can be carved out as a NFC refactoring.

lib/Transforms/IPO/PassManagerBuilder.cpp
588

that left -- that are left

589

intra-module targets

590

can you explain the 'save compile time' part here?

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
75

once the total number of promotions equals the cutoff value.

78

Why all upper case, ICPCutOff looks better.

81

-->

If ICPCSSkip is non zero, the first ICPCSSkip callsites will be skipped.

93

If the option is set to true, only call instructions will be considered for transformation -- invoke instructions will be ignored.

150

maps indirect call profile values to function names.

170

records

171

Perhaps PromotionCandidate?

186

Document this method -- what is passed in and what is returned.

190

make it static member function

222

just
if (count < ICPCountThreshold)

224

When can this happen? Should it be corrected in caller?

235

Is this check needed? NotAvailableInModule is essentially body not available.

279

This local variable is not used.

363

make the name more meaningful such as "if.true.direct_targ"

538

It always returns true -- change it to void

In D17864#402676, @xur wrote:

Adam, thanks for testing on ARM64 and sharing the number.

For front-end's VP annotation, I've sent out a patch for review a few weeks
early:
http://reviews.llvm.org/D18624

Could you try that?

I am replying under D18624.

davidxl added inline comments.Apr 20 2016, 3:44 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
379

Do you need merge BB here for invoke at all (which will be empty and unreachable)?

440

if you remove MergeBB, there is no need for this step -- or this is needed to avoid fixing up phis in the original normal dest?

446

add comment -- clear the value profile data.

556

Can this be folded into create4DirectCallInst call?

xur marked 15 inline comments as done.Apr 20 2016, 4:56 PM
xur added inline comments.
lib/IR/Instructions.cpp
3016

OK. Will split this code out to a separated patch.

lib/Transforms/IPO/PassManagerBuilder.cpp
590

As I mentioned in earlier reply. the optimization performed in LTOOptimizationPasses cannot be palatalized. The Intra-module promotion can be trivially paralleled. LTO part of promotion takes pretty long for moderately large programs like 483.xalancbmk in spec2006. Splitting into two part saves the time for parallel builds.

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
186

will do.

224

will have this checked in the caller.

235

decls might be available.
just for debug.
I'll merge this two cases.

379

I assume you are referring this to the invoke instr.
I need this BB to insert a potential bit-cast inst.
If there is not bit-cast, it will be just a branch to the original destBB.
I assume this gonna be clean by late CFG optimization.

440

as I mentioned early, this is for bit-cast, not for phis.

556

I remember there is a transformation ordering issue. but I can try. will comment this more in the new patch.

xur updated this revision to Diff 54784.Apr 23 2016, 10:36 AM
xur edited edge metadata.

Integrated David's comments.
(1) it turned out we don't need new interface for castIsValid. I now use the existing isBitCastable()
(2) I was able to fold insertCallRetCast() to createDirectCallInst() which make code more compact.
I have to add a reference parameter in createDirCallInst() as I need to use both the direct-call instruction and the bitcast instruction.

Thanks,

-Rong

davidxl added inline comments.Apr 24 2016, 4:30 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
409

Fix the comments.

This method fixes up PHI nodes in block BB. In BB, there may be PHIs with incoming block being OrigBB (the MergeBB after block splitting). After if-then-else splitting, OrigBB is no longer the predecessor block of BB. Instead two new predecessors are added: IndirectCallBB and DirectCallBB, so the PHI nodes's incoming BBs need to be fixed up accordingly.

413

This method is correct only for unwind/eh dest BB for an invoke instruction. It is not right for the normalDest.

For normalDest BB, the PHI node's predecessor BB remain unchanged -- OrigBB (i.e. MergeBB), but the value of the PHI incoming value may change into new PHI or cast inst inserted into the MergeBB.

429

Add assert that V is defined in a BB that dominates IndirectCallBB

479

Clean --> Clear

498

This comment needs lots of cleanup.

503

Since this function body computes lots of values already available in the caller context, suggest removing it and inline it into the caller.

512

nullptr ? It does not look correct. It shold be the DirectcallInst.

514

see comment in fixupPHINode method.

527

For Invoke, since the cast is inserted in the original BB (mergeBB) -- i.e., the new NormalDest, it post-dominates the call return values from both predecessors so there seems to be no need for phi insertion -- unless you split the another new BB between DirectBB and MergeBB.

xur marked 3 inline comments as done.Apr 25 2016, 10:35 AM
xur added inline comments.
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
413

Note the condition "if (DirectCallBB)"
I only change the phi operand points to DirectcallBB when it's not a nullptr.

For normalDest, the DirectcallBB is set to nullptr in the caller.
This also answers your question about the call's nullptr argument.

512

see the reply in fixupPHINode.

514

see the reply in the fixupPHINode.

527

The phi is not inserted in the mergeBB. It's in the BB the joins mergeBB and original NormalDest. We now have one return value from the original indirect invoke and one from new direct invoke. A phi is needed in the join point.

xur added a comment.Apr 25 2016, 11:16 AM

another comments

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
429

do you mean V is defined in IndirectCallBB and dominates BB?

we are in the middle of CFG transformation and dominate tree might not be valid.
Is there a cheap way to check?

why do you think we need to check this? this seems to be obvious (unless the phis are corrupted already).

xur updated this revision to Diff 54915.Apr 25 2016, 2:58 PM

Addressed some of the David's most recently comments:
(1) comments change
(2) inline fixupInvokeInst into promote(). After this change, we no longer need the direct-call instruction. So I changed createDirectCallInst() to return the new instruction (either is direct-call or the result-cast).
(3) simplify the test case for the covariant return for call.
(4) add a test case for covariant return for invoke.

davidxl added inline comments.Apr 26 2016, 12:02 PM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
388

Document that this is needed for the newly created direct call (invoke) -- as its normal Dst will be fixed up to MergeBB

409

.. in BB (where BB can either be the original NormalDst BB or the unwind BB of the indirect call/invoke instruction).

411

After if-then-else splitting, if BB is the unwindBB, OrigBB will no longer be the pre...

If BB is the Original NormalDstBB, a new incoming edge will be added to the original NormalDstBB from the IndirectCallBB.

414

I suggest split this method into two fixPHINodesForNormDest and fixPHINodesForUnwind -- for better readability.

459

Document parameters. For instance, for mergeBB, document that it is the mergeBB (bottom BB of the if-then-else-diamond) after indirect call promotion. For Invoke instruction, the edges from directBB and IndirectBB to mergedBB are removed before this call (during createIfThenElse).

Also document the CFG effect of this method: after the call, the direct invokes normal destination will be redirected to MergeBB.

463

NormalDest --> MergeBB

474

Add a comment lie:

direct Invoke's normal destinaion is fixed up to mergeBB. MergeBB is the place where return cast is inserted. Also since indirectCallBB does not have an edge to MergeBB, there is no need to insert new PHIs into MergeBB.

xur updated this revision to Diff 55106.Apr 26 2016, 3:25 PM
xur marked 7 inline comments as done.

Integrated David's most recently review comments.

Thanks,

-Rong

davidxl added inline comments.Apr 27 2016, 10:07 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
416

Document 'Inst'. Also looks like Inst is not used in the body.

425

Is this condition possible?

429

What is the phi arg value for this new incoming block?

434

invoke --> invoke (indirect)

437

Add this description: the OrigBB is still an incoming block for NormalDstBB, but the phi incoming value from this block becomes the new value NewInst

448

Is IX == -1 possible?

xur marked 2 inline comments as done.Apr 27 2016, 10:59 AM
xur added inline comments.
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
416

you are right. Inst parameter is only used to check the invoke return value. I assume the UnwindDest will not have a phi using the return value.

I'll remove Inst from the parameter list.

425

why not? The operand can come from a basic block other than OrigBB:
like bb1 (def v_1) --> bb2(invoke) --> bb

bb2(def v_2) --> bb

bb can have a phi(v_1, v_2) in bb.
The incoming block for the phi with ix 0 is bb1, rather bb2, right?

429

it's the same value as the one from OriginBB: in the two statements, the first only dup the value, and second one rest the incoming bb to DirectCallBB.

448

see earlier reply.

davidxl added inline comments.Apr 27 2016, 11:27 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
425

Sorry -- I misread it :) WHat I meant is that the loop should have an assert that OrigBB is found in one of the iterations

429

IF the value was the value from the original indirect call, should the new value become the result of the direct call?

xur added inline comments.Apr 27 2016, 11:38 AM
lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
425

It might not have a PHI that have OrigBB has the incoming BB, right?

429

We have the special handling of the return value PHI. This only applies to the NormalDest.

I assume UnwindBB won't use the return value. Let me know my assumption is incorrect.

This code handle the phi other than the return value PHI.

davidxl accepted this revision.Apr 27 2016, 11:45 AM
davidxl edited edge metadata.

LGTM

lib/Transforms/Instrumentation/IndirectCallPromotion.cpp
429

This assumption is probably correct -- in case of exception, the return value is not properly defined.

This revision is now accepted and ready to land.Apr 27 2016, 11:45 AM
This revision was automatically updated to reflect the committed changes.