Page MenuHomePhabricator

PGO IR-level instrumentation infrastructure

Authored by xur on Sep 10 2015, 4:03 PM.



This patch implements the infrastructure for PGO late (i.e. IR-level) instrumentation. (Refer to RFC: PGO Late instrumentation for LLVM

The main part of code is in a newly added file: lib/Transforms/Instrumentation/PGOLateInstr.cpp
This file implements a module pass PGOLateInstrumeatnion. It applies the instrumentation to each function by class PGOLateInstrumentationFunc. For each function, perform the following steps:
(1) Collect all the CFG edges. Assign an estimated weight to each edge. Critical edges and back-edges are assigned to high value of weights. One fake node and a few fake edges (from the fake node to the entry node, and from the exit nodes to the fake node) are also added to the worklist.
(2) Construct the MST. The edges with the higher weight will be put to MST first, unless it forms a cycle.
(3) Traverse the CFG and compute the CFG hash.

The above three steps are the same for profile-generate and profile-use compilation.

In the next step, for profile-generation compilation:
(4-gen) Instrument all the edges that not in the MST. If this is a critical edge, split the edge first. The actual instrumentation is to generate Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic will be lowed by pass createInstrProfilingPass().

For profile-use compilation,
(4-use) Read in the counters and the CFG hash from the profile file.
(5-use) If there is no error, populate the counters to all the edges in reverse topological order of the MST.
(6-use) Once having all the edge counts, set the branch weights metadata for the IR having multiple branches. Also apply the cold/hot function attributes based on function level counts.

This pass is added to PassManagerBuilder when populate module passes. see lib/Transforms/IPO/PassManagerBuilder.cpp.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
silvas added inline comments.Nov 13 2015, 7:04 PM

Factor the DEBUG macro usage into a single point of truth here. E.g. factor out a function printEdges(raw_ostream &OS, const StringRef Message). Then change this function to be simply DEBUG(dumpEdges(dbgs(), Message)). (this is similar to the pattern used by Value, for example).

This will also give clang-format an easier time.


Use std::tie


Other places seem to use uint64_t for weight. Why unsigned here?


Does this actually need to be virtual? Same for all the other virtual methods in this file.


"dump" functions are for debugging and should be conditionalized. For example, I suggest refactoring to allow a call

std::string Message = "Dump Function " + FuncName +
                          " after CFGMST: \t Hash: " +
MST.dumpEdges(dbgs(), Message);

Avoid having debug code be non-conditional.


Avoid const std::unique_ptr<PGOUseEdge> &.


Use a free function instrumentOneFunc


Use a free function setPGOCountOnFunc

Also, can you begin to work on tests? I think right now the core algorithm is pretty good. There are some cleanups to the implementation that I'd like to see but I can do those post-commit if necessary. Right now the main missing piece needed before this is committed are tests.

davidxl added inline comments.Nov 14 2015, 11:59 AM

if there is no BB ..


It is better to just call it 'setEdgeCount' with same documentation -- or make it just

getUnknownCountEdge() and let the client to set the edge directly.


Sean's comment here is not addressed -- use raw PGOUseEdge&


In general it is not safe to update the container (AllEdges) while iterating -- it may invalidate the iterator or element reference saved before. The code should create a worklist of edges before the count setting -- the worklist will then be 'frozen' (It is safe to update worklist, but there is no need to push new split edges into the list) and AllEdges can be safely updated.


Use the new getInstrProfRecord method -- eventually we will need to read value profile data too.

xur marked 17 inline comments as done.Nov 16 2015, 3:54 PM
xur added inline comments.

reorganized the code, also handles overflow.


changed to uint64_t.


Now guarded by DEBUG.


I see your point. But this is exactly the reason I did not use range-based loop like all others in the file. I get the vector size in the loop initialization and use the index to reference the vector elements in the body. So adding element to the vector will not be a problem. I could use a worklist, but the effect should be the same.

xur updated this revision to Diff 40361.Nov 16 2015, 3:58 PM
xur marked 3 inline comments as done.

Here is the latest patch that integrated Sean and David Li's comments.
I'm working on the test case. One thing I'm not clear is that I need the passmanagerbuild change to invoke the new functionality. That part of code was split from this patch from Manman's review comments.


Some more small comments.


Use a real variable.




return *AllEdges.back()


Give intuition for why it is edges *not in* the MST rather than edges *in* the MST? Edges with high counts should have high weight and therefore *not* be in the MST (which tries for minimum weight); we don't want to instrument edges with high counts, and they are *not* in the MST, so why would we place counters there?


Please cite the Knuth paper. Also explain what we actually do with the MST (and give intuition for why that makes sense (it's fairly simple, should not need a long explanation)).


Make these two variable just local variables of instrumentOnFunc free function.


This is only called in one place, inline the code into instrumentOnFunc.


Can you use continue to reduce indentation for the "large" side of the if? ( )


Do you mean to do a copy here? Probably std::vector<uint64_t> & is intended.


e instead of E, as is common in LLVM.


getBBInfo(Ei->SrcBB). You shouldn't need a cast here (if you do, then please fix whatever code is requiring this to cast away constness).


Just make the iteration variable BB and use getBBInfo(&BB), like you do below in the loop // Assert every BB has a valid counter..


Remove the const_cast (I think this will require making GetSuccessorNumber take const BasicBlock *, which you can do as a separate patch (no need for pre-commit review))


Do you mean instrumentOneFunc? instrumentOnFunc doesn't make sense (weird use of "on").

xur marked 14 inline comments as done.Nov 19 2015, 11:17 AM

Thanks for the suggestion. All fixed. Please refer to the newest patch for the update.

xur updated this revision to Diff 40682.Nov 19 2015, 11:18 AM

Updated with Sean's new comments.
Also includes the unit tests.



davidxl edited edge metadata.Nov 19 2015, 2:57 PM

Can you also add a script to re-generate the binary profile data needed whenever profile format changes ?

xur added a comment.Nov 19 2015, 3:11 PM

These .ll files are the complete program IR that can generate the
profiles. I can embed the commands in the comment of the .ll files. Just
to make sure: You want a separated shell script in the same director to
generate the profiles?



These test files look like they are just a dump of IR generated from a C/C++, which is extremely verbose and has a lot of inessential details. I also don't like checking in binary profdata files. Overall these tests seem extremely brittle. I also don't understand why the test files have names like "for" or "goto" or "ifelse". Those concepts don't exist in the IR. Surely the tests should have names like "criticaledge" etc.

It seems like this testing might be more readable as C++ unit tests. Using the new pass manager this should be easy to wire up. I think the hardest part would be to stub out IndexedInstrProfReader.

I thought about the size of the test case too. Initially I had the same
concern, but in second look, it does not seem to be brittle -- because it
just checks whether the key branches are properly annotated with the right
profile count or not, or profile counter update instruction is inserted or

I think adding C++ unit tests are independent tasks which can be done once
the driver part of the changes land.


xur updated this revision to Diff 41062.Nov 24 2015, 10:29 AM
xur edited edge metadata.

I changed the regression tests based on the feedbacks from Sean, David and Justin. The IR is much simplified and the profile uses the text format.
I still keep the target triplet as x86_64-unknown-linux-gnu, because this is the only platform I tested. I will try to relax it later.



Looks good with the minor fix.


Remove unused code.

This revision was automatically updated to reflect the committed changes.

Seems it behaves differently on i686. Investigating.

See also;
(It is configured as "gcc -m32")

silvas added a subscriber: silvas.Nov 24 2015, 3:16 PM

Hi Rong,

Typically, when there have been multiple active reviewers, it is common
courtesy to wait for all of them to LGTM the patch.

I had further comments on the testing here (which the bots seem to have
caught you on anyway), so I recommend reverting this patch for the moment
and reopening the review.

  • Sean Silva
xur updated this revision to Diff 41175.Nov 25 2015, 1:20 PM
xur removed rL LLVM as the repository for this revision.

Fix a few issues that being exposed by the buildbot:
(1) wrong shift operation in functionhash computing,
(2) missing type cast for vector<...>size_type (which results bad functionhash in m32 host)
(3) sorting is not stable

Also improved the test checks suggested by Sean.



I took another pass through, so some of the comments are nits. But most of the comments on the tests in particular are quite significant and not just "nits".

23 ↗(On Diff #41175)

Why does IPO now depend on Instrumentation, but didn't previously?
What is different about the PGO passes vs the other instrumentation passes?


Naming convention.


The length of this line does not match the one below.


typo: two spaces after 'the' should be just one space.


Why are they mutually exclusive? Weren't you wanting to the the count profile in MST computation eventually?


typo: "is done"


class PGOGenFunc is no longer present, please update the comment.


These header comments easily go out of date during patch review. I would recommend reading the comment from top to bottom and verifying that it is up to date.


This option is only for testing, so please rename it to make that clear (also update the description).


Do you still need these const_cast<>'s after r253733?


You say "edges" (plural) here but then say "There should be one and only one".

7 ↗(On Diff #41175)

This would be more readable with a non-mangled name. (same for the other tests)

10 ↗(On Diff #41175)

Verify that it is attached to the instruction you expect (CHECK-SAME may be useful).

(same for the other "use" tests)

53 ↗(On Diff #41175)

Will this test fail if the call ends up in the previous or next BB?

I would recommend, for each BB, having a CHECK line verifying the BB name, followed by either CHECK or CHECK-NOT verifying the presence or absence of the increment call.

10 ↗(On Diff #41175)

Please have the names consistent with the switch values.

25 ↗(On Diff #41175)

This test case can be simplified further. At the very least, the function bar is not needed.

49 ↗(On Diff #41175)

The convention for FileCheck variable captures is all upper case (e.g. [[BW2:[0-9]+]]).
Also, can you use more semantic names instead of numbers? E.g. BW_<name> instead of BW<number>.

5 ↗(On Diff #41175)

What part of the code is this test trying to cover that is not covered by loop3 or loop1? Do we need this test?

6 ↗(On Diff #41175)

You can merge the _gen and _use files by using FileCheck's option --check-prefix (e.g. --check-prefix=GEN or --check-prefix=USE).

Actually, doing this is necessary so that there is a single point of truth for the IR used by both.

16 ↗(On Diff #41175)

These add instructions are not needed. Just make the function return void.

xur marked 17 inline comments as done.Nov 30 2015, 1:04 PM

Thanks for Sean's suggestion. They indeed make the tests cleaner and more robust. I integrated his reviews in the latest patch that I'll post soon.

23 ↗(On Diff #41175)

thanks for catching this. This should not be there -- it is only needed for the passmangerbuilder change.

25 ↗(On Diff #41175)

bar is a file static function and was used to check the source name is part of the profile variable name.

xur updated this revision to Diff 41428.Nov 30 2015, 1:08 PM
xur marked 2 inline comments as done.

Integrated Sean's most recently review comments.

The tests are looking great! A couple nits and a final question about the test.


Can you use CHECK-DAG directive to move these next to the place where the BW_* capture occurs? That would improve locality and clarify what is being checked.

(same in the other files)

4 ↗(On Diff #41428)

Tiny nit: could you name all the tests which are just checking diagnostics to match diag_*.ll? (or whatever naming convention seems appropriate)

That will help to clearly identify them.


What is the importance of testing a nested for loop? What part of the code are you trying to exercise that isn't covered by loop1.ll?

davidxl added inline comments.Nov 30 2015, 8:38 PM

Having loop nest in the test for better coverage is fine -- but looks like we don't actually need 3-deep nest -- a 2-deep loop nest is good enough and will be easier to read.

loop1.ll has a loop with top test. How about a loop test with bottom testing. Also a loop with more control flow inside the body, and a loop with early exit might be nice to have -- but those can be added later as follow ups if needed.

xur added a comment.Dec 1 2015, 2:47 PM

It's not clear how can I use CHECK-DAG to group together the branch weight
meta data and the related instruction. if there is a single instruction and
single meta data, I can move up the branch weight meta data next the the
instruction. But if there are multiple, I don't know how to do it. From
the document, CHECK-DAG can be used b/w two matches (or before the first
match, or after the last match). Move the branch-weight meta data and use
USE-DAG is not working.

I also tried to change all use USE to USE (except the first and last). That
won't work either.

As for the loop2, it just a more complex data flow. I'll change it to two
level nexted loop according to David's suggestion.

I used to use a buttom test loop, but I removed based on silvas's
suggestion. I'll add more tests later if necessary.

xur updated this revision to Diff 41686.Dec 2 2015, 3:37 PM

Integrated most recent review comments from Sean, David and Justin.
Let me know if I missed anything.



xur added a comment.Dec 7 2015, 10:12 AM

Sean, Justin and David: Do you have any comment on the latest patch?
If it looks fine to you, I plan to submit it again this week.