
[PGO] Register promote profile counter updates

Authored by davidxl on Jun 10 2017, 8:42 PM.



Two of the biggest problems with the current PGO instrumentation are 1) the performance of instrumented multi-threaded programs and 2) profile counter precision for multi-threaded programs. Both are due to contention created by shared profile counter updates in hot regions of the program. In a multi-threaded program with a work-sharing loop, instrumentation makes increasing the number of threads actually slow the program down significantly -- from a 10s elapsed time with one thread to > 5min with 16 threads. Worse, the hottest block count is 4000000000 with 1 thread but drops to only 177119367 with 32 threads -- 95% of the counts are lost to data races. Using atomic RMW operations is one way to fix the precision problem, but it greatly slows down the instrumented program (see data below).

This patch implements loop-based register promotion for counter updates. It makes use of the existing SSA updater utility (load/store promotion) and isolates the change inside the lowerer, without exposing the aliasing properties of the counter variables to any existing optimizer components. The lowerer has full knowledge of the counters and requires very little analysis. It supports speculative code motion and works at lower optimization levels.

With this patch, the performance of the multi-threaded program mentioned above improves greatly. For one thread, it speeds up the program by 22%. For 16 threads, the elapsed time is only 0.9s, a > 300x speedup compared to without the patch. The profile counter precision is also greatly improved: with 32 threads, the hottest block count is 3996000000, so only 0.09% of the counts are lost.

The patch speeds up single threaded instrumentation binary performance as well.

Here are the SPEC2000 int numbers:

164.gzip      -0.81%
175.vpr        2.05%
176.gcc       11.08%
181.mcf       -0.61%
186.crafty    -3.46%
197.parser     4.90%
252.eon       18.00%
253.perlbmk   11.21%
255.vortex    -0.04%
256.bzip2      8.67%
300.twolf      3.89%

Here are the SPEC2006 numbers:

400.perlbench    -1.87%
401.bzip2        16.98%
403.gcc           4.82%
429.mcf          12.88%
445.gobmk         1.83%
456.hmmer        12.48%
458.sjeng        -0.19%
462.libquantum   28.09%
464.h264ref       6.49%
471.omnetpp       1.21%
473.astar         8.31%
483.xalancbmk     0.95%
450.soplex       12.35%
447.dealII        6.33%
453.povray       -3.29%
444.namd          1.88%

I did some analysis on the povray regression: the program has a few hot loops with some blocks guarded by input flags that are never executed. Hoisting the counter updates outside the loop thus increases the dynamic instruction count.

Lastly, here are the SPEC2000 numbers using atomic fetch-add, compared against the baseline without this patch:

164.gzip     -14.02%
175.vpr      -16.02%
176.gcc      -14.15%
181.mcf       -4.48%
186.crafty   -44.98%
197.parser   -11.69%
252.eon      -13.79%
253.perlbmk    6.26%
255.vortex    -4.70%
256.bzip2     -4.06%
300.twolf    -17.24%

Event Timeline

davidxl created this revision.Jun 10 2017, 8:42 PM

Did you do an analysis of whether this produces the same counter values for single-threaded code?

It looks like the way you're doing hoisting doesn't interact correctly with loops that don't exit normally; if a call in the loop throws an exception or calls exit(), the counter values will be off.

See the summary for precision data. With this patch, the multi-threaded counter values are much closer to the single-threaded case, with very little loss.

It is unlikely that loops with calls to noreturn functions are on the hot paths. Regardless, the patch already disables promotion for such loops (those with multiple exiting blocks). If we want to improve training-run performance for such loops, we can add an enhancement to insert counter flushes before the noreturn calls.

vsk edited edge metadata.Jun 13 2017, 11:06 AM

This is a really interesting patch. I haven't worked with SSAUpdater before, so I think it would be best if someone else reviewed that part of the patch.

On another note, do you know if counter promotion could be made to work with frontend-based instrumentation?

64 ↗(On Diff #102128)

It may be more convenient to introduce 'using LoadStorePair = ...'

75 ↗(On Diff #102128)

Could you write this with a dash, e.g. 'Register-promote'?

101 ↗(On Diff #102128)

Would cl::Optional be more appropriate?

111 ↗(On Diff #102128)

The "do-counter-promotion" flag is ternary, and should be simplified. I'd prefer a cl::opt which specifies whether counter promotion is "allowed", and an InstrProfOptions field which specifies whether it is done. In promoteCounterLoadStores, you could write:

if (!AllowCounterPromotion || !Opts.DoCounterPromotion)
148 ↗(On Diff #102128)

It would help to have brief doxygen lines for the two new classes.

221 ↗(On Diff #102128)

const auto &Cand?

313 ↗(On Diff #102128)

const auto &LoadStore?

480 ↗(On Diff #102128)

We shouldn't update the promotion candidates vector if counter promotion is disabled.

3 ↗(On Diff #102128)

Is the global needed?

40 ↗(On Diff #102128)

You may need to add 'requires: asserts' to tests which refer to bb or variable names.

52 ↗(On Diff #102128)

In this test, the CFG looks like:

  /  \
4      5
|     / \
|    7    8
 \  /____/ 
   9 --- loop ---> 2

Is this the simplest loop where ld/st promotion applies?

52 ↗(On Diff #102128)

Could you add checks for the operands to the adds? It would be useful to have checks for the PHIs incremented in the loop body, to ensure they make their way to the correct 'add'.

52 ↗(On Diff #102128)

Could you add a check-not for further uses of __profc*?

72 ↗(On Diff #102128)

It would help to have 'check-nots' for further updates to __profc* in this test as well.

15 ↗(On Diff #102128)

Why are the IR checks needed in this test? IMHO the interesting thing about this test is the "diff", not the IR checks, which are probably better off in test/Transforms/PGOProfile.

39 ↗(On Diff #102128)

I have the same comment about the IR checks here.

41 ↗(On Diff #102128)

Would placing a nested for loop here increase test coverage?

davidxl marked 7 inline comments as done.Jun 13 2017, 1:25 PM

Regarding turning this on for FE instrumentation: it can be done, but probably as a follow-up (which also needs loop canonicalization and rotation). More coverage-test performance data (possibly at O0) also needs to be collected.

101 ↗(On Diff #102128)

Sometimes the build system can append multiple instances of a common flag to the command line. Without ZeroOrMore, the driver reports an error.

111 ↗(On Diff #102128)

We need the ability to use this option to override the default pipeline option (forcing it either on or off). Your proposal does not seem to accomplish that.

3 ↗(On Diff #102128)


40 ↗(On Diff #102128)

The names of the counter vars seem to be kept with the non-asserts compiler. Any idea why?

52 ↗(On Diff #102128)

The purpose is to test multiple counters being hoisted.

52 ↗(On Diff #102128)


15 ↗(On Diff #102128)

It is an end-to-end test to make sure every single step in the pipeline (from the FE on) is covered. If the transformation does not happen, it makes no sense to do the diff any more. The LLVM tests only cover the lowering part.

39 ↗(On Diff #102128)

I think it is better to keep them here unless it creates big overhead, which I doubt.

41 ↗(On Diff #102128)

Probably a little -- it tests the behavior of the promoter with multiple loops in the same nest.

davidxl updated this revision to Diff 102392.Jun 13 2017, 1:26 PM

Addressed Vedant's comments.

vsk added a comment.Jun 13 2017, 4:56 PM

Regarding turning this on for FE instrumentation: it can be done, but probably as a follow-up (which also needs loop canonicalization and rotation). More coverage-test performance data (possibly at O0) also needs to be collected.

I'd be willing to help out with this, unless you already have plans to get to it.

170 ↗(On Diff #102392)

This may be a basic question, but why is it that SSA.GetValueInMiddleOfBlock(ExitBlock) gives the right increment coming from the promoted counter for {L, S}? It may help to have a short comment about this for readers who are unfamiliar with SSAUpdater (like me).

101 ↗(On Diff #102128)


111 ↗(On Diff #102128)

I think the 'isCounterPromotionEnabled' method makes things clear enough

40 ↗(On Diff #102128)

Ah, I think ssa var names are kept, but basic block label names are not.

52 ↗(On Diff #102128)

I think this is still missing

davidxl marked an inline comment as done.Jun 13 2017, 11:07 PM

Added one more test case to show the merged live-in values from multiple predecessors in the loop.

170 ↗(On Diff #102392)

Added a comment. The name of the API is a little misleading -- it really means getting the live-in value (from the promoted counter stores) to the exit block.

davidxl updated this revision to Diff 102483.Jun 13 2017, 11:10 PM

Addressed review comments.

davidxl updated this revision to Diff 102683.Jun 15 2017, 10:09 AM

Added another debug option.

vsk added a comment.Jun 20 2017, 12:46 PM

For due diligence I tested out this patch on a simple multi-threaded program which runs 10^7 dot-products in 8 different threads. Post-patch, I noticed that the profile counts are much more consistent. Register-promotion kicked in and the optimizer was able to hoist some counter increments out of the loop.

Review has stalled, so I'll try looking into the SSAUpdater bits.

vsk added a comment.Jun 20 2017, 1:39 PM

I had a look through SSAUpdater and it seems like this is doing something totally reasonable. This lgtm with some changes pointed out inline.

183 ↗(On Diff #102683)

There should be some test coverage for the atomic-counter-update-promoted option.

207 ↗(On Diff #102683)

Would a SmallPtrSet<..., 8> be more appropriate here, given the definition for LoopExitBlocks?

davidxl updated this revision to Diff 103856.Jun 24 2017, 5:25 PM

Update patch according to Vedant's feedback.

FWIW, the usage of the updater seems correct.
Please note that the updater has O(N^3) worst-case performance in some cases [1].
I don't think this will show up here, but you might want to double check the impact on compile time.

[1] e.g. run 'llvm-stress -instructions=50000 | ./opt -mem2reg -lcssa -time-passes' inside a profiler.

This revision was automatically updated to reflect the committed changes.