
profi - a flow-based profile inference algorithm: Part I (out of 3)
Needs ReviewPublic

Authored by spupyrev on Sep 15 2021, 4:01 PM.

Details

Summary

The benefits of sampling-based PGO crucially depend on the quality of the profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome inaccuracies in a profile after it is collected.

Profi is an extended and significantly re-engineered version of the classic MCMF
(min-cost max-flow) approach suggested by Levin, Newman, and Haber [2008,
Complementing missing and inaccurate profiling using a minimum cost circulation
algorithm]. It models profile inference as an optimization problem on the
control-flow graph, with objectives and constraints capturing the desired
properties of profile data. Profi addresses three important challenges:

  • "fixing" errors in profiles caused by sampling;
  • converting basic block counts to edge frequencies (branch probabilities);
  • dealing with "dangling" blocks having no samples in the profile.
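For readers unfamiliar with the construction of Levin, Newman, and Haber, the modeling step can be sketched as follows. This is a heavily simplified illustration, not the code in SampleProfileInference.cpp: each block b is split into nodes b_in and b_out, the sampled count bounds the zero-cost capacity between them, exceeding it costs a penalty per unit, and CFG jumps become free edges; the names and the single linear penalty are assumptions for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative network edge for the min-cost flow formulation.
struct NetEdge {
  int Src, Dst;
  int64_t Capacity;
  int64_t CostPerUnit; // penalty for sending flow along this edge
};

// Build a toy flow network for a CFG with sampled block counts.
// Block B is split into nodes 2*B (B_in) and 2*B+1 (B_out); flow up to
// the sampled count is free, while any excess pays a cost per unit
// (the full construction also penalizes decreasing a count).
std::vector<NetEdge>
buildNetwork(int NumBlocks, const std::vector<int64_t> &Samples,
             const std::vector<std::pair<int, int>> &Jumps) {
  const int64_t Inf = INT64_MAX / 4;
  std::vector<NetEdge> Net;
  for (int B = 0; B < NumBlocks; ++B) {
    // Zero-cost capacity up to the sampled count of the block.
    Net.push_back({2 * B, 2 * B + 1, Samples[B], 0});
    // Extra flow above the sampled count is allowed but penalized.
    Net.push_back({2 * B, 2 * B + 1, Inf, 1});
  }
  // CFG jumps carry flow between blocks at no cost.
  for (const auto &J : Jumps)
    Net.push_back({2 * J.first + 1, 2 * J.second, Inf, 0});
  return Net;
}
```

Running a min-cost max-flow solver on such a network yields block/jump counts that satisfy flow conservation while deviating as little as possible (under the chosen costs) from the sampled counts.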

The main implementation (and the required docs) are in SampleProfileInference.cpp.
The worst-case time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However, careful engineering and extensive evaluation show that
the observed running time is only (slightly) super-linear; in particular, instances
with 1000 blocks are solved within 0.1 seconds.

The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of the generated profile data and providing
speedups in the range of 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves performance (with a few outliers), but extra work in
the compiler might be needed to re-tune existing optimization passes that rely on
profile counts.

Diff Detail

Unit Tests: Failed

Time | Test
130 ms | x64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-cxa-atexit.S
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-cxa-atexit.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-cxa-atexit.S
110 ms | x64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-static-initializer.S
Script: -- : 'RUN: at line 7'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-static-initializer.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-static-initializer.S
150 ms | x64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-tls.S
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-tls.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-tls.S
10,770 ms | x64 debian > libarcher.races::task-dependency.c
Script: -- : 'RUN: at line 13'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -Wl,-rpath,/var/lib/buildkite-agent/builds/llvm-project/build/lib /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/task-dependency.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/deflake.bash /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp 2>&1 | tee /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp.log | /var/lib/buildkite-agent/builds/llvm-project/build/./bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/openmp/tools/archer/tests/races/task-dependency.c

Event Timeline

spupyrev created this revision.Sep 15 2021, 4:01 PM
spupyrev requested review of this revision.Sep 15 2021, 4:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 15 2021, 4:01 PM
spupyrev edited the summary of this revision. (Show Details)Sep 15 2021, 4:25 PM
spupyrev added reviewers: wenlei, hoy, wlei.
davidxl added a reviewer: xur.Sep 15 2021, 7:37 PM
spupyrev updated this revision to Diff 372986.Sep 16 2021, 10:25 AM

extended tests with 'block-freq' analysis

spupyrev retitled this revision from profi - a flow-based profile inference algorithm: Part I to profi - a flow-based profile inference algorithm: Part I (out of 3).Sep 16 2021, 11:17 AM
spupyrev edited the summary of this revision. (Show Details)

A few high level questions before the review:

  1. Is there a high level description of the 3 sub-patches and their relationship?
  2. does it work without using pseudoprob? Is there a test case?
  3. the test case seems to disable new pm, why is that?
  1. Is there a high level description of the 3 sub-patches and their relationship?

The split into 3 patches is pretty ad hoc; this is done just to simplify reviewing. In order to get good results, we really need all three pieces.
On a high level, the first part (D109860) implements a basic minimum-cost maximum flow algorithm and applies it to sampling-based profile. Two other diffs implement adjustments for the computed flow.
The second part (D109903) makes the computed flow "connected" -- without it hot loops might be ignored by BFI.
The third part (D109980) applies a post-processing for "dangling" basic blocks (having no sample counts in the profile).

  2. does it work without using pseudoprob? Is there a test case?

Yes, just added such a test.

  3. the test case seems to disable new pm, why is that?

This is needed for opt -analyze. Otherwise I get the following error: "Cannot specify -analyze under new pass manager, either specify '-enable-new-pm=0', or use the corresponding new pass manager pass, e.g. '-passes=print<scalar-evolution>'"

wmi added a comment.Mon, Sep 20, 10:37 AM

Thanks Sergey. That looks like good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationships on the CFG? If it can, could you add a test for it?

I will collect some performance numbers w/wo sample-profile-use-profi.

davidxl added inline comments.Tue, Sep 21, 10:59 AM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
909

It is probably better to restructure the code a little more.

Add a helper routine called 'PrepareForPropagation(...)' or 'InitForPropagation'. In this helper:

  1. buildEdges;
  2. compute equivalence classes for non-profi, or do profi-specific initialization (basically, move some code from 'propagateWeights' to here).
920

Move it into a Finalize helper method?
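The two suggested helpers could look roughly like this. The names (prepareForPropagation, finalizePropagation) and the UseProfi flag are hypothetical, illustrating the proposed structure rather than the actual SampleProfileLoaderBaseImpl interface; the toy Log vector only records the call order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the suggested restructuring of propagateWeights.
struct Loader {
  bool UseProfi = false;            // hypothetical mode switch
  std::vector<std::string> Log;     // records the call sequence

  void buildEdges() { Log.push_back("buildEdges"); }
  void computeEquivalenceClasses() { Log.push_back("equivClasses"); }
  void initProfiInference() { Log.push_back("profiInit"); }

  // Shared setup moved out of propagateWeights.
  void prepareForPropagation() {
    buildEdges();
    if (UseProfi)
      initProfiInference();          // profi-specific initialization
    else
      computeEquivalenceClasses();   // existing non-profi path
  }

  // Shared teardown moved out of propagateWeights.
  void finalizePropagation() { Log.push_back("finalize"); }

  void propagateWeights() {
    prepareForPropagation();
    Log.push_back("propagate");      // the propagation itself
    finalizePropagation();
  }
};
```

With this shape, the profi and non-profi paths diverge only inside the prepare step, keeping propagateWeights itself mode-agnostic.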

llvm/lib/Transforms/IPO/SampleProfile.cpp
1561

Is this a limitation? Can it be made to handle multi-graphs?

rajeshwarv added inline comments.Tue, Sep 21, 1:36 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
76

Could you please explain why the time complexity is not O(V*E^2), per http://go/wikip/Edmonds%E2%80%93Karp_algorithm, but is actually O(...) as described here?

Also, what is n?

spupyrev updated this revision to Diff 374304.Wed, Sep 22, 11:23 AM
  • added helper functions following David's suggestions
  • expanded a comment on time complexity of the MCMF algorithm

Thanks Sergey. That looks like good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationships on the CFG? If it can, could you add a test for it?

Do you mind elaborating on this? What kind of test would you like to see? (I may not be familiar with the terminology.)

In general, the algorithm builds a valid "flow" composed of the block/jump counts. That is, the sum of incoming jump counts always equals the sum of outgoing jump counts (except at the source and sinks). Hence, if two blocks are guaranteed to have equal counts, the algorithm will always return equal counts. A few of the tests (implicitly) verify that.
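The conservation property described above can be illustrated with a small standalone check. This is a toy model, not the actual inference code; the Jump struct and single-exit simplification are assumptions of the sketch.

```cpp
#include <cassert>
#include <vector>

// A CFG jump with an (inferred or sampled) execution count.
struct Jump { int Src, Dst; long Count; };

// Check the flow-conservation property the inference guarantees:
// for every block except the entry (source) and the exit (sink),
// incoming jump counts sum to outgoing jump counts.
bool isConservedFlow(int NumBlocks, int Entry, int Exit,
                     const std::vector<Jump> &Jumps) {
  std::vector<long> In(NumBlocks, 0), Out(NumBlocks, 0);
  for (const Jump &J : Jumps) {
    Out[J.Src] += J.Count;
    In[J.Dst] += J.Count;
  }
  for (int B = 0; B < NumBlocks; ++B) {
    if (B == Entry || B == Exit)
      continue;
    if (In[B] != Out[B])
      return false;
  }
  return true;
}
```

For a diamond CFG (0 branches to 1 and 2, both joining at 3), counts {60, 40, 60, 40} on the four jumps are conserved, whereas raw sampled counts like {60, 40, 55, 40} are not; the inference repairs the latter kind of inconsistency.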

I will collect some performance numbers w/wo sample-profile-use-profi.

That would be awesome, thanks!
Tbh, I've seen some weird problems while using the algorithm with AutoFDO. There are a few issues that prevent profi from working as expected:

  • there are no "dangling" blocks in AutoFDO; the corresponding blocks seem to be reported as blocks with 0 counts;
  • some blocks/counts seem to have been duplicated, and there is no concept of a "distribution factor"; profi has difficulties with such incorrect blocks.

Overall we see benefits of using the new inference with CSSPGO (where the above issues are resolved); however, AutoFDO weights/counts need to be polished before the inference.

spupyrev marked 3 inline comments as done.Wed, Sep 22, 11:41 AM
spupyrev added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
1561

The inference algorithm can process multi-graphs -- the question is how we store/process such edges outside of inference. The current implementation (see SampleProfileLoaderBaseImpl::buildEdges) merges all multiway branches into a single edge.
We could probably modify buildEdges to avoid merging, but that would change the existing non-profi functionality. An alternative is to apply this as a post-processing step and keep the existing implementation as is.
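For illustration, the merging behavior described above amounts to something like the following. This is a simplified standalone model, not the actual buildEdges code; the Jump struct is an assumption of the sketch.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct Jump { int Src, Dst; long Count; };

// Merge parallel edges of a multiway branch: all jumps between the
// same (Src, Dst) pair collapse into one edge whose count is the sum
// of the parallel counts, losing the per-branch breakdown.
std::vector<Jump> mergeParallelJumps(const std::vector<Jump> &Jumps) {
  std::map<std::pair<int, int>, long> Merged;
  for (const Jump &J : Jumps)
    Merged[{J.Src, J.Dst}] += J.Count;
  std::vector<Jump> Result;
  for (const auto &[Key, Count] : Merged)
    Result.push_back({Key.first, Key.second, Count});
  return Result;
}
```

For example, a switch with two cases targeting block 1 (counts 3 and 5) plus a default targeting block 2 (count 2) is stored as just two edges, with the 0->1 edge carrying count 8; recovering per-case counts afterwards would require the post-processing mentioned above.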

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
76

Unfortunately, it is not trivial to derive a tight worst-case estimate for our implementation, since we modify the classical algorithm (see in particular lines 206-209). For sure, we're not doing more than v(f) augmentation iterations, and each of them takes at most n*m steps. But I don't know if this worst-case bound can be achieved on some instance.

Furthermore, worst-case complexity is a misleading metric to look at here, since control-flow graphs have a "nice" structure. Typically the observed runtime is slightly super-linear.

I've updated the comment, feel free to suggest more explanations.

wmi added a comment.Thu, Sep 23, 9:02 AM

Thanks Sergey. That looks like good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationships on the CFG? If it can, could you add a test for it?

Do you mind elaborating on this? What kind of test would you like to see? (I may not be familiar with the terminology.)

For example, when block A dominates block B and block B post-dominates block A, the two blocks must have equal execution frequency. That is what SampleProfileLoaderBaseImpl<BT>::findEquivalenceClasses does. I just want to know whether profi covers this case, since profi could be used to replace the existing weight-propagation logic.

In general, the algorithm builds a valid "flow" composed of the block/jump counts. That is, the sum of incoming jump counts always equals the sum of outgoing jump counts (except at the source and sinks). Hence, if two blocks are guaranteed to have equal counts, the algorithm will always return equal counts. A few of the tests (implicitly) verify that.

Ok, sounds like profi already covers the case described above.

I will collect some performance numbers w/wo sample-profile-use-profi.

That would be awesome, thanks!
Tbh, I've seen some weird problems while using the algorithm with AutoFDO. There are a few issues that prevent profi from working as expected:

  • there are no "dangling" blocks in AutoFDO; the corresponding blocks seem to be reported as blocks with 0 counts;
  • some blocks/counts seem to have been duplicated, and there is no concept of a "distribution factor"; profi has difficulties with such incorrect blocks.

Overall we see benefits of using the new inference with CSSPGO (where the above issues are resolved); however, AutoFDO weights/counts need to be polished before the inference.

I got the test results for our search benchmark. Enabling sample-profile-use-profi shows a small regression of 0.3%. That may be related to the issues you described above.

spupyrev marked an inline comment as done.Tue, Sep 28, 11:58 AM

I got the test results for our search benchmark. Enabling sample-profile-use-profi shows a small regression of 0.3%. That may be related to the issues you described above.

Thanks for running those experiments; the results align with what I've seen in my benchmarks.
At this point, I think we should treat the inference algorithm as part of CSSPGO, where it provides benefits. Making the inference work with AutoFDO is an important but separate (and potentially time-consuming) task. I'd try to address the regression in future diffs.

MTC added a subscriber: MTC.Wed, Oct 6, 4:56 PM