This is an archive of the discontinued LLVM Phabricator instance.

profi - a flow-based profile inference algorithm: Part I (out of 3)
ClosedPublic

Authored by spupyrev on Sep 15 2021, 4:01 PM.

Details

Summary

The benefits of sampling-based PGO crucially depends on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome the inaccuracies in a profile after it is collected.

Profi is an extended and significantly re-engineered classic MCMF (min-cost
max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing
missing and inaccurate profiling using a minimum cost circulation algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Three important challenges that are being solved by profi:

  • "fixing" errors in profiles caused by sampling;
  • converting basic block counts to edge frequencies (branch probabilities);
  • dealing with "dangling" blocks having no samples in the profile.

The main implementation (and required docs) are in SampleProfileInference.cpp.
The worst-time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However a careful engineering and extensive evaluation shows that
the running time is (slightly) super-linear. In particular, instances with
1000 blocks are solved within 0.1 second.

The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves the performance (with a few outliers) but extra work in
the compiler might be needed to re-tune existing optimization passes relying on
profile counts.

UPD Dec 1st 2021:

  • synced the declaration and definition of the option SampleProfileUseProfi to use type cl::opt<bool;
  • added inline for SampleProfileInference<BT>::findUnlikelyJumps and SampleProfileInference<BT>::isExit to avoid linking problems on windows.

Diff Detail

Event Timeline

spupyrev created this revision.Sep 15 2021, 4:01 PM
spupyrev requested review of this revision.Sep 15 2021, 4:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 15 2021, 4:01 PM
spupyrev edited the summary of this revision. (Show Details)Sep 15 2021, 4:25 PM
spupyrev added reviewers: wenlei, hoy, wlei.
davidxl added a reviewer: xur.Sep 15 2021, 7:37 PM
spupyrev updated this revision to Diff 372986.Sep 16 2021, 10:25 AM

extended tests with 'block-freq' analysis

spupyrev retitled this revision from profi - a flow-based profile inference algorithm: Part I to profi - a flow-based profile inference algorithm: Part I (out of 3).Sep 16 2021, 11:17 AM
spupyrev edited the summary of this revision. (Show Details)

A few high level questions before the review:

  1. Is there a high level description of the 3 sub-patches and their relationship?
  2. does it work without using pseudoprob? Is there a test case?
  3. the test case seems to disable new pm, why is that?
  1. Is there a high level description of the 3 sub-patches and their relationship?

The split into 3 patches is pretty ad hoc; this is done just to simplify reviewing. In order to get good results, we really need all three pieces.
On a high level, the first part (D109860) implements a basic minimum-cost maximum flow algorithm and applies it to sampling-based profile. Two other diffs implement adjustments for the computed flow.
The second part (D109903) makes the computed flow "connected" -- without it hot loops might be ignored by BFI.
The third part (D109980) applies a post-processing for "dangling" basic blocks (having no sample counts in the profile).

  1. does it work without using pseudoprob? Is there a test case?

Yes, just added such a test.

  1. the test case seems to disable new pm, why is that?

This is needed for opt -analyze. Otherwise I get the following exception: Cannot specify -analyze under new pass manager, either specify '-enable-new-pm=0', or use the corresponding new pass manager pass, e.g. '-passes=print<scalar-evolution>'

wmi added a comment.Sep 20 2021, 10:37 AM

Thanks Sergey. That looks like a good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationship on the CFG? If it can, could you add a test for it?

I will collect some performance number w/wo sample-profile-use-profi.

davidxl added inline comments.Sep 21 2021, 10:59 AM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
898

It is probably better to restructure the code a little more.

Add a helper routine called "PrepareForPropagation(..)' or "InitForPropagation'. In this helper

  1. buildEdges;
  2. compute equiv class for non-profi or profi specific initialization (basically move some code from propagateWeights' to here.
909

Move it into a Finalize helper method?

llvm/lib/Transforms/IPO/SampleProfile.cpp
1652

Is this a limitation? Can it be made to handle multi-graph?

rajeshwarv added inline comments.Sep 21 2021, 1:36 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
76

Could you please why the time complexity is not O(V*E^2) per http://go/wikip/Edmonds%E2%80%93Karp_algorithm and it is actually O(...) as described here?

Also, what is n?

spupyrev updated this revision to Diff 374304.Sep 22 2021, 11:23 AM
  • added helper functions following David's suggestions
  • expanded a comment on time complexity of the MCMF algorithm

Thanks Sergey. That looks like a good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationship on the CFG? If it can, could you add a test for it?

Do you mind to elaborate more on this? What kind of a test would you like to see? (i may not be familiar with the terminology)

In general, the algorithm builds a valid "flow" comprised of the block/jump counts. That is, the sum of incoming jump counts always equals the sum of outgoing jump counts (except for the source and sinks). Hence, if two blocks are guaranteed to have equal counts, the algorithm will always return equal counts. A few of the tests are (implicitly) verifying that.

I will collect some performance number w/wo sample-profile-use-profi.

That would be awesome, thanks!
Tbh, I've seen some weird problems while using the algorithm with AutoFDO. There are a few issues that make profi to work not as expected:

  • there are no "dangling" blocks in AutoFDO; the corresponding blocks seem to be reported as blocks with 0 counts;
  • some blocks/counts seem to have been duplicated and there is no concept of "distribution factor"; profi has difficulties with such incorrect blocks.

Overall we see benefits of using the new inference with CSSPGO (where the above issues are resolved); however, AutoFDO weights/counts need to be polished before the inference.

spupyrev marked 3 inline comments as done.Sep 22 2021, 11:41 AM
spupyrev added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
1652

The inference algorithm can process multi-graphs -- the questions is how do we store/process such edges outside of inference. The current implementation (see SampleProfileLoaderBaseImpl::buildEdges) merges all multiway branches into a single edge.
We could probably modify buildEdges and avoid merging, but that will change the existing non-profi functionality. An alternative is to apply this post-processing and keep the existing implementation as is.

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
76

Unfortunately, it is not trivial to derive a tight worst-case estimation of our implementation, since we modify the classical algorithm (see in particular lines 206-209). For sure we're not doing more than v(f) augmentation iterations and each of them takes at most n*m steps. But I don't know if this worst-case bound can be achieved on some instance.

Furthermore, the worst-case complexity is a misleading concept to look into, since control-flow graphs have a "nice" structure. Typically the observed runtime is slightly super-linear.

I've updated the comment, feel free to suggest more explanations.

wmi added a comment.Sep 23 2021, 9:02 AM

Thanks Sergey. That looks like a good work to improve the consistency of the profile. Have you checked whether the new algorithm can infer the missing parts based on equivalence relationship on the CFG? If it can, could you add a test for it?

Do you mind to elaborate more on this? What kind of a test would you like to see? (i may not be familiar with the terminology)

Like block A dominates block B and block B post dominates block A, then the two blocks must have equal execution frequency. That is what SampleProfileLoaderBaseImpl<BT>::findEquivalenceClasses does. I just want to know whether profi could cover it since profi could be used to replace the existing weight propagation logic.

In general, the algorithm builds a valid "flow" comprised of the block/jump counts. That is, the sum of incoming jump counts always equals the sum of outgoing jump counts (except for the source and sinks). Hence, if two blocks are guaranteed to have equal counts, the algorithm will always return equal counts. A few of the tests are (implicitly) verifying that.

Ok, sounds like profi already cover the case described above.

I will collect some performance number w/wo sample-profile-use-profi.

That would be awesome, thanks!
Tbh, I've seen some weird problems while using the algorithm with AutoFDO. There are a few issues that make profi to work not as expected:

  • there are no "dangling" blocks in AutoFDO; the corresponding blocks seem to be reported as blocks with 0 counts;
  • some blocks/counts seem to have been duplicated and there is no concept of "distribution factor"; profi has difficulties with such incorrect blocks.

Overall we see benefits of using the new inference with CSSPGO (where the above issues are resolved); however, AutoFDO weights/counts need to be polished before the inference.

I get the test result for our search benchmark. Enabling sample-profile-use-profi has a small regression - 0.3%. That may be related with the issues you described above.

spupyrev marked an inline comment as done.Sep 28 2021, 11:58 AM

I get the test result for our search benchmark. Enabling sample-profile-use-profi has a small regression - 0.3%. That may be related with the issues you described above.

Thanks for running those experiments, the results align with what I've seen in my benchmarks.
At this point I think we should think of the inference algorithm as a part of CSSPGO, where it provides benefits. Making the inference work with AutoFDO is an important but a separate (and potentially time-consuming) task. I'd try to address the regression in future diffs.

MTC added a subscriber: MTC.Oct 6 2021, 4:56 PM
hoy added a comment.Oct 19 2021, 5:20 PM

I just left a couple comments about minor issues. @davidxl @xur @rajeshwarv We have done an around of code review internally. The current patch looks good to me in general. Please let the author know if you have more comments. Thanks.

llvm/include/llvm/Transforms/Utils/SampleProfileInference.h
11

nit: comment the profi abbreviation and its full name here?

llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
148

nit: no need to use the inline keyword.

898

We still need this for MIR even if SampleProfileUseProfi is true.

I get the test result for our search benchmark. Enabling sample-profile-use-profi has a small regression - 0.3%. That may be related with the issues you described above.

Thanks for running those experiments, the results align with what I've seen in my benchmarks.
At this point I think we should think of the inference algorithm as a part of CSSPGO, where it provides benefits. Making the inference work with AutoFDO is an important but a separate (and potentially time-consuming) task. I'd try to address the regression in future diffs.

This sounds reasonable to me. We have turned on profi by default for CSSPGO internally, and we can leave it off for AutoFDO for now.

And as Hongtao mentioned, we have reviewed the implementation internally, so this is mostly for other reviewers to take a look.

spupyrev updated this revision to Diff 382093.Oct 25 2021, 1:04 PM

-rebase

  • hoy's comments
hoy accepted this revision.Oct 25 2021, 1:24 PM

LGTM, thanks.

This revision is now accepted and ready to land.Oct 25 2021, 1:24 PM
wenlei added inline comments.Nov 1 2021, 9:40 PM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
757

Performance and count quality aside, can we make it work with MIR here just to make sure functionality is there. i.e. we can probably have a test case for MIR+Profi using FS-AFDO. Making sure it generates better count for MIR can be dealt with separately later.

spupyrev updated this revision to Diff 388517.Nov 19 2021, 8:51 AM
spupyrev marked 4 inline comments as done.

addressing wenlei's comment: making profi to work MIR

spupyrev updated this revision to Diff 388521.Nov 19 2021, 8:55 AM
spupyrev marked an inline comment as done.
  • modified comments
wenlei accepted this revision.Nov 19 2021, 9:21 AM

lgtm except some comment nits, thanks

llvm/include/llvm/Transforms/Utils/SampleProfileInference.h
109

same here.

274

nit, fix comment? same for line 278

This revision was landed with ongoing or failed builds.Nov 23 2021, 9:09 AM
This revision was automatically updated to reflect the committed changes.

Our windows buildbot failed the link with:

cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\mlir\unittests\ExecutionEngine\CMakeFiles\MLIRExecutionEngineTests.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100177~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100177~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~3\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\MLIRExecutionEngineTests.rsp  /out:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe /implib:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.lib /pdb:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.pdb /version:0.0 /machine:x64 /STACK:10000000 /INCREMENTAL:NO /subsystem:console  && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\MLIRExecutionEngineTests.rsp /out:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe /implib:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.lib /pdb:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.pdb /version:0.0 /machine:x64 /STACK:10000000 /INCREMENTAL:NO /subsystem:console /MANIFEST /MANIFESTFILE:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe.manifest" failed (exit code 1120) with the following output:
LLVMCodeGen.lib(MIRSampleProfile.cpp.obj) : error LNK2019: unresolved external symbol "class llvm::cl::opt<unsigned int,0,class llvm::cl::parser<unsigned int> > llvm::SampleProfileUseProfi" (?SampleProfileUseProfi@llvm@@3V?$opt@I$0A@V?$parser@I@cl@llvm@@@cl@1@A) referenced in function "protected: bool __cdecl llvm::SampleProfileLoaderBaseImpl<class llvm::MachineBasicBlock>::computeAndPropagateWeights(class llvm::MachineFunction &,class llvm::DenseSet<unsigned __int64,struct llvm::DenseMapInfo<unsigned __int64,void> > const &)" (?computeAndPropagateWeights@?$SampleProfileLoaderBaseImpl@VMachineBasicBlock@llvm@@@llvm@@IEAA_NAEAVMachineFunction@2@AEBV?$DenseSet@_KU?$DenseMapInfo@_KX@llvm@@@2@@Z)
tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe : fatal error LNK1120: 1 unresolved externals

FYI

hoy added a comment.Nov 23 2021, 12:11 PM

Our windows buildbot failed the link with:

cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\mlir\unittests\ExecutionEngine\CMakeFiles\MLIRExecutionEngineTests.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100177~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100177~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~3\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\MLIRExecutionEngineTests.rsp  /out:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe /implib:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.lib /pdb:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.pdb /version:0.0 /machine:x64 /STACK:10000000 /INCREMENTAL:NO /subsystem:console  && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\MLIRExecutionEngineTests.rsp /out:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe /implib:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.lib /pdb:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.pdb /version:0.0 /machine:x64 /STACK:10000000 /INCREMENTAL:NO /subsystem:console /MANIFEST /MANIFESTFILE:tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe.manifest" failed (exit code 1120) with the following output:
LLVMCodeGen.lib(MIRSampleProfile.cpp.obj) : error LNK2019: unresolved external symbol "class llvm::cl::opt<unsigned int,0,class llvm::cl::parser<unsigned int> > llvm::SampleProfileUseProfi" (?SampleProfileUseProfi@llvm@@3V?$opt@I$0A@V?$parser@I@cl@llvm@@@cl@1@A) referenced in function "protected: bool __cdecl llvm::SampleProfileLoaderBaseImpl<class llvm::MachineBasicBlock>::computeAndPropagateWeights(class llvm::MachineFunction &,class llvm::DenseSet<unsigned __int64,struct llvm::DenseMapInfo<unsigned __int64,void> > const &)" (?computeAndPropagateWeights@?$SampleProfileLoaderBaseImpl@VMachineBasicBlock@llvm@@@llvm@@IEAA_NAEAVMachineFunction@2@AEBV?$DenseSet@_KU?$DenseMapInfo@_KX@llvm@@@2@@Z)
tools\mlir\unittests\ExecutionEngine\MLIRExecutionEngineTests.exe : fatal error LNK1120: 1 unresolved externals

FYI

Thanks for reporting this issue. It looks like somehow MLIRExecutionEngineTests references LLVMCodeGen.lib(MIRSampleProfile.cpp.obj) , which in turn references llvm::SampleProfileUseProfi that is defined in LLVMTransformUtils. Could it be a possible missing dependency on the MLIRExecutionEngineTests side?

I see the definition being cl::opt<bool> SampleProfileUseProfi( while the declaration is extern cl::opt<unsigned> SampleProfileUseProfi;

llvm/lib/Transforms/Utils/SampleProfileLoaderBaseUtil.cpp
37

This is defined as cl::opt<bool>

hoy added a comment.Nov 23 2021, 12:22 PM

I see the definition being cl::opt<bool> SampleProfileUseProfi( while the declaration is extern cl::opt<unsigned> SampleProfileUseProfi;

Good catch! That should be the culprit. The declaration should be cl::opt<bool> as well. Could you kindly test it for us since we don't have a Windows environment? Thanks!

I don't have a Windows environment either.

You could try to push at a time when you can afford to monitor a builder like https://lab.llvm.org/buildbot/#/builders/13 until your commit is processed.

That said it seems that previous pre-merge checks run on this revision were failing on Windows, can you try to re-open this revision, update the patch, and see if the pre-merge checks are passing?

spupyrev reopened this revision.Dec 1 2021, 10:29 AM
This revision is now accepted and ready to land.Dec 1 2021, 10:29 AM
spupyrev updated this revision to Diff 391077.Dec 1 2021, 10:42 AM

fixed an incorrect option type

spupyrev edited the summary of this revision. (Show Details)Dec 1 2021, 12:19 PM
hoy added a comment.Dec 1 2021, 12:32 PM

That said it seems that previous pre-merge checks run on this revision were failing on Windows, can you try to re-open this revision, update the patch, and see if the pre-merge checks are passing?

Thanks for the suggestion. The failing tests on Windows seem unrelated. Will land this.

This revision was landed with ongoing or failed builds.Dec 1 2021, 3:31 PM
This revision was automatically updated to reflect the committed changes.