This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Even count distribution
ClosedPublic

Authored by spupyrev on Jan 31 2022, 11:47 AM.

Details

Summary

This is a modification of the count inference algorithm for inferring missing
counts of basic blocks and jumps based on sampling-based weights (e.g.,
collected via pseudo-probes). The goal is to distribute counts evenly among
the control-flow graph when there are multiple equally likely options. This is
in contrast with the earlier version, which created a valid flow in the CFG using
an arbitrarily chosen set of blocks/jumps.
Algorithmically, the change is a modification of the min-cost max-flow approach
that works by iteratively increasing flow along augmenting paths. In the new
version, we instead find an augmenting DAG (directed acyclic graph) and push the
flow evenly along all edges of the DAG. As a result, the counts are distributed
evenly not only among dangling (w/o sample weights) blocks but also among regular
basic blocks.

The algorithm is controlled by a new flag sample-profile-even-count-distribution
(default ON).

Evaluation:
This has been tested on several benchmarks (with CSSPGO):

  • For smallish SPEC17 binaries, I do not see a perf difference;
  • For clang-10 (17MB of hot code), there is a 0.5% - 1.5% improvement;
  • For gcc-8 (8MB of hot code), the win is in the range 0% - 0.5%.

In addition, added a few tests to verify correctness on simple instances.

Build-time impact:
With the change, the inference algorithm is ~2x-4x slower than the previous
version for (very) large binaries. However, the runtime is still close to linear,
and I did not observe any build-time regression.

Diff Detail

Event Timeline

spupyrev created this revision.Jan 31 2022, 11:47 AM
spupyrev requested review of this revision.Jan 31 2022, 11:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 31 2022, 11:47 AM
spupyrev retitled this revision from [CSSPGO] Even flow distribution to [CSSPGO] Even count distribution.Jan 31 2022, 12:07 PM
spupyrev edited the summary of this revision. (Show Details)
spupyrev added reviewers: wenlei, hoy, wlei.
spupyrev edited the summary of this revision. (Show Details)Jan 31 2022, 12:10 PM
spupyrev added a comment.EditedFeb 1 2022, 7:33 AM

Detailed perf measurements on clang, release_10 (speedups over the previous version of inference):
with CSSPGO+LTO:

benchmark1: 1.52% (stat sig)
benchmark2: 0.84% (stat sig)

with AutoFDO+LTO:

benchmark1: 1.72% (stat sig)
benchmark2: 0.61% (stat sig)

Perf measurements on gcc, release_83 (speedups over the previous version of inference):
with CSSPGO+LTO:

benchmark1: 0.16% (non-stat sig)
benchmark2: 0.34% (stat sig)
hoy added a comment.Feb 13 2022, 11:41 PM

Thanks for working on the smart algorithm and the offline illustration of the high-level idea. The performance win on clang-10 looks promising.

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
197

nit: name it computeAugmentingPathCapacity?

301

nit: Edges[NodeIdx][EdgeIdx]?

333

Should the Discovery field be a boolean? Looks like this is the only place using it.

333

I was wondering if setting SampleProfileMaxDfsCalls=1 makes sense. It looks like whether Dst is reached does not depend on which predecessor the search starts with. Correct me if I'm wrong.

335

nit: Stack.emplace(Edge.Dst, 0)

354

Assert that the stack is not empty at this point?

398

Please add a message for the assertion.

482

What is an incident node?

spupyrev updated this revision to Diff 409292.Feb 16 2022, 9:21 AM
spupyrev marked 6 inline comments as done.
spupyrev edited the summary of this revision. (Show Details)

review comments

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
333

Quoting my colleague:

I would keep it the way it is. Discovery and Finish times are well-known DFS concepts. Of course, here they have been adapted to our modified version, which can visit a node more than once. The marginal savings of switching to bool are not worth it.

333

That is a good point. We actually ran experiments with the flag but found that the default value of 10 is fine.

The tradeoff here is quality vs. speed: the more iterations we apply, the higher quality we get. We certainly need more than one iteration (thus, SampleProfileMaxDfsCalls>1); otherwise we get poor results. Increasing it beyond 10-20 doesn't significantly improve the quality, and the time savings of reducing the value from 10 to, say, 5 are minor.

spupyrev updated this revision to Diff 410543.Feb 22 2022, 8:26 AM

faster convergence

hoy accepted this revision.Feb 25 2022, 5:47 PM

LGTM.

This revision is now accepted and ready to land.Feb 25 2022, 5:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 7:37 AM
spupyrev updated this revision to Diff 412499.Mar 2 2022, 11:32 AM

update broken test

This revision was landed with ongoing or failed builds.Mar 2 2022, 1:12 PM
This revision was automatically updated to reflect the committed changes.