This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] using jump weights in profi
ClosedPublic

Authored by spupyrev on Dec 12 2022, 11:58 AM.

Details

Summary

We want to use profile inference (profi) in BOLT for stale profile matching.
This is the second change for existing usages of profi (e.g., CSSPGO):

(i) Added the ability to provide (estimated) jump weights to the algorithm. The
goal of the algorithm is to produce a valid control flow for a given function
(that is, one in which the incoming count equals the outgoing count for every
basic block) while minimally modifying the original input block and jump
weights. In BOLT, the input jump weights will be derived from the collected LBR
profiles.

(ii) Added the corresponding options to ProfiParams.

(iii) Slightly modified/simplified the construction of the flow network in profi
so that it uses fewer auxiliary nodes. This is done by introducing parallel
edges in the network (which are supported by MMF) and reduces the size of the
network from 3*|V| to 2*|V|, where |V| is the number of basic blocks in the
function.
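
The "valid control flow" invariant from (i) can be sketched as follows (a
minimal Python illustration with hypothetical data structures, not BOLT's):

```python
# A minimal sketch (hypothetical data structures, not BOLT's) of the
# "valid control flow" invariant that profi enforces: for every basic
# block, the sum of incoming jump weights equals the block count, which
# in turn equals the sum of outgoing jump weights (the entry block skips
# the incoming check and exit blocks skip the outgoing one).

def is_valid_flow(block_counts, jump_weights, entry, exits):
    """block_counts: {block: count}; jump_weights: {(src, dst): weight}."""
    for b, count in block_counts.items():
        incoming = sum(w for (s, d), w in jump_weights.items() if d == b)
        outgoing = sum(w for (s, d), w in jump_weights.items() if s == b)
        if b != entry and incoming != count:
            return False
        if b not in exits and outgoing != count:
            return False
    return True

# A tiny diamond CFG: A -> B, A -> C, B -> D, C -> D.
counts = {"A": 100, "B": 60, "C": 40, "D": 100}
jumps = {("A", "B"): 60, ("A", "C"): 40, ("B", "D"): 60, ("C", "D"): 40}
assert is_valid_flow(counts, jumps, entry="A", exits={"D"})

# Perturbing a single jump weight breaks conservation at B and D.
jumps[("B", "D")] = 50
assert not is_valid_flow(counts, jumps, entry="A", exits={"D"})
```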
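
The node-count reduction from (iii) can be illustrated schematically (a
hypothetical Python sketch; the exact shape of the old auxiliary-node
construction is inferred from the summary, not taken from the LLVM code):

```python
# Schematic comparison of the two constructions (hypothetical, for
# illustration only). The old one split each basic block B into three
# nodes (Bin, Baux, Bout), routing flow through the auxiliary node; the
# new one keeps only (Bin, Bout) and replaces the auxiliary node with
# parallel Bin -> Bout edges carrying different costs, which the MMF
# solver supports directly.

def old_network(blocks):
    nodes, edges = [], []
    for b in blocks:
        nodes += [f"{b}_in", f"{b}_aux", f"{b}_out"]
        # two serial edges via the auxiliary node
        edges += [(f"{b}_in", f"{b}_aux"), (f"{b}_aux", f"{b}_out")]
    return nodes, edges

def new_network(blocks):
    nodes, edges = [], []
    for b in blocks:
        nodes += [f"{b}_in", f"{b}_out"]
        # two parallel edges (e.g., "increase" vs "decrease" cost)
        edges += [(f"{b}_in", f"{b}_out"), (f"{b}_in", f"{b}_out")]
    return nodes, edges

blocks = [f"B{i}" for i in range(5)]
assert len(old_network(blocks)[0]) == 3 * len(blocks)  # 3*|V| nodes
assert len(new_network(blocks)[0]) == 2 * len(blocks)  # 2*|V| nodes
```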

Inference (profile quality) impact:
The diff is intended to be a no-op for the inferred counts. However, our
implementation of MCF is not fully deterministic and might return different
results depending on the input network model. Since we changed the model
construction, there are a few differences in comparison to the original
implementation. I checked manually on an internal benchmark and see a minor
difference (+/- 1 count for certain basic blocks) in just a dozen instances
(out of 10000+ input functions). Hence, the diff is highly unlikely to have an
impact on existing production workloads.

Runtime impact:
I measure up to a 10% speedup for block-only (i.e., CSSPGO/AutoFDO) inference
and up to a 50% speedup for block+jump inference (i.e., BOLT) in comparison to
the original unoptimized version.

Diff Detail

Event Timeline

spupyrev created this revision.Dec 12 2022, 11:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 12 2022, 11:58 AM
spupyrev requested review of this revision.Dec 12 2022, 11:58 AM
spupyrev edited the summary of this revision. (Show Details)Dec 12 2022, 11:59 AM
spupyrev updated this revision to Diff 482568.Dec 13 2022, 11:17 AM
spupyrev edited the summary of this revision. (Show Details)

upds

spupyrev edited the summary of this revision. (Show Details)Dec 16 2022, 9:07 AM
spupyrev updated this revision to Diff 483560.Dec 16 2022, 9:08 AM

fixing a test

spupyrev removed a subscriber: wenlei.
hoy added inline comments.Jan 9 2023, 4:51 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
1088

Does this still hold?

1120

Where is this handled in the new implementation?

1125

Previously, the cost of going from Bin to Bout was 2*AuxCostInc. Now, without Baux, the cost of the same path becomes AuxCostInc. Is this a discrepancy, or am I missing something?

1164

Add a comment to indicate the check is for fall-through branches?

1222

Please add a message for the assert and the one right below.

spupyrev marked 2 inline comments as done.Jan 11 2023, 9:44 AM
spupyrev added inline comments.
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
1088

The assert is moved to verifyInput.

1120

This is a good question; we no longer have this condition. While in theory there might be a difference, I do not see a single instance (in my benchmark with 10K functions) where this statement yields a different result. So it must be quite rare.
More importantly, I do not remember the original motivation: why is it "correct" not to penalize blocks with self-loops?

1125

Now all the costs are reduced by a factor of 2; that is, instead of 2*AuxCostInc we have AuxCostInc, and instead of 2*AuxCostDec we have AuxCostDec. This has no impact on the algorithm or the produced solution; only the objective is (uniformly) scaled.
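
The uniform-scaling claim can be checked on a toy instance (a hypothetical
Python sketch, unrelated to the actual MCF code; `best_flow` and its parameters
are made up for illustration):

```python
# A toy check that uniformly scaling every edge cost by the same
# constant changes only the objective value, not the cost-minimizing
# solution: route 5 units over two parallel edges of capacity 3, each
# with a per-unit cost, and pick the cheapest feasible split.

def best_flow(edge_costs, total=5, cap=3):
    candidates = [(a, total - a) for a in range(total + 1)
                  if a <= cap and 0 <= total - a <= cap]
    return min(candidates,
               key=lambda f: edge_costs[0] * f[0] + edge_costs[1] * f[1])

# Halving both costs (2*AuxCost -> AuxCost) yields the same optimal flow.
assert best_flow([2, 4]) == best_flow([1, 2]) == (3, 2)
```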

hoy added inline comments.Jan 11 2023, 1:54 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
1120

Oh, it might be related to a Skylake-specific hardware issue where an LBR entry can appear redundantly in consecutive records. Decreasing the backedge count may make sense in that case. Otherwise, I cannot think of why a self-loop would be special compared to normal loops.

The Skylake issue is worked around at profile generation time, so we should be free from it now.

hoy accepted this revision.Jan 11 2023, 2:19 PM
This revision is now accepted and ready to land.Jan 11 2023, 2:19 PM
This revision was automatically updated to reflect the committed changes.