We want to use profile inference (profi) in BOLT for stale profile matching.
This is the second change for existing usages of profi (e.g., CSSPGO):
(i) Added the ability to provide (estimated) jump weights for the algorithm. The
goal of the algorithm is to create a valid control flow for a given function
(that is, one in which incoming counts equal outgoing counts for every basic
block) while minimally modifying the original input block and jump weights. The
input jump weights will be provided based on collected LBR profiles in BOLT.
(ii) Added the corresponding options to ProfiParams.
(iii) Slightly simplified the construction of the flow network in profi so that
it uses fewer auxiliary nodes. This is done by introducing parallel edges to
the network (which is supported by MCF) and reduces the size of the network
from 3*|V| to 2*|V| nodes, where |V| is the number of basic blocks in the
function.
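For reference, the validity condition from (i) can be sketched as a small
standalone check. This is only an illustration under simplifying assumptions:
the Jump struct and the isValidFlow helper are hypothetical names, not part of
profi or LLVM, and the sketch assumes a single entry block and treats blocks
without outgoing jumps as exits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified jump record (not the LLVM data structure).
struct Jump {
  size_t Source;
  size_t Target;
  uint64_t Flow;
};

// Returns true if the block and jump counts form a valid flow: for every
// non-entry block the incoming jump counts sum to the block count, and for
// every block with successors the outgoing jump counts sum to the block count.
bool isValidFlow(const std::vector<uint64_t> &BlockFlow,
                 const std::vector<Jump> &Jumps, size_t Entry) {
  const size_t N = BlockFlow.size();
  std::vector<uint64_t> In(N, 0), Out(N, 0);
  std::vector<size_t> NumSuccs(N, 0);
  for (const Jump &J : Jumps) {
    assert(J.Source < N && J.Target < N && "jump refers to unknown block");
    Out[J.Source] += J.Flow;
    In[J.Target] += J.Flow;
    ++NumSuccs[J.Source];
  }
  for (size_t I = 0; I < N; ++I) {
    // The entry block receives its count from outside the function.
    if (I != Entry && In[I] != BlockFlow[I])
      return false;
    // Exit blocks (no successors) send their count out of the function.
    if (NumSuccs[I] > 0 && Out[I] != BlockFlow[I])
      return false;
  }
  return true;
}
```

On a diamond-shaped CFG with block counts {100, 60, 40, 100} and jump counts
0->1: 60, 0->2: 40, 1->3: 60, 2->3: 40, the check passes; lowering the 2->3
count to 30 makes it fail, since block 3 would then receive only 90.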
Inference (profile quality) impact:
The diff is supposed to be a no-op for the inferred counts. However, our
implementation of MCF is not fully deterministic and might return different 
results depending on the input network model. Since we changed the model 
construction, there are a few differences in comparison to the original 
implementation. I checked manually on an internal benchmark and saw a minor
difference (+/- 1 count for certain basic blocks) in only a dozen instances
(out of 10000+ input functions). Hence, the diff is highly unlikely to have an
impact on existing prod workloads.
Runtime impact:
I measured up to 10% speedup for block-only (i.e., CSSPGO/AutoFDO) inference and
up to 50% speedup for block+jump inference (i.e., BOLT) in comparison to the
original unoptimized version.
Does this still hold?