This is an archive of the discontinued LLVM Phabricator instance.

[InstrProf] Use BalancedPartitioning to order temporal profiling trace data
ClosedPublic

Authored by ellis on Apr 7 2023, 2:02 PM.

Details

Summary

In [0] we described an algorithm called BalancedPartitioning (bp) to consume function traces [1] and compute a function order that reduces the number of page faults during startup.

This patch adds the order command to the llvm-profdata tool which uses bp to output a function order that can be passed to the linker via --symbol-ordering-file=.

Special thanks to Sergey Pupyrev and Julian Mestre for designing this balanced partitioning algorithm.

[0] https://discourse.llvm.org/t/rfc-temporal-profiling-extension-for-irpgo/68068
[1] https://reviews.llvm.org/D147287

Diff Detail

Event Timeline

ellis created this revision.Apr 7 2023, 2:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2023, 2:02 PM
ellis retitled this revision from [InstrProf] Use BalancedPartitioning to order trace data to [InstrProf] Use BalancedPartitioning to order temporal profiling trace data.Apr 11 2023, 8:36 AM
ellis published this revision for review.Apr 11 2023, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2023, 9:17 AM

A few high level questions:

Different types of traces don't have the same frequency, so it might be useful to support weighting. The frequency certain trace pattern appear in the profile data does not necessarily match to their frequency in real world usage. To support this, some kind of symbolic id may be needed to annotate the trace data.

The cost metric is entropy like. Have you tried other metrics such as gini impurity?

llvm/lib/Support/BalancedPartitioning.cpp
202

why 4x larger? Is the number of utility nodes the same as 'the number of traces * number of cutoffs" ?

llvm/tools/llvm-profdata/llvm-profdata.cpp
3042

Need to document it in commandline guide.

3050

How are the default values selected?

ellis added a subscriber: jmestre.Apr 11 2023, 2:47 PM

A few high level questions:

Different types of traces don't have the same frequency, so it might be useful to support weighting. The frequency certain trace pattern appear in the profile data does not necessarily match to their frequency in real world usage. To support this, some kind of symbolic id may be needed to annotate the trace data.

I see what you are saying. We could have a set of raw profiles collected under type "A" conditions and another set under type "B" conditions. Maybe type "A" is common while type "B" is more rare so we'd like to weight "A"s traces more than "B"s.

This could be implemented with an extra "weight" field in the trace data for each trace. It hasn't been long since I landed https://reviews.llvm.org/D147287. Do you think it makes sense to land a patch to add this field without updating the version and supporting two trace formats?

The other option is to have the llvm-profdata merge command duplicate traces with extra weight. Of course this reduces the data density so we couldn't store as many profiles.

The cost metric is entropy like. Have you tried other metrics such as gini impurity?

@spupyrev @jmestre Do you have any thoughts on this?

llvm/lib/Support/BalancedPartitioning.cpp
202

The Cutoffs variable is only used with bp when ordering functions from traces. We can also use bp to order functions to improve compression by placing similar function close together. For that case we have another way to assign utility nodes to function nodes, so the number of signatures will be different.

llvm/tools/llvm-profdata/llvm-profdata.cpp
3050

These values were also empirically found by optimizing a large binary. I was debating leaving out a default value since these values are pretty specific to the binary we tested. Now that I think about it, we can try to derive a similar list by looking at the total number of functions and assign cutoffs linearly.

A few high level questions:

Different types of traces don't have the same frequency, so it might be useful to support weighting. The frequency certain trace pattern appear in the profile data does not necessarily match to their frequency in real world usage. To support this, some kind of symbolic id may be needed to annotate the trace data.

I see what you are saying. We could have a set of raw profiles collected under type "A" conditions and another set under type "B" conditions. Maybe type "A" is common while type "B" is more rare so we'd like to weight "A"s traces more than "B"s.

This could be implemented with an extra "weight" field in the trace data for each trace. It hasn't been long since I landed https://reviews.llvm.org/D147287. Do you think it makes sense to land a patch to add this field without updating the version and supporting two trace formats?

It should be fine -- since there is no ordering tool/support (like this patch does) that enable folks to use the feature yet.

The other option is to have the llvm-profdata merge command duplicate traces with extra weight. Of course this reduces the data density so we couldn't store as many profiles.

The cost metric is entropy like. Have you tried other metrics such as gini impurity?

@spupyrev @jmestre Do you have any thoughts on this?

llvm/tools/llvm-profdata/llvm-profdata.cpp
3050

or add some empirical guidance in the description.

ellis updated this revision to Diff 512598.Apr 11 2023, 3:10 PM

Add commandline docs

ellis updated this revision to Diff 513382.Apr 13 2023, 4:28 PM

Rebase and remove utility nodes if they have edges to only one function or all the functions.

ellis added a comment.Apr 13 2023, 4:30 PM

A few high level questions:

Different types of traces don't have the same frequency, so it might be useful to support weighting. The frequency certain trace pattern appear in the profile data does not necessarily match to their frequency in real world usage. To support this, some kind of symbolic id may be needed to annotate the trace data.

I see what you are saying. We could have a set of raw profiles collected under type "A" conditions and another set under type "B" conditions. Maybe type "A" is common while type "B" is more rare so we'd like to weight "A"s traces more than "B"s.

This could be implemented with an extra "weight" field in the trace data for each trace. It hasn't been long since I landed https://reviews.llvm.org/D147287. Do you think it makes sense to land a patch to add this field without updating the version and supporting two trace formats?

It should be fine -- since there is no ordering tool/support (like this patch does) that enable folks to use the feature yet.

The other option is to have the llvm-profdata merge command duplicate traces with extra weight. Of course this reduces the data density so we couldn't store as many profiles.

The cost metric is entropy like. Have you tried other metrics such as gini impurity?

@spupyrev @jmestre Do you have any thoughts on this?

I've added https://reviews.llvm.org/rG4bddef4117403a305727d145a9abf6bda700f8ff but I'm not using the weights yet. I haven't decided the best way to use them yet, and I haven't decided if I'll do that here or in a followup diff.

The cost metric is entropy like. Have you tried other metrics such as gini impurity?

For context, we're using (a variant of) the same reordering optimization across several applications internally for several years now. We did extensively test and engineer the implementation, and I am so happy to see it finally open sourced! There are a couple of research papers describing the technique with details and evaluation, e.g., (1) or (2), including the evaluation of various objectives.

We did try several alternatives for the objective and the log-gap (the "entropy") seems to be the best choice across the majority of use cases, including the start-up function layout. For the dataset I have at hand, the Gini impurity behaves slightly worse than the current log-gap one. I'd like also to clarify that we're speaking about second order effects here. That is, the difference between "no-reordered layout" vs "log-gap reordered one" is an order (or two) of magnitude larger than the difference between "log-gap" and "gini".

spupyrev added inline comments.Apr 14 2023, 12:22 PM
llvm/include/llvm/Support/BalancedPartitioning.h
89

I thought we use 18 by default?

llvm/lib/Support/BalancedPartitioning.cpp
86

Would be great to unify the implementation with the existing ThreadPool, but I don't have a suggestion on how exactly to implement that. I'm fine with keeping it as is, if there are no alternative suggestions.

215

Do we want to get rid of the map? (and the weird 4x initialization)

367

I think we've got rid of this method

llvm/tools/llvm-profdata/llvm-profdata.cpp
3050

The assumption here is that most of the traces are of size ~32K. Perhaps we could also add a comment and a clarification to re-consider the values, if the traces are of significantly different sizes?

ellis updated this revision to Diff 513921.Apr 15 2023, 9:08 AM
ellis marked an inline comment as done.

Remove cutoff values and refactor unittest

ellis planned changes to this revision.Apr 15 2023, 9:08 AM
ellis marked 3 inline comments as done.
ellis updated this revision to Diff 514438.Apr 17 2023, 3:34 PM
ellis marked 4 inline comments as done.

Make moveGain() much faster

ellis added a comment.Apr 17 2023, 3:34 PM

I've ran some performance analysis by running the BalancedPartitioningTest.Large test with different sizes.

GTEST_FILTER="BalancedPartitioningTest.Large" perf record --call-graph lbr -- build/unittests/Support/SupportTests

I found that most of the time is spent in moveGain(), so I made several changes.

  1. Switch from double to float.
  2. Use std::accumulate() instead of a for loop to compute gain.
  3. Pull out constant FromLeftToRight from loop.
  4. Use a vector instead of a map for Signatures.
  5. Separate CachedCost into two floats and a bool since moveGain() assumes the cache is valid.
llvm/lib/Support/BalancedPartitioning.cpp
215

I've removed the map because I found that it is more efficient to renumber these utility nodes so these signatures can be an array indexed by the utility node.

367

I've deleted computeGoal().

spupyrev added inline comments.May 2 2023, 7:17 AM
llvm/lib/Support/BalancedPartitioning.cpp
148
201

How much overhead does DenseMap have in comparison with std::vector? I assume one could first find the maximum utility index and then use a vector instead of a map. Of course, that should only be done if there is visible speedup.

213

maybe add UtilityNodeIndex.reserve(NumUtilities)
(and use the variable for signature initialization too)

258–265
363

should it be Gain + Signatures[UN].CachedCostLR instead?

Assuming I'm correct and this is a bug, can we add a test to catch the problem?

367

same here

ellis updated this revision to Diff 518861.May 2 2023, 2:24 PM
ellis marked 4 inline comments as done.

Fix bug in moveGain

llvm/lib/Support/BalancedPartitioning.cpp
201

From looking at perf record it seems like very little time is spent with UtilityNodeDegree. And since I'm renumbering the utility nodes below, we can't index by utilities at this point, so I think it's best to leave it as it is.

213

We don't actually know how many utilities there are at this point because there could be duplicates.

363

Thanks for the catch! I've added a simple unittest which would have caught this. I guess this explains why I saw a perf improvement. We are still spending the most time (~30%) in moveGain(), but I've switched back to a simpler implementation because I'm not seeing any gains from using std::accumulate().

Looks good to me, thanks! I'd wait a bit to let others provide their feedback

snehasish added inline comments.May 2 2023, 2:53 PM
llvm/include/llvm/Support/BalancedPartitioning.h
43

[optional] It seems a bit odd to have an include from llvm/ProfileData -> llvm/Support. Can we forward decl TemporalTraceTy and keep this Support header independent? I think that's the only thing we need from InstrProf in this header but it might require changing the fromTemporalProfTraces API a little bit.

Just a flyby comment, I'm not planning on reviewing the patch in detail since others already spent time looking into it.

ellis updated this revision to Diff 519560.May 4 2023, 10:31 AM

Move BPFunctionNode::fromTemporalProfTraces() -> TemporalProfTraceTy::createBPFunctionNodes()

llvm/include/llvm/Support/BalancedPartitioning.h
43

Thanks for the feedback! I've moved BPFunctionNode::fromTemporalProfTraces() -> TemporalProfTraceTy::createBPFunctionNodes() and I think that fits better since the trace data is now responsible for knowing how to construct utility nodes. Now Support/BalancedPartitioning is independent of InstrProf, but BalancedPartitioningTest.cpp does now need to link ProfileData.

I also changed the API to use ArrayRef since it doesn't need to modify Traces.

ellis added a comment.May 22 2023, 8:50 AM

Does anyone have any more concerns?

kyulee added a subscriber: kyulee.May 25 2023, 10:28 PM
kyulee added inline comments.
llvm/include/llvm/Support/BalancedPartitioning.h
27

Not sure where you got, but it seems slightly different than the one in the linked paper.

36

Not blocking, but I wonder if you want to add a link to LCTES 2023 when the proceeding is available soon?

llvm/lib/Support/BalancedPartitioning.cpp
179

I'm not sure what this code does. I might miss something.

317

Given logCost is a negative value which we want to minimize for an objective, my read on CachedCostLR is actually a cost saving (or gain) when N is moved from Left to Right. So, wouldn't it make sense naming it CachedCostSavingLR or CachedGainLR so that accumulating them to Gain sounds natural?

ellis updated this revision to Diff 526113.May 26 2023, 10:20 AM

Rename CachedCost -> CachedGain and refactor runIteration() a bit

llvm/include/llvm/Support/BalancedPartitioning.h
27

Yes, the paper reports a runtime of O(m log n + n log2 n). However O(m log2 n) should be slightly worse and much easier to understand, so I think this is good enough 🙂

36

Since the paper we submitted to LCTES 2023 is slightly shorter I think it makes sense to keep this arxiv link which contains our whole paper.

llvm/lib/Support/BalancedPartitioning.cpp
179

The contents of BPFunctionNode::UtilityNodes could originally contain arbitrary values. To make matters worse, each bisect step could shuffle these nodes into different sections.

Here we are computing UtilityNodeIndex to be a map from UtilityNodeT to an int in range [0,N) where N is the number of unique utility nodes. Then we can update each utility node so they are in range [0,N). Then we can make SignaturesT a normal vector indexed by utility nodes for performance.

317

I've renamed it to CachedGainLR which makes sense because we want to maximize it. Thanks for the suggestion!

spupyrev accepted this revision.Jun 5 2023, 8:36 AM

LGTM

This revision is now accepted and ready to land.Jun 5 2023, 8:36 AM
thakis added a subscriber: thakis.Jun 6 2023, 4:59 PM
thakis added inline comments.
llvm/unittests/Support/CMakeLists.txt
2

This increases the number of files needed to compile SupportTests by 200% (from ~200 to ~600). Is there a way to prevent this dep with some mocking or similar?

ellis added inline comments.Jun 6 2023, 6:08 PM
llvm/unittests/Support/CMakeLists.txt
2

Thanks for flagging! In https://reviews.llvm.org/D152325 I've moved one of theses tests to unittests/ProfileData so we can remove the dependency.

luporl added a subscriber: luporl.Jun 7 2023, 7:29 AM

The show-order.proftext test never finishes on armhf, causing a timeout in ninja check-all and a failure in this build bot: https://lab.llvm.org/buildbot/#/builders/178

ellis added a comment.Jun 7 2023, 10:43 AM

The show-order.proftext test never finishes on armhf, causing a timeout in ninja check-all and a failure in this build bot: https://lab.llvm.org/buildbot/#/builders/178

Looking now. I suspect this is related to the fact that threads are disabled (-DLLVM_ENABLE_THREADS=OFF). I don't see any mention of the show-order.proftext test in the job. Have you reproduced locally?

The show-order.proftext test never finishes on armhf, causing a timeout in ninja check-all and a failure in this build bot: https://lab.llvm.org/buildbot/#/builders/178

Looking now. I suspect this is related to the fact that threads are disabled (-DLLVM_ENABLE_THREADS=OFF). I don't see any mention of the show-order.proftext test in the job. Have you reproduced locally?

Sorry, I saw your message after having just disabled the tests on ARM, to get the build bot working again. But we can revert it as soon as they are fixed.

I have been able to reproduce it locally, both show-order.proftext and BalancedPartitioningTest just hang and never finish. They don't consume any significant amount of CPU or memory. It seems like they are waiting for something.

I guess the tests weren't printed in the bot logs because they haven't passed or failed, they were still being executed when the timeout killed them.

ellis added a comment.Jun 7 2023, 10:55 AM

The show-order.proftext test never finishes on armhf, causing a timeout in ninja check-all and a failure in this build bot: https://lab.llvm.org/buildbot/#/builders/178

Looking now. I suspect this is related to the fact that threads are disabled (-DLLVM_ENABLE_THREADS=OFF). I don't see any mention of the show-order.proftext test in the job. Have you reproduced locally?

Sorry, I saw your message after having just disabled the tests on ARM, to get the build bot working again. But we can revert it as soon as they are fixed.

I have been able to reproduce it locally, both show-order.proftext and BalancedPartitioningTest just hang and never finish. They don't consume any significant amount of CPU or memory. It seems like they are waiting for something.

I guess the tests weren't printed in the bot logs because they haven't passed or failed, they were still being executed when the timeout killed them.

No problem at all! Thanks for fixing the bots for me. I think I just reproduced locally (simply by adding -DLLVM_ENABLE_THREADS=OFF to the cmake command) so I should be able to post a fix soon.