Page MenuHomePhabricator

[CSSPGO][llvm-profgen] Reimplement CS profile generator using context trie
ClosedPublic

Authored by wlei on May 9 2022, 10:39 AM.

Details

Summary

Our investigation showed ProfileMap's key is the bottleneck of the memory consumption for CS profile generation on some large services. This patch tries to optimize it by storing the CS function samples using the context trie tree structure instead of the context frame array ref. Parts of code in ContextTrieNode are reused.

Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB.

To be compatible with non-CS profiles, the profile writer still needs to use ProfileMap as input, so rebuild the ProfileMap using the context trie in postProcessProfiles.

The optimization is not complete yet, next step is to reimplement Pre-inliner or profile trimmer, after that, ProfileMap should be small to be written.

Diff Detail

Unit TestsFailed

TimeTest
60,040 msx64 debian > MLIR.Examples/standalone::test.toy
Script: -- : 'RUN: at line 1'; /usr/bin/cmake /var/lib/buildkite-agent/builds/llvm-project/mlir/examples/standalone -G "Ninja" -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_COMPILER=/usr/bin/clang -DLLVM_ENABLE_LIBCXX=OFF -DMLIR_DIR=/var/lib/buildkite-agent/builds/llvm-project/build/lib/cmake/mlir ; /usr/bin/cmake --build . --target check-standalone | tee /var/lib/buildkite-agent/builds/llvm-project/build/tools/mlir/test/Examples/standalone/Output/test.toy.tmp | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/mlir/test/Examples/standalone/test.toy

Event Timeline

wlei created this revision.May 9 2022, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 10:39 AM
Herald added subscribers: hoy, wenlei. · View Herald Transcript
wlei requested review of this revision.May 9 2022, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 10:39 AM
wlei edited the summary of this revision. (Show Details)May 9 2022, 10:56 AM
wlei added reviewers: hoy, wenlei.
hoy added a comment.May 9 2022, 11:23 AM

Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB.

This is awesome! Thanks for the work!

How about running time?

wlei added a comment.May 9 2022, 8:54 PM

Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB.

This is awesome! Thanks for the work!

How about running time?

Just tested the trie based is faster than the old one, E2E running time : 103mins vs 122mins.

The reason for the speed-up should be due to the time spent on getting caller's FunctionSamples. For the new version, it's just call getParentContext() which is O(1). The old version it generate the caller's context frames then call getFunctionProfileForContext(...), which is O(n), n is the context length.

hoy added a comment.May 10 2022, 3:39 PM

Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB.

This is awesome! Thanks for the work!

How about running time?

Just tested the trie based is faster than the old one, E2E running time : 103mins vs 122mins.

The reason for the speed-up should be due to the time spent on getting caller's FunctionSamples. For the new version, it's just call getParentContext() which is O(1). The old version it generate the caller's context frames then call getFunctionProfileForContext(...), which is O(n), n is the context length.

Very nice. BTW, looks like this is done to line number based profile generation only. How about pseudo probes? Am I missing anything?

hoy added a comment.May 10 2022, 3:43 PM

Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB.

This is awesome! Thanks for the work!

How about running time?

Just tested the trie based is faster than the old one, E2E running time : 103mins vs 122mins.

The reason for the speed-up should be due to the time spent on getting caller's FunctionSamples. For the new version, it's just call getParentContext() which is O(1). The old version it generate the caller's context frames then call getFunctionProfileForContext(...), which is O(n), n is the context length.

Very nice. BTW, looks like this is done to line number based profile generation only. How about pseudo probes? Am I missing anything?

NVM, I didn't notice the changes for pseudo probes.

hoy added inline comments.May 15 2022, 11:43 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
894

PLease add more comments about why the processing order matters.

917

Update the comment "But we can't modify ProfileMap while iterating it."?

On the other hand, since we are doing bottom-up processing, we should have no such limitation now?

933–944

nit: perhaps just name this function buildProfileMap since we are using trie by default.

934

use reference type instead of pointer type for Node to explicitly indicate that this is a non-null ptr?

945

Can FProfile be updated in place to avoid copy and deletion?

968

I guess eventually this call will be after the preiliner run, but can this be moved after computeSummaryAndThreshold now?

968

nit: seems that Context is not used anywhere. Perhaps make a wrapper function void buildProfileMapFromContextTrie( ContextTrieNode *Node) and declare Context inside?

llvm/tools/llvm-profgen/ProfileGenerator.h
304

nit: getOrCreateContextNodeForContext

wlei updated this revision to Diff 430452.May 18 2022, 10:54 AM

Addressing feedback.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
894

Comments updated.

917

Good point! Now we can just create the new function samples for caller, actually that's already there see the later code creating this.

FunctionSamples &CallerProfile = *getOrCreateFunctionSamples(CallerNode);

933–944

Sounds good!

934

Sounds good!

945

the profile in ProfileMap require the full context as the key, so we can't create the value in ProfileMap in the early time. Or perhaps we can use the move semantics, but FunctionSamples currently doesn't have the move constructor, which will make things complicated. I think after the trimming and pre-inliner, the num of profile will be small, copy and deletion should be fine then?

968

Sounds good!

968

I guess eventually this call will be after the preiliner run, but can this be moved after computeSummaryAndThreshold now?

Yeah, we need to move it after preinliner and moreover it should be after SampleContextTrimmer.

computeSummaryAndThreshold is still using ProfileMap and I also found that ProfileMap can also be imported from the SampleReader. Both of them will needs to be rewritten. Maybe leave them a separate diff, just to be easy for the code review, what do you think?

llvm/tools/llvm-profgen/ProfileGenerator.h
304

Fixed!

hoy added inline comments.May 19 2022, 11:24 AM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
945

Makes sense. Eventually the trie should be small after preinliner.

You may want to set Node.getFunctionSamples() to be null after deleting it.

968

computeSummaryAndThreshold is still using ProfileMap

I'm wondering if we can make a version of computeSummaryAndThreshold to run on the trie to avoid generate a giant ProfileMap with all contexts in the middle.

SampleReader is a special path and we don't care about the performance there. It can be treated differently.

Yeah, a separate diff is fine.

wlei updated this revision to Diff 434102.Jun 3 2022, 12:09 PM

rebase
using context tracker directly

wlei added inline comments.Jun 3 2022, 12:09 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
945

Set to null, thanks!

wenlei added inline comments.Jun 6 2022, 8:39 AM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
894

parent(inliner)'s sample depends on child(inlinee)'s sample, so traverse the tree in post-order.

I thought order was not important because we use head samples to update body samples. But it looks like we use entry samples which could be first body samples.

So this was a bug in previous implementation?

946

Maybe it isn't too complicated to add move ctor for FunctionSamples? We're taking care of SampleContext here, and the rest are two std::map. But we can deal with it in separate change.

Relatedly, this feels more like converting trie to map, rather than building a map from trie, because the process also "destroy" the trie. Who owns FProfile of the trie? While delete it on the fly saves a bit of mem, it also feels a bit weird for it to be deleted here as iiuc the trie may not own the underlying function samples?

wlei added inline comments.Jun 6 2022, 4:25 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
718

The FunctionSamples is new here.

894

Right, we use the callee's body sample to update caller's body sample.

It is a bug in previous implementation, but it should not be critical one, as you said, it's only related with the first body samples.

946

Maybe it isn't too complicated to add move ctor for FunctionSamples? We're taking care of SampleContext here, and the rest are two std::map. But we can deal with it in separate change.

OK, I just realized that we can use the default implicit move ctor of FunctionSamples, it can automatically move the two std::map. Then I think it should work here to add the std::move(..) to avoid copy.

Who owns FProfile of the trie?

It's a little bit tricky.

  1. It can be from the ProfileMap(using --llvm-sample-profile as input) or sample reader from compiler side. we can't delete the profile owned by ProfileMap.
  1. for profile generation, it's created on on-the-fly on the trie.(see ProfileGenerator.cpp:718). That's why I delete this profile while "destroying" trie.

I guess you meant the memory saving of the on-the-fly deletion is not noticeable(I agree that the memory after pre-inliner should be small), if so, I will separate this into two parts: 1) build map and 2) free the memory(in the future patch).

wlei updated this revision to Diff 434640.Jun 6 2022, 4:25 PM

use std::move to avoid memory copy
remove the "memory free" code, will add in a separate patch.

hoy accepted this revision.Jun 10 2022, 10:54 AM

LGTM.

This revision is now accepted and ready to land.Jun 10 2022, 10:54 AM
wlei updated this revision to Diff 437025.Jun 14 2022, 8:55 PM

Put all FSamples in a std::list<FunctionSamples> to avoid memory new and delete.

Thanks for the change -- good work. I can see how it improves performance of llvm-profgen. A few minor comments.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
718

FSamplesList.emplace_back(FunctionSamples());
->
FSamplesList.emplace_back();

emplace_back should be able to take ctor parameters directly, which avoids a copy/move.

916–931

nit: we can simplify this by having a single populateInferredFunctionSamples function with default argument point to root context node.

946

Thanks for changing it to use move. With that the original trie still gets destroyed because it no longer holds valid FunctionSamples. So perhaps we should name this function "convert" instead of "build"?

And since this is stateful, add an assertion/check to verify? like assert that ProfileMap is empty before we start the conversion, and the trie is still valid (holds valid function samples).

llvm/tools/llvm-profgen/ProfileGenerator.h
304

nit: might want to unify the naming convention, getOrCreateFunctionSamples doesn't have ForXYZ. So getOrCreateContextNode?

wlei updated this revision to Diff 439231.Jun 22 2022, 6:18 PM

addressing Wenlei's feedback

wenlei added inline comments.Jun 22 2022, 10:09 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
954

nit: convertProfileMap -> convertToProfileMap

971–972

what makes function samples empty during conversion?

980–981

I was thinking about checking to make sure things are valid before conversion. Checking after conversion doesn't prevent double conversion where the 2nd conversion would be wrong. We could also have a simple state boolean to represent the statefulness and use that for checks.

wlei added inline comments.Jun 22 2022, 10:18 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
971–972

IMO, The std::move will move the BodySamples from the trie node's FSamples to the ProfileMap's FSamples. This is to check after the conversion no FSamples left not moved.

980–981

By a valid state, do you mean to check that the trie node's getAllChildren is not empty?

wlei updated this revision to Diff 439265.Jun 22 2022, 10:28 PM

Updating D125246: [CSSPGO][llvm-profgen] Reimplement CS profile generator using context trie

wenlei added inline comments.Jun 22 2022, 10:29 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
971–972

The move-from object are in unspecified state according to standard, and we'd want to avoid accessing move-from objects.

980–981

I was thinking of valid state as an abstract term -- since the conversion destroy the original profile on trie through move etc, after conversion the function samples on the trie is no longer valid. We could simply have a boolean isProfileValidOnTrie as a state, set it to false inside convert function, and also check it is true at the beginning of conversion..

wlei updated this revision to Diff 439266.Jun 22 2022, 10:49 PM

add a valid state check for the trie conversion.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
980–981

Ah, I see, thanks for the clarification!

wenlei accepted this revision.Jun 22 2022, 10:51 PM

lgtm, thanks.