One performance issue happened in profile generation and it turned out the line 525 loop is the bottleneck.
Moving the code outside of loop scope can fix this issue. The run time is improved from 30+mins to ~30s.
Details
- Reviewers
hoy wenlei wmi - Commits
- rGa8a38ef3d99c: [llvm-profgen] Fix bug of loop scope mismatch
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | I'm wondering here if FunctionSamples * should be Vector<FunctionSamples *> as same frame can point to different FunctionSamples? |
Yes, it's today's slowing down and regression(it should also improve other cases), I reproduced @spupyrev 's script and saw 99% of the time spent is in line 525 loop and the loop size is 100,000+.
Good catch, thanks for the fix!
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | An inline context under a given calling context should uniquely identify an inline frame, therefore can only point to one leaf profile. Am I missing anything? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | Sorry I meant std::set<FunctionSamples *>. I'm thinking the multiple probes cases from different call stack and they have the same leaf probe. like Here Probe1 will have two FunctionSamples. |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | Oh. Here in populateBodySamplesWithProbes has already guaranteed to have unique call stack, so no multiple probe cases. Sorry for the confusing. |
Hmm, I compared this patch against previous output, it's not identical. I still need some time digging it before land this.
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | @hoy I found that the probe->getInlineTreeNode() is not the leaf frame so here it's not unique, the unique one should be like "probe + leaf function" and the leaf function is given from its FuncDesc. It happened that there are different the leaf functions. See the reference below, "Note that the context from probe doesn't include leaf frame, hence we need to retrieve and prepend leaf if requested.", when we extract the inline context, we still need to append the leaf frame. void MCPseudoProbeDecoder::getInlineContextForProbe( const MCDecodedPseudoProbe *Probe, SmallVectorImpl<std::string> &InlineContextStack, bool IncludeLeaf) const { Probe->getInlineContext(InlineContextStack, GUID2FuncDescMap, true); if (!IncludeLeaf) return; // Note that the context from probe doesn't include leaf frame, // hence we need to retrieve and prepend leaf if requested. const auto *FuncDesc = getFuncDescForGUID(Probe->getGuid()); InlineContextStack.emplace_back(FuncDesc->FuncName + ":" + Twine(Probe->getIndex()).str()); } So I changed to use unordered_set for multiple leaf frame FunctionSample. See if this looks good to you? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | I see, a callsite can lead to different leaf frames, if it is an indirect callsite. That's a good find, thanks! Yeah, using unordered_set sounds good to me. Does it close the gap you saw with the loop scope change? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | Yes, Now the output of this change is identical to the one without this change while building the clang script |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | That's nice. Wondering what could cause same inline context to point to different frames. For the clang pass1 build, ICP shouldn't be triggered? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | My understanding is the same frame can point to different inline contexts, same inline context should point to same frame. e.g. probe1(GUID:foo) and probe2(GUID:bar) can point to the same frame(inlinetree) and when the inlinetree's context is extracted like main @ goo and there full context will be |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | In your example, does foo and bar share the same callsite in goo? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | Yes, that's why here changed to unordered_set, and as you mentioned, foo and bar is indirect call |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | I was wondering for clang pass1 build, how was the indirect call promoted. We don't use profile for pass1 thus ICP should not be triggered? |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | I guess two probes with inline contexts can be merged, and the merged inline context is trimmed to empty of a common part of the inputs. Anyway, that's unrelated to this patch. Using an unordered_set sounds good to me. |
llvm/tools/llvm-profgen/ProfileGenerator.cpp | ||
---|---|---|
523 | Thanks for the discussion! That's interesting question, I'm trying to find something in the clang asm, but it's too large to debug. I will land this first and later try to run on SPEC to see if something easy to catch. |
I'm wondering here if FunctionSamples * should be Vector<FunctionSamples *> as same frame can point to different FunctionSamples?