

[BOLT][NFC] Speedup YAML profile processing

Authored by Amir on Tue, Sep 5, 6:54 PM.


Group Reviewers
Restricted Project
rG7b750943d722: [BOLT][NFC] Speedup YAML profile processing

Reduce YAML profile processing times:

  • preprocessProfile: speed up buildNameMaps by replacing the ProfileNameToProfile mapping with a ProfileFunctionNames set and a ProfileBFs vector. Look up the YamlBF->BF correspondence up front and memoize it in ProfileBFs.
  • readProfile: replace iteration over all functions in the binary with iteration over profile functions (strict match and LTO name match).
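
The two changes above can be illustrated with a minimal sketch using standard containers. This is not the code from the patch: `BinaryFunc`, `ProfileFunc`, `ProfileIndex`, `buildIndex` and `attachProfiles` are hypothetical stand-ins, and only the strict name match is shown (the LTO name match is left out). The names `ProfileFunctionNames` and `ProfileBFs` are borrowed from the summary purely for orientation.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a function in the binary.
struct BinaryFunc {
  std::vector<std::string> Names; // a function may have several symbol names
  bool HasProfile = false;
};

// Hypothetical stand-in for one function record in the YAML profile.
struct ProfileFunc {
  std::string Name;
};

struct ProfileIndex {
  // Names present in the profile, for cheap membership checks.
  std::unordered_set<std::string> ProfileFunctionNames;
  // ProfileBFs[I] is the binary function matched to Profile[I], or nullptr.
  std::vector<BinaryFunc *> ProfileBFs;
};

// Pre-processing: index binary functions by name once, then do a single
// lookup per profile function and memoize the result.
// Binary must not be resized while the returned index is in use.
ProfileIndex buildIndex(const std::vector<ProfileFunc> &Profile,
                        std::vector<BinaryFunc> &Binary) {
  std::unordered_map<std::string, BinaryFunc *> NameToBF;
  for (BinaryFunc &BF : Binary)
    for (const std::string &Name : BF.Names)
      NameToBF.emplace(Name, &BF);

  ProfileIndex Index;
  Index.ProfileBFs.reserve(Profile.size());
  for (const ProfileFunc &PF : Profile) {
    Index.ProfileFunctionNames.insert(PF.Name);
    auto It = NameToBF.find(PF.Name);
    Index.ProfileBFs.push_back(It == NameToBF.end() ? nullptr : It->second);
  }
  return Index;
}

// Processing: walk the memoized matches (one per profile function)
// instead of every function in the binary.
std::size_t attachProfiles(const ProfileIndex &Index) {
  std::size_t Matched = 0;
  for (BinaryFunc *BF : Index.ProfileBFs) {
    if (!BF)
      continue; // no strict name match for this profile function
    BF->HasProfile = true;
    ++Matched;
  }
  return Matched;
}
```

With 30k profile functions against 1.9M binary functions, the second loop visits only the 30k memoized entries, which is where the saving in the processing step would come from under these assumptions.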

On a large binary (1.9M functions) with a large YAML profile (121MB, 30k functions),
this reduces the runtime of the profile steps:
pre-process profile data: 12.4953s -> 10.7123s
process profile data: 9.8195s -> 5.6639s
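
For reference, the relative improvement can be derived from these timings with two small helpers. The function names are illustrative only and do not appear in the patch.

```cpp
// Ratio of the old runtime to the new one (values above 1 mean faster).
double speedup(double BeforeSec, double AfterSec) {
  return BeforeSec / AfterSec;
}

// Share of the old runtime that was eliminated, in percent.
double percentReduction(double BeforeSec, double AfterSec) {
  return 100.0 * (BeforeSec - AfterSec) / BeforeSec;
}
```

Applied to the numbers above, pre-processing drops by roughly 14% (about 1.17x) and processing by roughly 42% (about 1.73x).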

Compared to fdata profile reading:
pre-process profile data: 8.0268s
process profile data: 1.0265s
process profile data pre-CFG: 0.1644s


Event Timeline

Amir created this revision. · Tue, Sep 5, 6:54 PM
Herald added a reviewer: maksfb.
Herald added a project: Restricted Project.
Amir requested review of this revision. · Tue, Sep 5, 6:54 PM

Nice - good improvement. Do you know where the majority of the profile-processing time is spent with this change?


Is the change to always return true necessary?


This should always return non-null value.

Amir updated this revision to Diff 556493. · Mon, Sep 11, 3:27 PM
Amir marked an inline comment as done.

Address comments

maksfb accepted this revision. · Mon, Sep 11, 3:28 PM
This revision is now accepted and ready to land. · Mon, Sep 11, 3:28 PM
Amir marked an inline comment as done. · Mon, Sep 11, 3:28 PM
Amir added inline comments.

Not really, it was just to avoid curly braces in the matchProfile lambda. Reverted to returning void and added curly braces below.

Amir marked an inline comment as done. · Mon, Sep 11, 3:41 PM

> Nice - good improvement. Do you know where the majority of the profile-processing time is spent with this change?

Here's what perf report shows when narrowed down to YAMLProfileReader class:

Samples: 49M of event 'cycles', Event count (approx.): 19618

-  100.00%              llvm-bolt
   -  100.00%              llvm-bolt
      -   49.06%              [.] llvm::bolt::YAMLProfileReader::mayHaveProfileData
         +   18.37%              [.] llvm::StringMapImpl::FindKey
         +   15.31%              [.] llvm::bolt::YAMLProfileReader::mayHaveProfileDat
         +    6.72%              [.] llvm::bolt::BinaryFunction::forEachName<llvm::bo
         +    3.94%              [.] operator delete@plt
         +    2.66%              [.] llvm::bolt::RewriteInstance::selectFunctionsToPr
         +    2.06%              [.] llvm::bolt::getLTOCommonName
         +    0.01%              [k] asm_sysvec_apic_timer_interrupt
      +   25.63%              [.] llvm::bolt::YAMLProfileReader::parseFunctionProfile
      +   22.19%              [.] llvm::bolt::YAMLProfileReader::buildNameMaps
      +    1.56%              [.] llvm::bolt::YAMLProfileReader::readProfile
      +    0.68%              [.] llvm::bolt::YAMLProfileReader::matchProfileToFunction
      +    0.32%              [.] llvm::bolt::YAMLProfileReader::parseFunctionProfile(llvm::bolt::BinaryFunction&
      +    0.32%              [.] llvm::bolt::YAMLProfileReader::~YAMLProfileReader
      +    0.24%              [.] llvm::bolt::YAMLProfileReader::preprocessProfile

Quick analysis:

  • mayHaveProfileData is not called from profile {pre-,}processing. It's only called from RI::selectFunctionsToProcess.
  • parseFunctionProfile is invoked once per profile function and does the heavy lifting attaching the profile to CFG.
  • buildNameMaps loops over profile functions and binary functions, hence the cost. I tried to reduce this overhead as described in the summary.
  • readProfile actually reads YAML profile, but the bulk of overhead is in llvm::yaml code - see below.
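
The callees listed under mayHaveProfileData in the perf output (a string-map lookup, iteration over a function's names, getLTOCommonName) suggest a per-function name check. A rough sketch of such a check with standard containers follows. It is not the BOLT source: `ltoCommonName`, `mayHaveProfile`, the two set parameters and the marker list are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical helper: if Name contains an LTO-style marker, return the
// prefix up to and including that marker. The marker list is an assumption.
std::optional<std::string_view> ltoCommonName(std::string_view Name) {
  static constexpr std::string_view Markers[] = {".lto_priv.", ".constprop.",
                                                 ".llvm."};
  for (std::string_view Marker : Markers) {
    std::size_t Pos = Name.find(Marker);
    if (Pos != std::string_view::npos)
      return Name.substr(0, Pos + Marker.size());
  }
  return std::nullopt;
}

// Returns true if any of the function's names appears in the profile,
// either verbatim (strict match) or via its LTO common name.
bool mayHaveProfile(
    const std::vector<std::string> &Names,
    const std::unordered_set<std::string> &ProfileNames,
    const std::unordered_set<std::string> &ProfileLTOCommonNames) {
  // Strict match first: one hash lookup per name.
  for (const std::string &Name : Names)
    if (ProfileNames.count(Name))
      return true;
  // Fall back to the LTO common name, if the name has one.
  for (const std::string &Name : Names)
    if (std::optional<std::string_view> Common = ltoCommonName(Name))
      if (ProfileLTOCommonNames.count(std::string(*Common)))
        return true;
  return false;
}
```

A check of this shape runs once per function in the binary, so with 1.9M functions even cheap hash lookups would add up.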

llvm::yaml methods:

Samples: 49M of event 'cycles', Event count (approx.): 298169

-  100.00%              llvm-bolt
   -  100.00%              llvm-bolt
      +   15.42%              [.] llvm::yaml::Scanner::peekNext
      +    8.81%              [.] llvm::yaml::Scanner::scanPlainScalar
      +    8.18%              [.] llvm::yaml::Scanner::removeStaleSimpleKeyCandidates
      +    7.88%              [.] llvm::yaml::Scanner::fetchMoreTokens
      +    4.94%              [.] llvm::yaml::Document::parseBlockNode
      +    4.86%              [.] llvm::yaml::Scanner::getNext
      +    4.53%              [.] llvm::yaml::Scanner::scanToNextToken
      +    4.49%              [.] llvm::yaml::Input::createHNodes

As you can see, createHNodes is no longer the most expensive part. I don't see any easy optimization opportunities here.

This revision was landed with ongoing or failed builds. · Mon, Sep 11, 4:08 PM
This revision was automatically updated to reflect the committed changes.