This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Unblock optimizations with pseudo probe instrumentation part 3.
ClosedPublic

Authored by hoy on Sep 30 2021, 9:25 AM.

Details

Summary

This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.

Not exactly like DbgIntrinsics, PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:

  • Flipped default param of getFirstNonPHIOrDbg API to skip pseudo probes
  • Unblocked CSE by avoiding pseudo probe from clobbering memory SSA
  • Unblocked induction variable simpliciation
  • Allow empty loop deletion by treating probe intrinsic isDroppable
  • Some refactoring.

Diff Detail

Event Timeline

hoy created this revision.Sep 30 2021, 9:25 AM
hoy requested review of this revision.Sep 30 2021, 9:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2021, 9:25 AM

Thanks for the changes to reduce probe overhead. As we chatted off patch, some information and verification of this change's impact on profile quality would be helpful.

hoy added a comment.Oct 11 2021, 9:25 AM

Thanks for the changes to reduce probe overhead. As we chatted off patch, some information and verification of this change's impact on profile quality would be helpful.

The original patch contained several general changes, e.g, flipping the default values of those skipping APIs, which were observed to affect profile quality. After narrowing down the offending changes that are in SimplifyCFG, reverting them brought back the profile quality while still reduced the probe overhead in a decent amount.

hoy updated this revision to Diff 378697.Oct 11 2021, 9:25 AM

Updating D110847: [CSSPGO] Unblock optimizations with pseudo probe instrumentation part 3.

Thanks for the changes to reduce probe overhead. As we chatted off patch, some information and verification of this change's impact on profile quality would be helpful.

The original patch contained several general changes, e.g, flipping the default values of those skipping APIs, which were observed to affect profile quality. After narrowing down the offending changes that are in SimplifyCFG, reverting them brought back the profile quality while still reduced the probe overhead in a decent amount.

What are the specific optimizations that were critical for profile quality in SimplifyCFG? On the other hand, it looks to me that some of the other unblocked optimizations like InstCombine could also have some impact on profile quality.
The change looks good, but I think it'd also be nice to have some insight and reasoning as to why certain optimizations have be blocked for profile quality, beyond experiment results which could vary from one workload to another.

hoy added a comment.Oct 12 2021, 9:09 AM

Thanks for the changes to reduce probe overhead. As we chatted off patch, some information and verification of this change's impact on profile quality would be helpful.

The original patch contained several general changes, e.g, flipping the default values of those skipping APIs, which were observed to affect profile quality. After narrowing down the offending changes that are in SimplifyCFG, reverting them brought back the profile quality while still reduced the probe overhead in a decent amount.

What are the specific optimizations that were critical for profile quality in SimplifyCFG? On the other hand, it looks to me that some of the other unblocked optimizations like InstCombine could also have some impact on profile quality.
The change looks good, but I think it'd also be nice to have some insight and reasoning as to why certain optimizations have be blocked for profile quality, beyond experiment results which could vary from one workload to another.

Good question. In general SimplifyCFG is more disruptive to the flow graph by folding, merging or removing blocks. Such flow graph changes affected the execution frequency of blocks, and samples collected on the new blocks have a problem to correlate to original blocks. A particular case is

/// If this basic block is simple enough, and if a predecessor branches to us
/// and one of our successors, fold the block into the predecessor and use
/// logical operations to pick the right destination.
bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,

Compared to SimplifyCFG, InstCombine is more like optimizing on instruction level. Moving an instruction around or folding instructions are likely having a smaller impact on the shape of flow graph and block frequencies.

wenlei accepted this revision.Oct 12 2021, 9:17 AM

Thanks for the changes to reduce probe overhead. As we chatted off patch, some information and verification of this change's impact on profile quality would be helpful.

The original patch contained several general changes, e.g, flipping the default values of those skipping APIs, which were observed to affect profile quality. After narrowing down the offending changes that are in SimplifyCFG, reverting them brought back the profile quality while still reduced the probe overhead in a decent amount.

What are the specific optimizations that were critical for profile quality in SimplifyCFG? On the other hand, it looks to me that some of the other unblocked optimizations like InstCombine could also have some impact on profile quality.
The change looks good, but I think it'd also be nice to have some insight and reasoning as to why certain optimizations have be blocked for profile quality, beyond experiment results which could vary from one workload to another.

Good question. In general SimplifyCFG is more disruptive to the flow graph by folding, merging or removing blocks. Such flow graph changes affected the execution frequency of blocks, and samples collected on the new blocks have a problem to correlate to original blocks. A particular case is

/// If this basic block is simple enough, and if a predecessor branches to us
/// and one of our successors, fold the block into the predecessor and use
/// logical operations to pick the right destination.
bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,

Compared to SimplifyCFG, InstCombine is more like optimizing on instruction level. Moving an instruction around or folding instructions are likely having a smaller impact on the shape of flow graph and block frequencies.

Ok, while some are destructive, perhaps there're still recoverable cases (e.g. SimplifyCondBranchToCondBranch). But we can deal with them later.

This revision is now accepted and ready to land.Oct 12 2021, 9:17 AM