This is an archive of the discontinued LLVM Phabricator instance.

[hot-cold-split] split more than a cold region per function
Abandoned, Public

Authored by sebpop on Oct 23 2018, 10:10 AM.

Details

Reviewers

tejohnson
hiraditya
vsk

Summary

Remove a FIXME comment: allow all cold regions of a function to be outlined.

Event Timeline

sebpop created this revision. Oct 23 2018, 10:10 AM

@sebpop thanks for this patch! I don't see any problems with it (although I would prefer that the test explicitly check that outlined functions contain the correct instructions).

At a higher-level, I'm seeing some problems with the forward/back cold propagation done in getHotBlocks on internal projects. The propagation seems to stop when it encounters simple control flow, like an if-then-else or a for loop, after which cold code is unconditionally executed.

I have a prototype of a different propagation scheme which overcomes some of these limitations. The idea is to mark blocks which are post-dominated by a cold block, or are dominated by a cold block, as cold. This is able to handle the control flow I described, and isn't limited to requiring a single exit block. Could you give me a day to evaluate it further, run benchmarks etc. and report back? If it turns out to be promising, istm that it'd make sense to rebase this patch on top of it.
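The propagation rule described above can be illustrated on a toy control-flow graph. The following is only a sketch under stated assumptions, not the prototype's code: the `CFG` alias and the helpers `reachableAvoiding` and `propagateCold` are hypothetical names, and dominance is tested by plain reachability (a block is dominated by C if the entry cannot reach it once C is removed; it is post-dominated by C if it cannot reach any exit once C is removed) rather than with LLVM's DominatorTree and PostDominatorTree.

```cpp
#include <cassert>
#include <vector>

// Toy CFG (an assumption for illustration, not LLVM's data structures):
// Succs[B] lists the successors of block B. Block 0 is the entry, and any
// block without successors counts as an exit.
using CFG = std::vector<std::vector<int>>;

// Marks every block reachable from Starts along Edges when block Removed is
// deleted from the graph.
static std::vector<bool> reachableAvoiding(const CFG &Edges,
                                           const std::vector<int> &Starts,
                                           int Removed) {
  std::vector<bool> Seen(Edges.size(), false);
  std::vector<int> Stack;
  for (int S : Starts)
    if (S != Removed && !Seen[S]) {
      Seen[S] = true;
      Stack.push_back(S);
    }
  while (!Stack.empty()) {
    int B = Stack.back();
    Stack.pop_back();
    for (int N : Edges[B])
      if (N != Removed && !Seen[N]) {
        Seen[N] = true;
        Stack.push_back(N);
      }
  }
  return Seen;
}

// Hypothetical helper: returns Cold[B] == true for each seed block and for
// each block that a seed dominates or post-dominates.
std::vector<bool> propagateCold(const CFG &Succs,
                                const std::vector<int> &SeedCold) {
  const int N = static_cast<int>(Succs.size());
  CFG Preds(N);
  std::vector<int> Exits;
  for (int B = 0; B < N; ++B) {
    if (Succs[B].empty())
      Exits.push_back(B);
    for (int S : Succs[B])
      Preds[S].push_back(B);
  }

  std::vector<bool> Cold(N, false);
  for (int C : SeedCold) {
    assert(C >= 0 && C < N && "seed must be a valid block id");
    Cold[C] = true;
    // B is dominated by C if the entry cannot reach B once C is removed.
    std::vector<bool> FromEntry = reachableAvoiding(Succs, {0}, C);
    // B is post-dominated by C if B cannot reach any exit once C is removed.
    std::vector<bool> ToExit = reachableAvoiding(Preds, Exits, C);
    for (int B = 0; B < N; ++B)
      if (!FromEntry[B] || !ToExit[B])
        Cold[B] = true;
  }
  return Cold;
}
```

For a graph shaped like an if-then whose body runs a loop and then calls a noreturn function, seeding only the noreturn block marks the if-then entry and the loop as cold as well, because every path from them ends in that block. Note that this reachability test also marks unreachable blocks and blocks that can never reach an exit; a real implementation would rely on the dominator trees instead.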

In D53588#1272789, @vsk wrote:

@sebpop thanks for this patch! I don't see any problems with it (although I would prefer that the test explicitly check that outlined functions contain the correct instructions).

At a higher-level, I'm seeing some problems with the forward/back cold propagation done in getHotBlocks on internal projects. The propagation seems to stop when it encounters simple control flow, like an if-then-else or a for loop, after which cold code is unconditionally executed.

I have a prototype of a different propagation scheme which overcomes some of these limitations. The idea is to mark blocks which are post-dominated by a cold block, or are dominated by a cold block, as cold. This is able to handle the control flow I described, and isn't limited to requiring a single exit block. Could you give me a day to evaluate it further, run benchmarks etc. and report back? If it turns out to be promising, istm that it'd make sense to rebase this patch on top of it.

I was discussing with Aditya at the LLVM dev meeting that it would be good to move the static analysis of hot/cold blocks into an analysis pass instead of carrying it in the hot/cold split pass.
That way we will have a smaller transform pass, and other passes could use the static analysis of hot/cold blocks.

In D53588#1272789, @vsk wrote:

@sebpop thanks for this patch! I don't see any problems with it (although I would prefer that the test explicitly check that outlined functions contain the correct instructions).

At a higher-level, I'm seeing some problems with the forward/back cold propagation done in getHotBlocks on internal projects. The propagation seems to stop when it encounters simple control flow, like an if-then-else or a for loop, after which cold code is unconditionally executed.

Btw, here's an example. ToT does not outline given this code:

extern void sideeffect(int);

extern void __attribute__((noreturn)) sink();

void foo(int cond) {
  if (cond) {
    while (cond > 10) {
      --cond;
      sideeffect(0);
    }

    sink();
  }

  sideeffect(1);
}

With my prototype, the loop may be outlined:

Outlined Region:
; Function Attrs: minsize nounwind optsize ssp uwtable
define internal void @__outlined_foo_if.then(i32 %cond) #3 {
newFuncRoot:
  br label %if.then

if.then:                                          ; preds = %newFuncRoot
  %cmp3 = icmp sgt i32 %cond, 10
  br i1 %cmp3, label %while.body.preheader, label %while.end

while.body.preheader:                             ; preds = %if.then
  br label %while.body

while.body:                                       ; preds = %while.body.preheader, %while.body
  %cond.addr.04 = phi i32 [ %dec, %while.body ], [ %cond, %while.body.preheader ]
  %dec = add nsw i32 %cond.addr.04, -1
  tail call void @sideeffect(i32 0) #4
  %cmp = icmp sgt i32 %cond.addr.04, 11
  br i1 %cmp, label %while.body, label %while.end.loopexit

while.end.loopexit:                               ; preds = %while.body
  br label %while.end

while.end:                                        ; preds = %while.end.loopexit, %if.then
  tail call void (...) @sink() #5
  unreachable
}
HotColdSplitting: Outlined 12 insts

I have a prototype of a different propagation scheme which overcomes some of these limitations. The idea is to mark blocks which are post-dominated by a cold block, or are dominated by a cold block, as cold. This is able to handle the control flow I described, and isn't limited to requiring a single exit block. Could you give me a day to evaluate it further, run benchmarks etc. and report back? If it turns out to be promising, istm that it'd make sense to rebase this patch on top of it.

In D53588#1272895, @sebpop wrote:

In D53588#1272789, @vsk wrote:

@sebpop thanks for this patch! I don't see any problems with it (although I would prefer that the test explicitly check that outlined functions contain the correct instructions).

At a higher-level, I'm seeing some problems with the forward/back cold propagation done in getHotBlocks on internal projects. The propagation seems to stop when it encounters simple control flow, like an if-then-else or a for loop, after which cold code is unconditionally executed.

I have a prototype of a different propagation scheme which overcomes some of these limitations. The idea is to mark blocks which are post-dominated by a cold block, or are dominated by a cold block, as cold. This is able to handle the control flow I described, and isn't limited to requiring a single exit block. Could you give me a day to evaluate it further, run benchmarks etc. and report back? If it turns out to be promising, istm that it'd make sense to rebase this patch on top of it.

I was discussing with Aditya at the LLVM dev meeting that it would be good to move the static analysis of hot/cold blocks into an analysis pass instead of carrying it in the hot/cold split pass.
That way we will have a smaller transform pass, and other passes could use the static analysis of hot/cold blocks.

Aditya mentioned this to me as well :). Definitely on my list!

tejohnson mentioned this in D53534: [hot-cold-split] Name split functions with ".cold" suffix. Oct 23 2018, 12:27 PM

As noted in my latest comment on D53534, the current naming scheme of extracted functions is such that for IR from a release built clang we would get identical names if multiple functions are extracted from the same original function.

tejohnson mentioned this in rL345178: [hot-cold-split] Name split functions with ".cold" suffix. Oct 24 2018, 11:56 AM

@sebpop are you interested in rebasing this on the new cold block propagation code? If not, I'd be happy to give it a try. Is the main challenge in teaching CodeExtractor to update the DT and PDT?

In D53588#1276693, @vsk wrote:

@sebpop are you interested in rebasing this on the new cold block propagation code? If not, I'd be happy to give it a try. Is the main challenge in teaching CodeExtractor to update the DT and PDT?

Please go ahead and give it a try. I haven't thought about maintaining DT and PDT; you are right that they need to be maintained after your last change.

For splitting more than one cold region, maintaining a DT may be expensive, but we don't have to do that. All we need is to 'color/mark' the blocks which we want to outline. In the next iteration the colored blocks need not be considered. It may be slightly non-trivial in the general case, which involves coloring SEME regions and reasoning about the DT/PDT of non-colored blocks. What we can do in subsequent iterations is take sub-graphs which do not intersect anywhere except at entry or exit. This way the DT/PDT will still be preserved in the non-intersecting regions. I think jump-threading also works (or should work) along similar lines.

In D53588#1280639, @hiraditya wrote:

For splitting more than one cold region, maintaining a DT may be expensive, but we don't have to do that. All we need is to 'color/mark' the blocks which we want to outline. In the next iteration the colored blocks need not be considered. It may be slightly non-trivial in the general case, which involves coloring SEME regions and reasoning about the DT/PDT of non-colored blocks. What we can do in subsequent iterations is take sub-graphs which do not intersect anywhere except at entry or exit. This way the DT/PDT will still be preserved in the non-intersecting regions. I think jump-threading also works (or should work) along similar lines.

I'm prototyping a patch along these lines and will try to share it soon. Outlining non-intersecting sub-graphs is key to the approach. The idea is to build a worklist of all (possibly multiple-entry) outlining regions up front, then extract+outline sub-graphs from each region until the worklist is empty.

For reference: https://reviews.llvm.org/D53887
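The coloring idea discussed above, combined with a worklist built up front, might look roughly like the sketch below. It is an illustration only and makes several assumptions: blocks are plain integer ids, the candidate regions are already computed, `Region` and `selectRegionsToOutline` are hypothetical names, and a region that overlaps an already colored block is simply dropped rather than being cut down to a non-intersecting sub-graph.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Block ids forming one candidate cold region (toy representation).
using Region = std::vector<int>;

// Hypothetical helper: walks a precomputed worklist of candidate regions and
// returns the ones that would be outlined, coloring blocks as they are
// claimed so that later regions never reuse them.
std::vector<Region> selectRegionsToOutline(const std::vector<Region> &Worklist) {
  std::set<int> Colored; // blocks already claimed by an outlined region
  std::vector<Region> Outlined;
  for (const Region &R : Worklist) {
    // Mirror the patch's check: do not outline a region with only one block.
    if (R.size() < 2)
      continue;
    bool Intersects = false;
    for (int B : R) {
      assert(B >= 0 && "block ids are non-negative");
      if (Colored.count(B))
        Intersects = true;
    }
    // Conservative simplification: skip a region that overlaps an earlier one.
    if (Intersects)
      continue;
    Colored.insert(R.begin(), R.end());
    Outlined.push_back(R);
  }
  return Outlined;
}
```

Because colored blocks are never revisited, the regions that survive are pairwise disjoint, which is the property the discussion relies on to avoid recomputing the DT and PDT between extractions.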

sebpop abandoned this revision. Nov 1 2018, 11:22 AM

brzycki added a subscriber: brzycki. Nov 1 2018, 11:22 AM

brzycki removed a subscriber: brzycki.

Revision Contents

Path                                                 Size
llvm/lib/Transforms/IPO/HotColdSplitting.cpp         34 lines
llvm/test/Transforms/HotColdSplit/split-cold-3.ll    50 lines

Diff 170687

llvm/lib/Transforms/IPO/HotColdSplitting.cpp

[... 255 lines not shown ...]
   HotColdSplitting(ProfileSummaryInfo *ProfSI,
       function_ref<BlockFrequencyInfo *(Function &)> GBFI,
       function_ref<TargetTransformInfo &(Function &)> GTTI,
       std::function<OptimizationRemarkEmitter &(Function &)> *GORE)
       : PSI(ProfSI), GetBFI(GBFI), GetTTI(GTTI), GetORE(GORE) {}
   bool run(Module &M);

 private:
   bool shouldOutlineFrom(const Function &F) const;
-  const Function *outlineColdBlocks(Function &F, const DenseSetBB &ColdBlock,
+  void outlineColdBlocks(Function &F, const DenseSetBB &ColdBlock,
       DominatorTree *DT, PostDomTree *PDT);
   Function *extractColdRegion(const SmallVectorImpl<BasicBlock *> &Region,
       DominatorTree *DT, BlockFrequencyInfo *BFI,
       OptimizationRemarkEmitter &ORE);
   bool isOutlineCandidate(const SmallVectorImpl<BasicBlock *> &Region,
       const BasicBlock *Exit) const {
     if (!Exit)
       return false;

[... 101 lines not shown ...]
     return OptimizationRemarkMissed(DEBUG_TYPE, "ExtractFailed",
         &*Region[0]->begin())
         << "Failed to extract region at block "
         << ore::NV("Block", Region.front());
   });
   return nullptr;
 }

 // Return the function created after outlining, nullptr otherwise.
-const Function *HotColdSplitting::outlineColdBlocks(Function &F,
+void HotColdSplitting::outlineColdBlocks(Function &F,
     const DenseSetBB &HotBlocks,
-    DominatorTree *DT,
-    PostDomTree *PDT) {
+    DominatorTree *DT, PostDomTree *PDT) {
   auto BFI = GetBFI(F);
   auto &ORE = (*GetORE)(F);
   // Walking the dominator tree allows us to find the largest
   // cold region.
   BasicBlock *Begin = DT->getRootNode()->getBlock();

   // Early return if the beginning of the function has been marked cold,
   // otherwise all the function gets outlined.
   if (PSI->isColdBB(Begin, BFI) || !HotBlocks.count(Begin))
-    return nullptr;
+    return;

+  DenseSetBB OutlinedBBs;
   for (auto I = df_begin(Begin), E = df_end(Begin); I != E; ++I) {
     BasicBlock *BB = *I;

+    // Skip over outlined blocks.
+    if (OutlinedBBs.count(BB))
+      continue;
+
     if (PSI->isColdBB(BB, BFI) || !HotBlocks.count(BB)) {
       SmallVector<BasicBlock *, 4> ValidColdRegion, Region;
       BasicBlock *Exit = (*PDT)[BB]->getIDom()->getBlock();
       BasicBlock *ExitColdRegion = nullptr;

       // Estimated cold region between a BB and its dom-frontier.
       while (Exit && isSingleEntrySingleExit(BB, Exit, DT, PDT, Region) &&
           isOutlineCandidate(Region, Exit)) {
         ExitColdRegion = Exit;
         ValidColdRegion = Region;
         Region.clear();
         // Update Exit recursively to its dom-frontier.
         Exit = (*PDT)[Exit]->getIDom()->getBlock();
       }

       if (ExitColdRegion) {
         // Do not outline a region with only one block.
         if (ValidColdRegion.size() == 1)
           continue;

         ++NumColdSESEFound;
         ValidColdRegion.push_back(ExitColdRegion);
-        // Candidate for outlining. FIXME: Continue outlining.
-        return extractColdRegion(ValidColdRegion, DT, BFI, ORE);
+        if (const Function *Outlined =
+                extractColdRegion(ValidColdRegion, DT, BFI, ORE)) {
+          OutlinedFunctions.insert(Outlined);
+          for (BasicBlock *I : ValidColdRegion)
+            OutlinedBBs.insert(I);
+        }
       }
     }
   }
-  return nullptr;
 }

 bool HotColdSplitting::run(Module &M) {
   for (auto &F : M) {
     if (!shouldOutlineFrom(F))
       continue;
     DominatorTree DT(F);
     PostDomTree PDT(F);
     PDT.recalculate(F);
     DenseSetBB HotBlocks;
     if (EnableStaticAnalyis) // Static analysis of cold blocks.
       HotBlocks = getHotBlocks(F);

-    const Function *Outlined = outlineColdBlocks(F, HotBlocks, &DT, &PDT);
-    if (Outlined)
-      OutlinedFunctions.insert(Outlined);
+    outlineColdBlocks(F, HotBlocks, &DT, &PDT);
   }
   return true;
 }

 bool HotColdSplittingLegacyPass::runOnModule(Module &M) {
   if (skipModule(M))
     return false;
   ProfileSummaryInfo *PSI =
[... 60 lines not shown ...]

llvm/test/Transforms/HotColdSplit/split-cold-3.ll

This file was added.

; RUN: opt -hotcoldsplit -pass-remarks=hotcoldsplit < %s 2>&1 | FileCheck %s
; RUN: opt -passes=hotcoldsplit -pass-remarks=hotcoldsplit < %s 2>&1 | FileCheck %s

; CHECK: remark: <unknown>:0:0: fun split cold code into fun_B.else
; CHECK: remark: <unknown>:0:0: fun split cold code into fun_A.else

define void @fun() {
entry:
  br i1 undef, label %A.then, label %A.else

A.else:
  br label %A.then4

A.then4:
  br i1 undef, label %A.then5, label %A.end

A.then5:
  br label %A.cleanup

A.end:
  br label %A.cleanup

A.cleanup:
  %A.cleanup.dest.slot.0 = phi i32 [ 1, %A.then5 ], [ 0, %A.end ]
  unreachable

A.then:
  br i1 undef, label %B.then, label %B.else

B.then:
  ret void

B.else:
  br label %B.then4

B.then4:
  br i1 undef, label %B.then5, label %B.end

B.then5:
  br label %B.cleanup

B.end:
  br label %B.cleanup

B.cleanup:
  %B.cleanup.dest.slot.0 = phi i32 [ 1, %B.then5 ], [ 0, %B.end ]
  unreachable
}