Details

Reviewers

ChuanqiXu
SjoerdMeijer
chill
vzakhari
SBallantyne
kiranchandramohan

Commits

rG386aa2ab9d19: [FuncSpec] Increase the maximum number of times the specializer can run.

Summary

Changes the default value of FuncSpecMaxIters from 1 to 10. This allows specialization of recursive functions.
Adds an option to control the maximum codesize growth per function.
Measured ~45% performance uplift for SPEC2017:548.exchange2_r on AWS Graviton3.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

labrinea created this revision.Mar 10 2023, 11:38 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2023, 11:38 AM

Herald added subscribers: snehasish, ormris, hiraditya. · View Herald Transcript

labrinea requested review of this revision.Mar 10 2023, 11:38 AM

labrinea added a parent revision: D145394: [FuncSpec] Do not run pre-link when doing LTO..Mar 10 2023, 11:38 AM

Harbormaster completed remote builds in B218735: Diff 504224.Mar 10 2023, 2:25 PM

Bumps the score of 548.exchange2_r from SPEC2017 by about 60% on Neoverse N1.

kiranchandramohan added reviewers: vzakhari, SBallantyne, kiranchandramohan.Mar 11 2023, 3:24 AM

In D145819#4186533, @labrinea wrote:

Bumps the score of 548.exchange2_r from SPEC2017 by about 60% on Neoverse N1.

Can we achieve this with a lower FuncSpecMaxIters than 10, as 10 could be quite costly in terms of compile time?

In D145819#4186807, @xbolva00 wrote:

In D145819#4186533, @labrinea wrote:

Bumps the score of 548.exchange2_r from SPEC2017 by about 60% on Neoverse N1.

Can we achieve this with a lower FuncSpecMaxIters than 10, as 10 could be quite costly in terms of compile time?

Exchange requires 8 iterations, but we should be making the specializer as generic as possible. Note that FuncSpecMaxIters is the maximum number of times the specializer will run, and it only keeps running as long as each previous iteration has found specializations. As described in the summary, we only consider callsites which reside in the bodies of previous specializations after the propagation of constant arguments. For now, since specialization on literal constants is disabled, this only triggers for recursive functions through promotion of constant stack values to globals, which is quite rare. Therefore the impact on compilation times is zero for the general case. Only in code like exchange would I expect a significant regression. That said, I will post some numbers from the compile time tracker on the LLVM test suite.
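The iteration rule described above (run at most FuncSpecMaxIters times, and stop as soon as an iteration finds nothing new) can be sketched roughly as follows. This is a minimal standalone model, not the code in SCCP.cpp: `RunOnce` and `runSpecializer` are hypothetical names introduced for illustration.

```cpp
#include <functional>

// Hypothetical stand-in for one specializer iteration: returns true if the
// iteration created at least one new specialization.
using RunOnce = std::function<bool()>;

// Runs the specializer at most MaxIters times, stopping early as soon as an
// iteration finds nothing new. Returns the number of iterations executed.
unsigned runSpecializer(unsigned MaxIters, const RunOnce &Run) {
  unsigned Iters = 0;
  while (Iters < MaxIters) {
    ++Iters;
    if (!Run())
      break; // Nothing new was found, so later iterations cannot help.
  }
  return Iters;
}
```

Under this model, raising the cap costs nothing for modules where the first iteration finds no specializations, since the loop exits after a single run.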

Another important observation is that the compile time hit is attributed to the number of clones (the more, the higher) which suggests that the time is not spent on the specializer but on other optimization passes.

Metric: Instruction Count
Comparison with base (parent of D145374 - top of the chain)

Non LTO

testname	delta %
ClamAV	-0.101
7zip	+0.030
tramp3d-v4	-0.027
kimwitu++	-0.068
sqlite3	-0.213
mafft	-0.030
lencod	+0.020
SPASS	-0.324
consumer-typeset	+0.016
Bullet	+0.029
geomean	-0.067

testname	delta %
ClamAV	+0.221
7zip	-0.034
tramp3d-v4	-0.003
kimwitu++	-0.128
sqlite3	+0.081
mafft	-0.039
lencod	-0.016
SPASS	-0.795
consumer-typeset	-0.001
Bullet	-0.044
geomean	-0.076
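For readers who want to re-derive the geomean rows, here is a small sketch. It assumes the row is the geometric mean of the per-benchmark new/old ratios, reported back as a percent delta; the review does not state how the compile time tracker aggregates, so treat that as an assumption. `geomeanDeltaPercent` is a hypothetical helper.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark ratios, reported as a percent delta.
// Each input is a percent change: -0.101 means new/old = 0.99899.
// Assumption: this mirrors how the "geomean" row is aggregated.
double geomeanDeltaPercent(const std::vector<double> &DeltasPercent) {
  if (DeltasPercent.empty())
    return 0.0; // No benchmarks, no change.
  double LogSum = 0.0;
  for (double D : DeltasPercent)
    LogSum += std::log(1.0 + D / 100.0);
  return (std::exp(LogSum / DeltasPercent.size()) - 1.0) * 100.0;
}
```

Because every delta here is well under 1%, the geometric mean of the ratios is very close to the plain arithmetic mean of the percentages.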

As I said the compile time hit is attributed to the number of clones so these changes are coming from D145379 (added one extra specialization for sqlite3 and one for ClamAV when linking) and from D145394 (removed all the specializations for SPASS).

@labrinea Is this patch the top-level patch for recursive function-specialization that benefits exchange2? Is there any setting required or would this perform specialization by default? Also, assuming this is with llvm-project/flang.

In D145819#4192593, @kiranchandramohan wrote:

@labrinea Is this patch the top-level patch for recursive function-specialization that benefits exchange2? Is there any setting required or would this perform specialization by default? Also, assuming this is with llvm-project/flang.

You can see the patch dependencies in the chain. Some of them are merged, others are pending review. I will rebase it once the chain is ready. The patch benefits exchange2_r when compiled with flang. No other changes are necessary.

High-level question first: it looks like geomean compile times improve, and that is very surprising.
My guess is that one of the functional changes in findSpecialization is a good improvement in itself. Or maybe it is the caching of the CodeMetrics? Anyway, that's my question; I'm just curious whether you know what it is.

@SjoerdMeijer

As I said the compile time hit is attributed to the number of clones so these changes are coming from D145379 (added one extra specialization for sqlite3 and one for ClamAV when linking) and from D145394 (removed all the specializations for SPASS).

(the compile time data include other patches of this chain, not just this one)

labrinea planned changes to this revision.Mar 21 2023, 7:49 AM

Discarded most changes from previous revisions since the refactoring didn't seem necessary for keeping the compile times low.

Herald added a subscriber: hoy. · View Herald TranscriptMay 26 2023, 3:56 AM

Ping!

Harbormaster completed remote builds in B234822: Diff 526016.May 26 2023, 3:57 AM

In D145819#4375747, @labrinea wrote:

Ping!

@labrinea : Have you seen that there are problems with https://reviews.llvm.org/D150375 ?

chill mentioned this in D152556: [FuncSpec] Promote stack values before specialization.Jun 9 2023, 9:51 AM

mtsamis added a subscriber: mtsamis.Jun 9 2023, 10:20 AM

labrinea added a comment.Jun 12 2023, 2:56 AM

This comment was removed by labrinea.

rebased on D152799
added an option to control max clones per function across multiple iterations

labrinea edited parent revisions, added: D152799: [FuncSpec] Promote stack values before specialization; removed: D145394: [FuncSpec] Do not run pre-link when doing LTO..Jun 14 2023, 9:28 AM

Harbormaster completed remote builds in B238843: Diff 531379.Jun 14 2023, 9:29 AM

I think this revision makes more sense than the previous attempts. Perhaps it also makes the option MaxClones redundant, and so we could save some compile time by avoiding the heap-sort of the specialization candidates. But that's an improvement to consider for the future.

Ping @ChuanqiXu

Harbormaster completed remote builds in B253833: Diff 551999.Aug 21 2023, 7:09 AM

The patch itself looks pretty simple and the numbers look good too, so I think there may be no reason to block this.

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
875–877	It looks a little better to remove the variable.

This revision is now accepted and ready to land.Aug 21 2023, 11:20 PM

This revision was landed with ongoing or failed builds.Aug 22 2023, 1:41 AM

Closed by commit rG386aa2ab9d19: [FuncSpec] Increase the maximum number of times the specializer can run. (authored by labrinea). · Explain Why

This revision was automatically updated to reflect the committed changes.

labrinea added a commit: rG386aa2ab9d19: [FuncSpec] Increase the maximum number of times the specializer can run..

Diff 552266

llvm/include/llvm/Transforms/IPO/FunctionSpecialization.h

Show First 20 Lines • Show All 219 Lines • ▼ Show 20 Lines	class FunctionSpecializer {
    std::function<BlockFrequencyInfo &(Function &)> GetBFI;
    std::function<const TargetLibraryInfo &(Function &)> GetTLI;
    std::function<TargetTransformInfo &(Function &)> GetTTI;
    std::function<AssumptionCache &(Function &)> GetAC;

    SmallPtrSet<Function *, 32> Specializations;
    SmallPtrSet<Function *, 32> FullySpecialized;
    DenseMap<Function *, CodeMetrics> FunctionMetrics;
+   DenseMap<Function *, unsigned> FunctionGrowth;

  public:
    FunctionSpecializer(
        SCCPSolver &Solver, Module &M, FunctionAnalysisManager *FAM,
        std::function<BlockFrequencyInfo &(Function &)> GetBFI,
        std::function<const TargetLibraryInfo &(Function &)> GetTLI,
        std::function<TargetTransformInfo &(Function &)> GetTTI,
        std::function<AssumptionCache &(Function &)> GetAC)
▲ Show 20 Lines • Show All 68 Lines • Show Last 20 Lines

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp

Show First 20 Lines • Show All 87 Lines • ▼ Show 20 Lines

  static cl::opt<unsigned> MaxBlockPredecessors(
      "The maximum number of predecessors a basic block can have to be "
      "considered during the estimation of dead code"));

  static cl::opt<unsigned> MinFunctionSize(
      "funcspec-min-function-size", cl::init(300), cl::Hidden, cl::desc(
      "Don't specialize functions that have less than this number of "
      "instructions"));

+ static cl::opt<unsigned> MaxCodeSizeGrowth(
+     "funcspec-max-codesize-growth", cl::init(3), cl::Hidden, cl::desc(
+     "Maximum codesize growth allowed per function"));
+
  static cl::opt<unsigned> MinCodeSizeSavings(
      "funcspec-min-codesize-savings", cl::init(20), cl::Hidden, cl::desc(
      "Reject specializations whose codesize savings are less than this"
      "much percent of the original function size"));

  static cl::opt<unsigned> MinLatencySavings(
      "funcspec-min-latency-savings", cl::init(70), cl::Hidden, cl::desc(
      "Reject specializations whose latency savings are less than this"

▲ Show 20 Lines • Show All 732 Lines • ▼ Show 20 Lines

if (auto It = UniqueSpecs.find(S); It != UniqueSpecs.end()) {

}

    B += Visitor.getBonusFromPendingPHIs();

    LLVM_DEBUG(dbgs() << "FnSpecialization: Specialization bonus {CodeSize = "
                      << B.CodeSize << ", Latency = " << B.Latency
                      << ", Inlining = " << Score << "}\n");

-   auto IsProfitable = [&FuncSize](Bonus &B, unsigned Score) -> bool {
+   FunctionGrowth[F] += FuncSize - B.CodeSize;
+
+   auto IsProfitable = [](Bonus &B, unsigned Score, unsigned FuncSize,
+                          unsigned FuncGrowth) -> bool {
      // No check required.
      if (ForceSpecialization)
        return true;
      // Minimum inlining bonus.
      if (Score > MinInliningBonus * FuncSize / 100)
        return true;
      // Minimum codesize savings.
      if (B.CodeSize < MinCodeSizeSavings * FuncSize / 100)
        return false;
      // Minimum latency savings.
      if (B.Latency < MinLatencySavings * FuncSize / 100)
        return false;
+     // Maximum codesize growth.
+     if (FuncGrowth / FuncSize > MaxCodeSizeGrowth)
+       return false;
      return true;
    };

    // Discard unprofitable specialisations.
-   if (!IsProfitable(B, Score))
+   if (!IsProfitable(B, Score, FuncSize, FunctionGrowth[F]))
      continue;

    // Create a new specialisation entry.
    Score += std::max(B.CodeSize, B.Latency);
    auto &Spec = AllSpecs.emplace_back(F, S, Score);
    if (CS.getFunction() != F)

ChuanqiXu (inline comment, Not Done):

  // Discard specializations which exceed the codesize budget.
- unsigned &CodeSizeGrowth = FunctionGrowth[F];
- CodeSizeGrowth += FuncSize - B.CodeSize;
- if (CodeSizeGrowth / FuncSize > MaxCodeSizeGrowth)
+ FunctionGrowth[F] += FuncSize - B.CodeSize;
+ if (FunctionGrowth[F] / FuncSize > MaxCodeSizeGrowth)
    continue;

It looks a little better to remove the variable.

      Spec.CallSites.push_back(&CS);
    const unsigned Index = AllSpecs.size() - 1;
    UniqueSpecs[S] = Index;
    if (auto [It, Inserted] = SM.try_emplace(F, Index, Index + 1); !Inserted)
      It->second.second = Index + 1;
  }

▲ Show 20 Lines • Show All 219 Lines • Show Last 20 Lines
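To make the effect of the new per-function budget concrete, here is a small standalone model of the check in the hunk above. It is a sketch, not the LLVM implementation: `GrowthBudget` and `chargeAndCheck` are hypothetical names, functions are keyed by string instead of `Function *`, and it assumes `FuncSize > 0` and savings no larger than the function size.

```cpp
#include <map>
#include <string>

// Hypothetical model of the per-function codesize budget. Field names
// mirror the patch, but this is not the LLVM code.
struct GrowthBudget {
  unsigned MaxCodeSizeGrowth = 3; // default of funcspec-max-codesize-growth
  std::map<std::string, unsigned> FunctionGrowth; // accumulated growth

  // Charges one candidate specialization against the function's budget and
  // reports whether the function is still within it.
  // Assumes FuncSize > 0 and CodeSizeSavings <= FuncSize.
  bool chargeAndCheck(const std::string &F, unsigned FuncSize,
                      unsigned CodeSizeSavings) {
    unsigned &Growth = FunctionGrowth[F];
    // As in the hunk above, growth is charged before the check, so
    // rejected candidates also count against the budget.
    Growth += FuncSize - CodeSizeSavings;
    return Growth / FuncSize <= MaxCodeSizeGrowth; // integer division
  }
};
```

Because the ratio uses integer division, a function is only cut off once its accumulated growth reaches `(MaxCodeSizeGrowth + 1) * FuncSize`. The map persists across iterations, which is what keeps the number of clones of a recursive function from scaling with funcspec-max-iters.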

llvm/lib/Transforms/IPO/SCCP.cpp

	Show All 37 Lines
  STATISTIC(NumInstRemoved, "Number of instructions removed");
  STATISTIC(NumArgsElimed ,"Number of arguments constant propagated");
  STATISTIC(NumGlobalConst, "Number of globals found to be constant");
  STATISTIC(NumDeadBlocks , "Number of basic blocks unreachable");
  STATISTIC(NumInstReplaced,
            "Number of instructions replaced with (simpler) instruction");

  static cl::opt<unsigned> FuncSpecMaxIters(
-     "funcspec-max-iters", cl::init(1), cl::Hidden, cl::desc(
+     "funcspec-max-iters", cl::init(10), cl::Hidden, cl::desc(
      "The maximum number of iterations function specialization is run"));

  static void findReturnsToZap(Function &F,
                               SmallVector<ReturnInst *, 8> &ReturnsToZap,
                               SCCPSolver &Solver) {
    // We can only do this if we know that nothing else can call the function.
    if (!Solver.isArgumentTrackedFunction(&F))
      return;
	▲ Show 20 Lines • Show All 357 Lines • Show Last 20 Lines

llvm/test/Transforms/FunctionSpecialization/recursive-penalty.ll

This file was added.

; REQUIRES: asserts
; RUN: opt -passes="ipsccp<func-spec>,inline,instcombine,simplifycfg" -S \
; RUN:     -funcspec-min-function-size=23 -funcspec-max-iters=100 \
; RUN:     -debug-only=function-specialization < %s 2>&1 | FileCheck %s

; Make sure the number of specializations created are not
; linear to the number of iterations (funcspec-max-iters).

; CHECK: FnSpecialization: Created 4 specializations in module

@Global = internal constant i32 1, align 4

define internal void @recursiveFunc(ptr readonly %arg) {
  %temp = alloca i32, align 4
  %arg.load = load i32, ptr %arg, align 4
  %arg.cmp = icmp slt i32 %arg.load, 10000
  br i1 %arg.cmp, label %loop1, label %ret.block

loop1:
  br label %loop2

loop2:
  br label %loop3

loop3:
  br label %loop4

loop4:
  br label %block6

block6:
  call void @print_val(i32 %arg.load)
  %arg.add = add nsw i32 %arg.load, 1
  store i32 %arg.add, ptr %temp, align 4
  call void @recursiveFunc(ptr %temp)
  br label %loop4.end

loop4.end:
  %exit_cond1 = call i1 @exit_cond()
  br i1 %exit_cond1, label %loop4, label %loop3.end

loop3.end:
  %exit_cond2 = call i1 @exit_cond()
  br i1 %exit_cond2, label %loop3, label %loop2.end

loop2.end:
  %exit_cond3 = call i1 @exit_cond()
  br i1 %exit_cond3, label %loop2, label %loop1.end

loop1.end:
  %exit_cond4 = call i1 @exit_cond()
  br i1 %exit_cond4, label %loop1, label %ret.block

ret.block:
  ret void
}

define i32 @main() {
  call void @recursiveFunc(ptr @Global)
  ret i32 0
}

declare dso_local void @print_val(i32)
declare dso_local i1 @exit_cond()

This is an archive of the discontinued LLVM Phabricator instance.

[FuncSpec] Increase the maximum number of times the specializer can run.
ClosedPublic
