This is an archive of the discontinued LLVM Phabricator instance.


Differential D101468

[Passes] Run sinking/hoisting in SimplifyCFG earlier.
ClosedPublic

Authored by fhahn on Apr 28 2021, 9:15 AM.


Details

Reviewers

nikic
spatel
RKSimon
lebedev.ri

Commits

rGed9df5bd2f50: [Passes] Run sinking/hoisting in SimplifyCFG earlier.

Summary

Hoisting and sinking instructions out of conditional blocks enables
additional vectorization by:

Executing memory accesses unconditionally.
Reducing the number of instructions that need predication.
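
To make the two points above concrete, here is a minimal C++ sketch, loosely modeled on the `loop2` test in this patch. The names `loopBranchy`, `loopHoistedSunk` and `variantsAgree` are hypothetical, and the second function applies the hoisting and sinking by hand. It only illustrates the shape the transform is meant to produce; it is not SimplifyCFG output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: the multiply and the store to B[i] appear once per arm, so both
// sit inside conditional blocks.
void loopBranchy(const float *A, float *B, const int *C, float x,
                 std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (C[i] == 20)
      B[i] = A[i] * x;
    else
      B[i] = A[i] * x + B[i];
  }
}

// After (applied by hand): the common multiply is hoisted above the branch
// and the common store is sunk below it. Only the load of B[i] and the add
// remain conditional.
void loopHoistedSunk(const float *A, float *B, const int *C, float x,
                     std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    float mul = A[i] * x; // hoisted: executed unconditionally
    float val = mul;
    if (C[i] != 20)
      val = mul + B[i];   // still conditional
    B[i] = val;           // sunk: a single unconditional store
  }
}

// Runs both variants on identical inputs and reports whether they agree.
// Inputs are small integers so every float operation is exact.
bool variantsAgree(std::size_t n) {
  std::vector<float> A(n), B1(n), B2(n);
  std::vector<int> C(n);
  for (std::size_t i = 0; i < n; ++i) {
    A[i] = static_cast<float>(i % 7);
    B1[i] = B2[i] = static_cast<float>(i % 5);
    C[i] = (i % 3 == 0) ? 20 : 0;
  }
  loopBranchy(A.data(), B1.data(), C.data(), 2.0f, n);
  loopHoistedSunk(A.data(), B2.data(), C.data(), 2.0f, n);
  return B1 == B2;
}
```

Calling `variantsAgree(10000)` checks that the hand-transformed loop computes the same values as the branchy one for the trip count used in the test.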

After disabling early hoisting / sinking, we miss out on a few
vectorization opportunities. One of those is causing a ~10% performance
regression in one of the Geekbench benchmarks on AArch64.

This patch tries to recover the regression by running hoisting/sinking
as part of a SimplifyCFG run after LoopRotate and before LoopVectorize.
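
The patch enables both transforms by passing an options object with chained setters to the pass constructor. The following self-contained C++ sketch mimics that pattern with toy stand-ins. `ToyCFGOptions` and `ToyCFGPass` are hypothetical types, not LLVM classes, and the assumption that both flags default to off is inferred from the patch having to enable them explicitly.

```cpp
#include <cassert>

// Toy stand-in for an options object with fluent setters. Only the two
// flags touched by this patch are modeled; the `false` defaults are an
// assumption, not taken from the LLVM sources.
struct ToyCFGOptions {
  bool HoistCommonInsts = false;
  bool SinkCommonInsts = false;

  ToyCFGOptions &hoistCommonInsts(bool B) {
    HoistCommonInsts = B;
    return *this; // returning *this is what allows chaining
  }
  ToyCFGOptions &sinkCommonInsts(bool B) {
    SinkCommonInsts = B;
    return *this;
  }
};

// Toy stand-in for a pass that records the options it was built with.
struct ToyCFGPass {
  ToyCFGOptions Opts;

  ToyCFGPass() = default;
  explicit ToyCFGPass(const ToyCFGOptions &O) : Opts(O) {}

  bool willHoistOrSink() const {
    return Opts.HoistCommonInsts || Opts.SinkCommonInsts;
  }
};
```

In this model, `ToyCFGPass()` neither hoists nor sinks, while `ToyCFGPass(ToyCFGOptions().hoistCommonInsts(true).sinkCommonInsts(true))` does both, which mirrors the before/after lines in the diff.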

Note that in the legacy pass-manager, we run LoopRotate just before
vectorization again and there's no SimplifyCFG run in between, so the
sinking/hoisting may impact the later run on LoopRotate. But the impact
should be limited and the benefit of hoisting/sinking at this stage
should outweigh the risk of not rotating.

Compile-time impact looks slightly positive for most cases.
http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=instructions

NewPM-O3: geomean -0.19%
NewPM-ReleaseThinLTO: geomean -0.54%
NewPM-ReleaseLTO-g: geomean -0.03%

With a few benchmarks seeing a notable increase, but also some
improvements.

Alternative to D101290.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Apr 28 2021, 9:15 AM

Herald added subscribers: hiraditya, kristof.beyls. Apr 28 2021, 9:15 AM

fhahn requested review of this revision. Apr 28 2021, 9:15 AM

Herald added a project: Restricted Project. Apr 28 2021, 9:15 AM

fhahn mentioned this in D101290: [LV] Try to sink and hoist inside candidate loops for vectorization. Apr 28 2021, 9:23 AM

Thanks! I like this a lot more than the LoopVectorize variant. I also think it makes sense that this happens at the end of the "function simplification" pipeline, which means that the inliner will see the hoisted/sunk IR, which should result in a more accurate cost (as sinkable/hoistable instructions will not be counted multiple times).

It might make sense to do this even earlier, as I suspect that passes like GVN would also benefit from hoisted/sunk IR. But this looks like a reasonable starting point, at least to me.

In D101468#2723180, @nikic wrote:

> Thanks! I like this a lot more than the LoopVectorize variant. I also think it makes sense that this happens at the end of the "function simplification" pipeline, which means that the inliner will see the hoisted/sunk IR, which should result in a more accurate cost (as sinkable/hoistable instructions will not be counted multiple times).

Nonono, this is actually the opposite from what should happen.
We will then over-inline, and grow the size even more.

Do we have any simplifycfg run before LV but after inliner?

> It might make sense to do this even earlier, as I suspect that passes like GVN would also benefit from hoisted/sunk IR. But this looks like a reasonable starting point, at least to me.

In D101468#2723229, @lebedev.ri wrote:

> Nonono, this is actually the opposite from what should happen.
> We will then over-inline, and grow the size even more.
>
> Do we have any simplifycfg run before LV but after inliner?

This is even visible in the http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=size-total

In D101468#2723229, @lebedev.ri wrote:

> Nonono, this is actually the opposite from what should happen.
> We will then over-inline, and grow the size even more.

I don't follow. It makes the inlining cost more accurate, so if that results in over-inlining, then the issue would be with the inlining cost model, not this change. It is my understanding that the inliner should see functions in their maximally simplified form (and prior to size-increasing optimizations like runtime unrolling and vectorization), so that it can make the most accurate decisions. Intentionally crippling the function simplification pipeline to get less inlining seems rather backward. Taken ad absurdum, that would mean that we shouldn't simplify functions prior to inlining at all.

> This is even visible in the http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=size-total

This looks pretty normal for a phase ordering change. SPASS on O3 is up 0.46%, lencod on ThinLTO is down 0.66%. The geomeans are 0.1% up or down for NewPM.

In D101468#2723273, @nikic wrote:

> I don't follow. It makes the inlining cost more accurate, so if that results in over-inlining, then the issue would be with the inlining cost model, not this change. It is my understanding that the inliner should see functions in their maximally simplified form (and prior to size-increasing optimizations like runtime unrolling and vectorization), so that it can make the most accurate decisions. Intentionally crippling the function simplification pipeline to get less inlining seems rather backward. Taken ad absurdum, that would mean that we shouldn't simplify functions prior to inlining at all.
>
> This looks pretty normal for a phase ordering change. SPASS on O3 is up 0.46%, lencod on ThinLTO is down 0.66%. The geomeans are 0.1% up or down for NewPM.

I see. So i take it, D101231, if restricted to function-terminating block, makes more sense to you now, in general?

Harbormaster completed remote builds in B101440: Diff 341237. Apr 28 2021, 10:49 AM

In D101468#2723323, @lebedev.ri wrote:

> I see. So i take it, D101231, if restricted to function-terminating block, makes more sense to you now, in general?

I think I might be missing how D101231 is related to the current patch.

Is the concern that we may hoist/sink instructions from cold into hot blocks? I think for hoisting this should definitely not happen, because we only hoist instructions that are common to both paths. I think the same is mostly true for sinking, although it supports sinking from only a subset of predecessors.

In any case, IIUC the size reduction should be accurate and I fail to see how this would lead to over-inlining. If I am missing anything, it would be great if you could elaborate in a bit more detail what cases you are concerned about.
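
The argument that only instructions common to both paths are hoisted, and that hoisted instructions are then counted once instead of twice, can be sketched with a toy model in C++. `Block`, `hoistCommonPrefix` and `totalSize` are hypothetical names; instructions are compared as plain strings, which is not how SimplifyCFG decides equivalence, and the real transform has further legality checks that are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Block = std::vector<std::string>;

// Toy model: move the longest identical leading run of instructions out of
// both successor blocks, append it to Hoisted, and return how many
// instructions were moved. Nothing that appears in only one block is moved.
std::size_t hoistCommonPrefix(Block &Then, Block &Else, Block &Hoisted) {
  std::size_t N = 0;
  while (N < Then.size() && N < Else.size() && Then[N] == Else[N])
    ++N;
  const auto Count = static_cast<std::ptrdiff_t>(N);
  Hoisted.insert(Hoisted.end(), Then.begin(), Then.begin() + Count);
  Then.erase(Then.begin(), Then.begin() + Count);
  Else.erase(Else.begin(), Else.begin() + Count);
  return N;
}

// Total instruction count that a purely size-based cost model would see.
std::size_t totalSize(const Block &Pred, const Block &Then,
                      const Block &Else) {
  return Pred.size() + Then.size() + Else.size();
}
```

With a predecessor `{"cmp"}`, a then-block `{"load A", "mul", "store B"}` and an else-block `{"load A", "mul", "load B", "add", "store B"}`, the toy hoists the two shared leading instructions and the total count drops from 9 to 7, because each shared instruction is now counted once.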

I'm actually in favor of doing this pre-inliner, because that will simplify an upcoming patch, should i post it :)
If everyone is okay with extra inlining, i think this is fine.
I checked, and since this runs post-looprotation, we could do this.
So LG.

llvm/lib/Passes/PassBuilder.cpp
1325–1326	Drop comment

This revision is now accepted and ready to land. Apr 29 2021, 4:36 AM

This revision was landed with ongoing or failed builds. Apr 30 2021, 4:35 AM

Closed by commit rGed9df5bd2f50: [Passes] Run sinking/hoisting in SimplifyCFG earlier. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rGed9df5bd2f50: [Passes] Run sinking/hoisting in SimplifyCFG earlier.

Herald added subscribers: wenlei, steven_wu. Apr 30 2021, 4:35 AM

fhahn added inline comments. Apr 30 2021, 4:36 AM

llvm/lib/Passes/PassBuilder.cpp
1325–1326	done, thanks!

jeroen.dobbelaere added a subscriber: jeroen.dobbelaere. May 31 2021, 2:15 AM

lebedev.ri mentioned this in D104445: [SimplifyCFGPass] Tail-merging function-terminating blocks. Jun 17 2021, 3:07 AM

lebedev.ri mentioned this in D104870: [SimplifyCFG] Tail-merging all blocks with `unreachable` terminator. Jun 30 2021, 12:58 AM

nikic mentioned this in D156532: [Pipelines] Perform hoisting prior to GVN. Jul 28 2023, 6:21 AM

nikic mentioned this in rG1f37088679a5: [Pipelines] Perform hoisting prior to GVN. Aug 7 2023, 1:06 AM

Revision Contents

Path	Size
llvm/lib/Passes/PassBuilder.cpp	5 lines
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp	6 lines
llvm/test/Transforms/PGOProfile/Inputs/thinlto_cspgo_bar_use.ll	5 lines
llvm/test/Transforms/PGOProfile/cspgo_profile_summary.ll	10 lines
llvm/test/Transforms/PGOProfile/thinlto_cspgo_use.ll	1 line
llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-sinking-required-for-vectorization.ll	59 lines

Diff 341843

llvm/lib/Passes/PassBuilder.cpp

[841 lines not shown; context: FPM.addPass(createFunctionToLoopPassAdaptor(]
       EnableMSSALoopDependency, /*UseBlockFrequencyInfo=*/true, DebugLogging));

   if (PTO.Coroutines)
     FPM.addPass(CoroElidePass());

   for (auto &C : ScalarOptimizerLateEPCallbacks)
     C(FPM, Level);

-  FPM.addPass(SimplifyCFGPass());
+  FPM.addPass(SimplifyCFGPass(
+      SimplifyCFGOptions().hoistCommonInsts(true).sinkCommonInsts(true)));
   FPM.addPass(InstCombinePass());
   invokePeepholeEPCallbacks(FPM, Level);

   if (EnableCHR && Level == OptimizationLevel::O3 && PGOOpt &&
       (PGOOpt->Action == PGOOptions::IRUse ||
        PGOOpt->Action == PGOOptions::SampleUse))
     FPM.addPass(ControlHeightReductionPass());

[457 lines not shown; context: PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,]
   // Now that we've formed fast to execute loop structures, we do further
   // optimizations. These are run afterward as they might block doing complex
   // analyses and transforms such as what are needed for loop vectorization.

   // Cleanup after loop vectorization, etc. Simplification passes like CVP and
   // GVN, loop transforms, and others have already run, so it's now better to
   // convert to more optimized IR using more aggressive simplify CFG options.
   // The extra sinking transform can create larger basic blocks, so do this
   // before SLP vectorization.
-  // FIXME: study whether hoisting and/or sinking of common instructions should
-  // be delayed until after SLP vectorizer.
   OptimizePM.addPass(SimplifyCFGPass(SimplifyCFGOptions()
                                          .forwardSwitchCondToPhi(true)
                                          .convertSwitchToLookupTable(true)
                                          .needCanonicalLoops(false)
                                          .hoistCommonInsts(true)
                                          .sinkCommonInsts(true)));

   // Optimize parallel scalar instruction chains into SIMD instructions.
   if (PTO.SLPVectorization) {
[1,858 lines not shown]

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp

[503 lines not shown; context: if (OptLevel > 1) {]
     MPM.add(createLICMPass(LicmMssaOptCap, LicmMssaNoAccForPromotionCap));
   }

   addExtensionsToPM(EP_ScalarOptimizerLate, MPM);

   if (RerollLoops)
     MPM.add(createLoopRerollPass());

-  MPM.add(createCFGSimplificationPass()); // Merge & remove BBs
+  // Merge & remove BBs and sink & hoist common instructions.
+  MPM.add(createCFGSimplificationPass(
+      SimplifyCFGOptions().hoistCommonInsts(true).sinkCommonInsts(true)));
   // Clean up after everything.
   MPM.add(createInstructionCombiningPass());
   addExtensionsToPM(EP_Peephole, MPM);

   if (EnableCHR && OptLevel >= 3 &&
       (!PGOInstrUse.empty() || !PGOSampleUse.empty() || EnablePGOCSInstrGen))
     MPM.add(createControlHeightReductionLegacyPass());
 }
[297 lines not shown; context: if (OptLevel > 1 && ExtraVectorizerPasses) {]
     MPM.add(createInstructionCombiningPass());
   }

   // Cleanup after loop vectorization, etc. Simplification passes like CVP and
   // GVN, loop transforms, and others have already run, so it's now better to
   // convert to more optimized IR using more aggressive simplify CFG options.
   // The extra sinking transform can create larger basic blocks, so do this
   // before SLP vectorization.
-  // FIXME: study whether hoisting and/or sinking of common instructions should
-  // be delayed until after SLP vectorizer.
   MPM.add(createCFGSimplificationPass(SimplifyCFGOptions()
                                           .forwardSwitchCondToPhi(true)
                                           .convertSwitchToLookupTable(true)
                                           .needCanonicalLoops(false)
                                           .hoistCommonInsts(true)
                                           .sinkCommonInsts(true)));

   if (SLPVectorize) {
[459 lines not shown]

llvm/test/Transforms/PGOProfile/Inputs/thinlto_cspgo_bar_use.ll

 target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @odd = common dso_local global i32 0, align 4
 @even = common dso_local global i32 0, align 4

 define dso_local void @bar(i32 %n) #0 !prof !29 {
 entry:
   %call = tail call fastcc i32 @cond(i32 %n)
   %tobool = icmp eq i32 %call, 0
   br i1 %tobool, label %if.else, label %if.then, !prof !30

 if.then:
+; The calls here ensure that the instructions are not hoisted by SimplifyCFG.
+  call void @clobber()
   %0 = load i32, i32* @odd, align 4
   %inc = add i32 %0, 1
   store i32 %inc, i32* @odd, align 4
+  call void @clobber()
   br label %if.end

 if.else:
   %1 = load i32, i32* @even, align 4
   %inc1 = add i32 %1, 1
   store i32 %inc1, i32* @even, align 4
   br label %if.end

 if.end:
   ret void
 }

+declare void @clobber()

 define internal fastcc i32 @cond(i32 %i) #1 !prof !29 !PGOFuncName !35 {
 entry:
   %rem = srem i32 %i, 2
   ret i32 %rem
 }

 attributes #0 = { "target-cpu"="x86-64" }
 attributes #1 = { inlinehint noinline }
[34 lines not shown]

llvm/test/Transforms/PGOProfile/cspgo_profile_summary.ll

[97 lines not shown; context: for.body:]
   call fastcc void @barbar()
   %add4 = add nsw i32 %call, 2
   br label %for.cond

 for.end:
   ret void
 }
 ; CSPGOSUMMARY-LABEL: @foo
-; CSPGOSUMMARY: %even.sink{{[0-9]*}} = select i1 %tobool.i{{[0-9]*}}, i32* @even, i32* @odd
-; CSPGOSUMMARY-SAME: !prof ![[BW1_CSPGO_FOO:[0-9]+]]
-; CSPGOSUMMARY: %even.sink{{[0-9]*}} = select i1 %tobool.i{{[0-9]*}}, i32* @even, i32* @odd
-; CSPGOSUMMARY-SAME: !prof ![[BW2_CSPGO_FOO:[0-9]+]]
+; CSPGOSUMMARY: %odd.sink.i{{[0-9]*}} = select i1 %tobool.i{{[0-9]*}}, i32* @even, i32* @odd
+; CSPGOSUMMARY-SAME: !prof ![[BW_CSPGO_BAR]]
+; CSPGOSUMMARY: %odd.sink.i{{[0-9]*}} = select i1 %tobool.i{{[0-9]*}}, i32* @even, i32* @odd
+; CSPGOSUMMARY-SAME: !prof ![[BW_CSPGO_BAR]]

 declare dso_local i32 @bar_m(i32)
 declare dso_local i32 @bar_m2(i32)

 define internal fastcc void @barbar() {
 entry:
   %0 = load i32, i32* @odd, align 4
   %inc = add i32 %0, 1
[29 lines not shown]
 ; CSPGOSUMMARY: {{![0-9]+}} = !{i32 1, !"CSProfileSummary", !{{[0-9]+}}}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"ProfileFormat", !"CSInstrProf"}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"TotalCount", i64 1299950}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"MaxCount", i64 200000}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"MaxInternalCount", i64 100000}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"MaxFunctionCount", i64 200000}
 ; CSPGOSUMMARY: {{![0-9]+}} = !{!"NumCounts", i64 23}
 ; CSPGOSUMMARY-DAG: ![[BW_CSPGO_BAR]] = !{!"branch_weights", i32 100000, i32 100000}
-; CSPGOSUMMARY-DAG: ![[BW1_CSPGO_FOO]] = !{!"branch_weights", i32 100000, i32 0}
-; CSPGOSUMMARY-DAG: ![[BW2_CSPGO_FOO]] = !{!"branch_weights", i32 0, i32 100000}

llvm/test/Transforms/PGOProfile/thinlto_cspgo_use.ll

 ; REQUIRES: x86-registered-target

 ; RUN: opt -module-summary %s -o %t1.bc
 ; RUN: opt -module-summary %S/Inputs/thinlto_cspgo_bar_use.ll -o %t2.bc
 ; RUN: llvm-profdata merge %S/Inputs/thinlto_cs.proftext -o %t3.profdata
 ; RUN: llvm-lto2 run -lto-cspgo-profile-file=%t3.profdata -pgo-instrument-entry=false -save-temps -o %t %t1.bc %t2.bc \
 ; RUN: -r=%t1.bc,foo,pl \
 ; RUN: -r=%t1.bc,bar,l \
 ; RUN: -r=%t1.bc,main,plx \
 ; RUN: -r=%t2.bc,bar,pl \
+; RUN: -r=%t2.bc,clobber,pl \
 ; RUN: -r=%t2.bc,odd,pl \
 ; RUN: -r=%t2.bc,even,pl
 ; RUN: llvm-dis %t.1.4.opt.bc -o - | FileCheck %s --check-prefix=CSUSE

 ; CSUSE: {{![0-9]+}} = !{i32 1, !"ProfileSummary", {{![0-9]+}}}
 ; CSUSE: {{![0-9]+}} = !{i32 1, !"CSProfileSummary", {{![0-9]+}}}
 ; CSUSE-DAG: {{![0-9]+}} = !{!"branch_weights", i32 100000, i32 0}
 ; CSUSE-DAG: {{![0-9]+}} = !{!"branch_weights", i32 0, i32 100000}
[64 lines not shown]

llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-sinking-required-for-vectorization.ll

[134 lines not shown; context: for.end: ; preds = %for.cond.cleanup]
   ret void
 }

 ; Test that requires sinking/hoisting of instructions for vectorization.

 define void @loop2(float* %A, float* %B, i32* %C, float %x) {
 ; CHECK-LABEL: @loop2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: br label [[LOOP_BODY:%.*]]
+; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr float, float* [[B:%.*]], i64 10000
+; CHECK-NEXT: [[SCEVGEP6:%.*]] = getelementptr i32, i32* [[C:%.*]], i64 10000
+; CHECK-NEXT: [[SCEVGEP9:%.*]] = getelementptr float, float* [[A:%.*]], i64 10000
+; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[SCEVGEP6]] to float*
+; CHECK-NEXT: [[BOUND0:%.*]] = icmp ugt float* [[TMP0]], [[B]]
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[SCEVGEP]] to i32*
+; CHECK-NEXT: [[BOUND1:%.*]] = icmp ugt i32* [[TMP1]], [[C]]
+; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT: [[BOUND011:%.*]] = icmp ugt float* [[SCEVGEP9]], [[B]]
+; CHECK-NEXT: [[BOUND112:%.*]] = icmp ugt float* [[SCEVGEP]], [[A]]
+; CHECK-NEXT: [[FOUND_CONFLICT13:%.*]] = and i1 [[BOUND011]], [[BOUND112]]
+; CHECK-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[FOUND_CONFLICT]], [[FOUND_CONFLICT13]]
+; CHECK-NEXT: br i1 [[CONFLICT_RDX]], label [[LOOP_BODY:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[X:%.*]], i32 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x float> [[BROADCAST_SPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: [[DOT0:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 0
+; CHECK-NEXT: [[DOT017:%.*]] = getelementptr inbounds float, float* [[A]], i64 0
+; CHECK-NEXT: [[DOT018:%.*]] = getelementptr inbounds float, float* [[B]], i64 0
+; CHECK-NEXT: [[INDEX_NEXT_0:%.*]] = add i64 0, 4
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX_NEXT_PHI:%.*]] = phi i64 [ [[INDEX_NEXT_0]], [[VECTOR_PH]] ], [ [[INDEX_NEXT_1:%.*]], [[VECTOR_BODY_VECTOR_BODY_CRIT_EDGE:%.*]] ]
+; CHECK-NEXT: [[DOTPHI:%.*]] = phi float* [ [[DOT018]], [[VECTOR_PH]] ], [ [[DOT120:%.*]], [[VECTOR_BODY_VECTOR_BODY_CRIT_EDGE]] ]
+; CHECK-NEXT: [[DOTPHI21:%.*]] = phi float* [ [[DOT017]], [[VECTOR_PH]] ], [ [[DOT119:%.*]], [[VECTOR_BODY_VECTOR_BODY_CRIT_EDGE]] ]
+; CHECK-NEXT: [[DOTPHI22:%.*]] = phi i32* [ [[DOT0]], [[VECTOR_PH]] ], [ [[DOT1:%.*]], [[VECTOR_BODY_VECTOR_BODY_CRIT_EDGE]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[DOTPHI22]] to <4 x i32>*
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4, !alias.scope !8
+; CHECK-NEXT: [[TMP3:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD]], <i32 20, i32 20, i32 20, i32 20>
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast float* [[DOTPHI21]] to <4 x float>*
+; CHECK-NEXT: [[WIDE_LOAD14:%.*]] = load <4 x float>, <4 x float>* [[TMP4]], align 4, !alias.scope !11
+; CHECK-NEXT: [[TMP5:%.*]] = fmul <4 x float> [[WIDE_LOAD14]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[DOTPHI]] to <4 x float>*
+; CHECK-NEXT: [[WIDE_LOAD15:%.*]] = load <4 x float>, <4 x float>* [[TMP6]], align 4, !alias.scope !13, !noalias !15
+; CHECK-NEXT: [[TMP7:%.*]] = fadd <4 x float> [[TMP5]], [[WIDE_LOAD15]]
+; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP3]], <4 x float> [[TMP5]], <4 x float> [[TMP7]]
+; CHECK-NEXT: [[TMP8:%.*]] = bitcast float* [[DOTPHI]] to <4 x float>*
+; CHECK-NEXT: store <4 x float> [[PREDPHI]], <4 x float>* [[TMP8]], align 4, !alias.scope !13, !noalias !15
+; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT_PHI]], 10000
+; CHECK-NEXT: br i1 [[TMP9]], label [[EXIT:%.*]], label [[VECTOR_BODY_VECTOR_BODY_CRIT_EDGE]], !llvm.loop [[LOOP16:![0-9]+]]
+; CHECK: vector.body.vector.body_crit_edge:
+; CHECK-NEXT: [[DOT1]] = getelementptr inbounds i32, i32* [[C]], i64 [[INDEX_NEXT_PHI]]
+; CHECK-NEXT: [[DOT119]] = getelementptr inbounds float, float* [[A]], i64 [[INDEX_NEXT_PHI]]
+; CHECK-NEXT: [[DOT120]] = getelementptr inbounds float, float* [[B]], i64 [[INDEX_NEXT_PHI]]
+; CHECK-NEXT: [[INDEX_NEXT_1]] = add i64 [[INDEX_NEXT_PHI]], 4
+; CHECK-NEXT: br label [[VECTOR_BODY]]
 ; CHECK: loop.body:
-; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP_LATCH:%.*]] ]
-; CHECK-NEXT: [[C_GEP:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 [[IV1]]
+; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[LOOP_LATCH:%.*]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[C_GEP:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[IV1]]
 ; CHECK-NEXT: [[C_LV:%.*]] = load i32, i32* [[C_GEP]], align 4
 ; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[C_LV]], 20
-; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[IV1]]
+; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds float, float* [[A]], i64 [[IV1]]
 ; CHECK-NEXT: [[A_LV_0:%.*]] = load float, float* [[A_GEP_0]], align 4
-; CHECK-NEXT: [[MUL2_I81_I:%.*]] = fmul float [[A_LV_0]], [[X:%.*]]
-; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds float, float* [[B:%.*]], i64 [[IV1]]
+; CHECK-NEXT: [[MUL2_I81_I:%.*]] = fmul float [[A_LV_0]], [[X]]
+; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds float, float* [[B]], i64 [[IV1]]
 ; CHECK-NEXT: br i1 [[CMP]], label [[LOOP_LATCH]], label [[ELSE:%.*]]
 ; CHECK: else:
 ; CHECK-NEXT: [[B_LV:%.*]] = load float, float* [[B_GEP_0]], align 4
 ; CHECK-NEXT: [[ADD:%.*]] = fadd float [[MUL2_I81_I]], [[B_LV]]
 ; CHECK-NEXT: br label [[LOOP_LATCH]]
 ; CHECK: loop.latch:
 ; CHECK-NEXT: [[ADD_SINK:%.*]] = phi float [ [[ADD]], [[ELSE]] ], [ [[MUL2_I81_I]], [[LOOP_BODY]] ]
 ; CHECK-NEXT: store float [[ADD_SINK]], float* [[B_GEP_0]], align 4
 ; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT: [[CMP_0:%.*]] = icmp ult i64 [[IV1]], 9999
-; CHECK-NEXT: br i1 [[CMP_0]], label [[LOOP_BODY]], label [[EXIT:%.*]]
+; CHECK-NEXT: br i1 [[CMP_0]], label [[LOOP_BODY]], label [[EXIT]], !llvm.loop [[LOOP17:![0-9]+]]
 ; CHECK: exit:
 ; CHECK-NEXT: ret void
 ;
 entry:
   br label %loop.header

 loop.header:
   %iv = phi i64 [ %iv.next, %loop.latch ], [ 0, %entry ]
[38 lines not shown]