This is an archive of the discontinued LLVM Phabricator instance.

Differential D17442

[LPM] Remove the last worrisome split in the primary loop pass pipeline, allowing LICM and friends to always run over the outer loop after unrolling has a chance to remove the inner loop.
Accepted, Public

Authored by chandlerc on Feb 19 2016, 2:51 AM.

Details

Reviewers

zinob
mzolotukhin
jmolloy
hfinkel
sanjoy

Summary

This fully fixes PR24804 and should return some serious sanity to the
loop pass pipeline. I've added comments to try to prevent folks from
accidentally breaking this, and there is now a test to catch changes
here as well.

However, this is a very substantial change. We now rely on
loop-simplifycfg and loop-instsimplify to do the mid-loop-pass cleanup.
These actually work with the loop pass pipeline but are nowhere near as
powerful as simplifycfg and instcombine. I've added a direct run of
those two immediately after the loop pipeline to try to make sure we
adequately clean up any cruft produced, but this isn't the same as
running them in the middle.
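
To make the "split" concrete, here is a minimal, self-contained C++ sketch (not LLVM code). It assumes, as a simplification of the legacy pass manager, that each maximal run of consecutive loop passes shares one loop pass manager and that any function pass ends the run. The names `Pass`, `Kind`, `countLoopPassManagers`, `oldSchedule`, and `newSchedule` are hypothetical; the pass names are only labels for the passes touched by this diff, and the optional loop-interchange branch and extension points are ignored.

```cpp
#include <string>
#include <vector>

// Toy model: a pass is either a loop pass or a function pass.
enum class Kind { Loop, Function };

struct Pass {
  std::string Name;
  Kind K;
};

// Count how many separate loop pass managers a schedule needs, assuming
// that consecutive loop passes share one manager and that any function
// pass closes the current manager (a simplifying assumption).
int countLoopPassManagers(const std::vector<Pass> &Schedule) {
  int Managers = 0;
  bool InLoopRun = false;
  for (const Pass &P : Schedule) {
    if (P.K == Kind::Loop) {
      if (!InLoopRun)
        ++Managers; // a new run of loop passes starts here
      InLoopRun = true;
    } else {
      InLoopRun = false; // a function pass ends the run
    }
  }
  return Managers;
}

// Schedule before this change: two function passes sit in the middle.
std::vector<Pass> oldSchedule() {
  return {{"loop-rotate", Kind::Loop},     {"licm", Kind::Loop},
          {"loop-unswitch", Kind::Loop},   {"simplifycfg", Kind::Function},
          {"instcombine", Kind::Function}, {"indvars", Kind::Loop},
          {"loop-idiom", Kind::Loop},      {"loop-deletion", Kind::Loop},
          {"loop-unroll", Kind::Loop}};
}

// Schedule after this change: loop-level cleanup passes in the middle,
// with the function-level cleanup moved after the loop pipeline.
std::vector<Pass> newSchedule() {
  return {{"loop-rotate", Kind::Loop},       {"licm", Kind::Loop},
          {"loop-unswitch", Kind::Loop},     {"loop-simplifycfg", Kind::Loop},
          {"loop-instsimplify", Kind::Loop}, {"indvars", Kind::Loop},
          {"loop-idiom", Kind::Loop},        {"loop-deletion", Kind::Loop},
          {"loop-unroll", Kind::Loop},       {"simplifycfg", Kind::Function},
          {"instcombine", Kind::Function}};
}
```

Under these assumptions, `countLoopPassManagers(oldSchedule())` is 2 and `countLoopPassManagers(newSchedule())` is 1 for the portion of the pipeline shown in the diff.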

Overall, I *strongly* suspect this is a net win. It is the model we
want. If there are regressions, the correct fix will just be to enhance
loop-simplifycfg and loop-instsimplify (potentially making
a loop-instcombine variant if needed) until they catch the necessary
cases.

However, I don't actually have a good loop-heavy benchmark suite. So I'd
really appreciate it if folks who do could benchmark this change and see
what happens. If there are serious regressions, I can even take a pretty
naive stab at enhancing the two passes to be more powerful. But I'd like
to see if this is actually already enough to handle the real cases we
have today.

There is one test I had to update because it used -O1 and after this
change we actually avoid forming memset in the awkward position that it
wanted to test for (instead we will pretty much completely nuke all the
code, so I'm happier with the end result personally). But to keep the
test relevant, I've taken the particular memset pattern formed
previously and directly put it into the test case to make sure we don't
somehow perturb it with GMR.

Diff Detail

Event Timeline

chandlerc updated this revision to Diff 48470. Feb 19 2016, 2:51 AM

chandlerc retitled this revision to [LPM] Remove the last worrisome split in the primary loop pass pipeline, allowing LICM and friends to always run over the outer loop after unrolling has a chance to remove the inner loop.

chandlerc updated this object.

chandlerc added reviewers: hfinkel, jmolloy, mzolotukhin, sanjoy.

chandlerc added a subscriber: llvm-commits.

Herald added subscribers: mcrosier, mehdi_amini. Feb 19 2016, 2:51 AM

LGTM! One loop pass pipeline is a lot better than 7!

This revision is now accepted and ready to land. Feb 19 2016, 2:55 AM

mehdi_amini added inline comments. Feb 19 2016, 8:40 AM

lib/Transforms/IPO/PassManagerBuilder.cpp
Line 248: I'd add a reference to the PR between parentheses at the end, because "bad things can happen" is typically the kind of comment that will stay there forever with no one remembering what it refers to.

Hm, when I tried something like this I got a number of failures/huge regressions because LoopSimplifyCFG currently is much weaker than SimplifyCFG, so all later passes were basically crippled. Have you seen any of these?

Michael

@mzolotukhin I'm curious which sort of CFG simplifications *need* to be done within the loop pass pipeline (other than the extremely trivial one I added). I would hope Chandler's adding of regular SimplifyCFG afterwards would be enough in the short-term, but cases where it's not would in and of themselves be rather interesting.

It does feel like ideally we just want a LoopSimplifyCFG that does everything SimplifyCFG does, though, unless there's a good reason that would be less efficient.

@escha SimplifyCFG is actually a pretty complicated pass (it's 5K LOC), so it does a lot of stuff. The problem is that some of the transformations will not preserve LCSSA and might not be located in the current loop, so they can't be applied as-is. In my experiments we were missing, like, almost everything SimplifyCFG does, e.g. the first simplification, ConstFoldTerminator.

Yeah, I understand that it's way more complex (and breaks LCSSA); I was more curious about whether there were cases where:

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

in other words, cases where having the primary SimplifyCFG after "loop passes B" caused us to miss things in "loop passes B" that we care about.

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

Yeah, in my experiments we did have a lot of cases like you described. Basically, LoopSimplifyCFG in the first option wasn't able to clean up the code enough for loop passes B to be effective. I also tried adding loop deletion/etc. but with no luck either.
That said, I didn't spend much time on it, so I don't have deep details. I think it's worth investigating further, especially since this is an active area now.

In D17442#357429, @mzolotukhin wrote:

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

Yeah, in my experiments we did have a lot of cases like you described. Basically, LoopSimplifyCFG in the first option wasn't able to clean up the code enough for loop passes B to be effective. I also tried adding loop deletion/etc. but with no luck either.
That said, I didn't spend much time on it, so I don't have deep details. I think it's worth investigating further, especially since this is an active area now.

The question is, what should we do *now*.

Note that today, the simplify-cfg run comes before the thing that I think needs the most cleanup: unroll. So this change won't regress *any* cleanup taking place prior to loop unrolling. So if "unroll" is the "loop passes A" in the above example, it already doesn't work, and this patch doesn't make it worse.

Today, loop unrolling definitely creates invariant instructions that we then fail to hoist with LICM. This patch will fix that. Similarly, this patch will fix cases where we need to re-rotate an outer loop after unrolling an inner loop in order to unroll the outer loop.

So that's why I suspect the best path is to make this change, and then incrementally improve the loop simplification passes until they are sufficient to handle the other cases.

Thoughts?
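
The ordering argument above (LICM on the outer loop should run after the inner loop has had a chance to be unrolled) can be sketched with a small trace simulation. This is a toy model, not LLVM code: it assumes a loop pass manager visits loops innermost first and runs every pass in its group on one loop before moving to the next. The names `runLoopPassManager`, `splitPipeline`, `singlePipeline`, and `outerLICMSeesUnrolledInner` are hypothetical, and the two-loop nest labelled "inner"/"outer" is an illustrative assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Trace = std::vector<std::string>;

// One loop pass manager: for each loop (innermost first), run every pass
// in the group before moving on to the next loop.
void runLoopPassManager(const std::vector<std::string> &Passes,
                        const std::vector<std::string> &LoopsInnerFirst,
                        Trace &Out) {
  for (const std::string &L : LoopsInnerFirst)
    for (const std::string &P : Passes)
      Out.push_back(P + "@" + L);
}

// Index of an event in the trace, or -1 if it never happened.
int indexOf(const Trace &T, const std::string &Event) {
  for (std::size_t I = 0; I < T.size(); ++I)
    if (T[I] == Event)
      return static_cast<int>(I);
  return -1;
}

// Old shape: function passes in the middle split the loop passes into
// two managers, each of which walks all loops on its own.
Trace splitPipeline() {
  const std::vector<std::string> Loops = {"inner", "outer"};
  Trace T;
  runLoopPassManager({"loop-rotate", "licm", "loop-unswitch"}, Loops, T);
  T.push_back("simplifycfg@function");
  T.push_back("instcombine@function");
  runLoopPassManager({"indvars", "loop-idiom", "loop-deletion", "loop-unroll"},
                     Loops, T);
  return T;
}

// New shape: one manager runs the whole loop pipeline per loop.
Trace singlePipeline() {
  const std::vector<std::string> Loops = {"inner", "outer"};
  Trace T;
  runLoopPassManager({"loop-rotate", "licm", "loop-unswitch",
                      "loop-simplifycfg", "loop-instsimplify", "indvars",
                      "loop-idiom", "loop-deletion", "loop-unroll"},
                     Loops, T);
  T.push_back("simplifycfg@function");
  T.push_back("instcombine@function");
  return T;
}

// True if LICM visits the outer loop only after unrolling visited the inner.
bool outerLICMSeesUnrolledInner(const Trace &T) {
  int Unroll = indexOf(T, "loop-unroll@inner");
  int LICM = indexOf(T, "licm@outer");
  return Unroll >= 0 && LICM > Unroll;
}
```

In this model `outerLICMSeesUnrolledInner(splitPipeline())` is false and `outerLICMSeesUnrolledInner(singlePipeline())` is true, which is the phase-ordering difference being argued for. It says nothing about how strong the mid-pipeline cleanup is, which is the concern raised in the replies.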

Just to make it clear: I absolutely support any steps in this direction, and I'm just concerned that if we do it right now we might be hit by huge regressions. If testing proves me wrong, I'm totally happy with it :)

As for the current situation, we might need a clean-up after loop-unswitch and loop-rotate. Also, the issue is that if any of these transformations fails due to lack of cleanup, it'll create a chain reaction: others will likely also fail.

So, I think we need to test this change, and if there are major problems, try to address them first (probably by developing LoopSimplifyCFG).

In D17442#357515, @mzolotukhin wrote:

Just to make it clear: I absolutely support any steps in this direction, and I'm just concerned that if we do it right now we might be hit by huge regressions. If testing proves me wrong, I'm totally happy with it :)

As for the current situation, we might need a clean-up after loop-unswitch and loop-rotate. Also, the issue is that if any of these transformations fails due to lack of cleanup, it'll create a chain reaction: others will likely also fail.

So, I think we need to test this change, and if there are major problems, try to address them first (probably by developing LoopSimplifyCFG).

Cool, could you help with benchmarking here? As I said, I don't think I have any interesting benchmarks for this kind of loop-heavy code to really have confidence in anything. =/ I mean, I can run the test suite, but I'm really not sure how much data that gives anyone.

I believe I ran into issues with just llvm-testsuite, but yeah, I can test it + SPECs.

Michael

The testing just finished; I got a huge number of stability failures, though. They're caused by two assertions:

Assertion failed: (hasDedicatedExits() && "getUniqueExitBlocks assumes the loop has canonical form exits!"), function getUniqueExitBlocks, file /Users/buildslave/devel/llvm.git/lib/Analysis/LoopInfo.cpp, line 333.

Reproducer:

; RUN: opt -loop-instsimplify < %s
define void @main() {
entry:
  br label %L1
L1:
  br label %L2
L2:
  indirectbr i8* undef, [label %L1, label %L2]
}

The issue here is probably that we're not cleaning up enough.

Assertion failed: (InnerAST && "Where is my AST?"), function collectAliasInfoFromSubLoops, file /Users/buildslave/devel/llvm.git/lib/Transforms/Scalar/LICM.cpp, line 1048.

My bugpoint-fu failed me here, so I don't have a small testcase. But if you run the test suite, you should definitely see it.

There are several gains and regressions in both compile and execution times, but I suggest remeasuring it when the assertion failures are fixed.

haicheng added a subscriber: haicheng. Feb 19 2016, 7:04 PM

In D17442#357662, @mzolotukhin wrote:

Assertion failed: (InnerAST && "Where is my AST?"), function collectAliasInfoFromSubLoops, file /Users/buildslave/devel/llvm.git/lib/Transforms/Scalar/LICM.cpp, line 1048.

I ran into this issue before. Once LICM is done processing the inner loop, it assumes no other pass can modify it before LICM starts processing the outer loop. This assumption is true when LICM is the only one in the loop pipe. However, changing the loop pass manager can break the assumption.

Before this patch, I have a work-around fix in D17370 which uses the original loop pass manager. It is supposed to work with this patch at the cost of longer compilation time.
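
The cached-state assumption described above can be illustrated with a toy model. This is only a sketch of one way such an assumption might break, not the actual LICM implementation: `ToyLoop`, `ToyLICM`, `finishLoop`, and `canReuseSubLoopSummaries` are hypothetical names, the cached payload is a placeholder, and the "intervening pass" is modelled simply as a new sub-loop appearing that LICM never visited.

```cpp
#include <map>
#include <vector>

// Hypothetical stand-ins; these are not LLVM types.
struct ToyLoop {
  int Id;
  std::vector<int> SubLoopIds;
};

struct ToyLICM {
  // Per-loop summary cached when the pass finishes a loop (placeholder value).
  std::map<int, int> CachedSummary;

  void finishLoop(const ToyLoop &L) { CachedSummary[L.Id] = 1; }

  // When visiting an outer loop, every sub-loop is expected to have a cached
  // summary. Returns false in the situation where a real pass might assert.
  bool canReuseSubLoopSummaries(const ToyLoop &Outer) const {
    for (int Sub : Outer.SubLoopIds)
      if (CachedSummary.find(Sub) == CachedSummary.end())
        return false;
    return true;
  }
};

// Scenario A: nothing else touches the loop nest between the inner and
// outer visits, so the cache matches the loop structure.
bool licmAlone() {
  ToyLoop Inner{1, {}};
  ToyLoop Outer{0, {1}};
  ToyLICM LICM;
  LICM.finishLoop(Inner);
  return LICM.canReuseSubLoopSummaries(Outer);
}

// Scenario B: another loop pass changes the nest after the inner visit
// (modelled as a new sub-loop with id 2), so the cache is stale.
bool licmWithInterveningPass() {
  ToyLoop Inner{1, {}};
  ToyLoop Outer{0, {1}};
  ToyLICM LICM;
  LICM.finishLoop(Inner);
  Outer.SubLoopIds.push_back(2); // hypothetical transformation by another pass
  return LICM.canReuseSubLoopSummaries(Outer);
}
```

Here `licmAlone()` returns true and `licmWithInterveningPass()` returns false; the second case corresponds loosely to the point where a lookup of cached per-loop state would come back empty.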

Adding Zino as we were talking about this over lunch.

sanjoy resigned from this revision. Jan 29 2022, 5:25 PM

Herald added a subscriber: ormris. Jan 29 2022, 5:25 PM

Revision Contents

Path	Size
lib/Transforms/IPO/PassManagerBuilder.cpp	16 lines
test/Analysis/GlobalsModRef/memset-escape.ll	46 lines
test/Other/pass-pipelines.ll	10 lines

Diff 48470

lib/Transforms/IPO/PassManagerBuilder.cpp

void PassManagerBuilder::addFunctionSimplificationPasses(
    MPM.add(createTailCallEliminationPass()); // Eliminate tail calls
    MPM.add(createCFGSimplificationPass()); // Merge & remove BBs
    MPM.add(createReassociatePass()); // Reassociate expressions
    if (PrepareForThinLTO) {
      MPM.add(createAggressiveDCEPass()); // Delete dead instructions
      MPM.add(createInstructionCombiningPass()); // Combine silly seq's
      return;
    }

+   // This starts the main loop pass pipeline. It is critically important to not
+   // introduce non-loop passes into the middle of this. We want to re-run the
+   // entire pipeline on outer loops after the pipeline simplifies inner loops.
+   // Without this, significant phase ordering problems can develop.
    [Inline comment, mehdi_amini: I'd add a reference to the PR between parentheses at the end, because "bad things can happen" is typically the kind of comments that will stay there forever with no-one remembering what it refers to.]

    // Rotate Loop - disable header duplication at -Oz
    MPM.add(createLoopRotatePass(SizeLevel == 2 ? 0 : -1));
    MPM.add(createLICMPass()); // Hoist loop invariants
    MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));
-   MPM.add(createCFGSimplificationPass());
-   MPM.add(createInstructionCombiningPass());
+   MPM.add(createLoopSimplifyCFGPass());
+   MPM.add(createLoopInstSimplifyPass());
    MPM.add(createIndVarSimplifyPass()); // Canonicalize indvars
    MPM.add(createLoopIdiomPass()); // Recognize idioms like memset.
    MPM.add(createLoopDeletionPass()); // Delete dead loops
    if (EnableLoopInterchange) {
      MPM.add(createLoopInterchangePass()); // Interchange loops
      MPM.add(createCFGSimplificationPass());
    }
    if (!DisableUnrollLoops)
      MPM.add(createSimpleLoopUnrollPass()); // Unroll small loops
    addExtensionsToPM(EP_LoopOptimizerEnd, MPM);

+   // Now we are done with the main loop pass pipeline.
+
+   // Clean up the function body from any stuff produced by the loop passes.
+   MPM.add(createCFGSimplificationPass());
+   MPM.add(createInstructionCombiningPass());

    if (OptLevel > 1) {
      if (EnableMLSM)
        MPM.add(createMergedLoadStoreMotionPass()); // Merge ld/st in diamonds
      MPM.add(createGVNPass(DisableGVNLoadPRE)); // Remove redundancies
    }
    MPM.add(createMemCpyOptPass()); // Remove memcpy / form memset
    MPM.add(createSCCPPass()); // Constant prop with SCCP

test/Analysis/GlobalsModRef/memset-escape.ll

  ; CHECK-LABEL: @main
  ; CHECK: call void @llvm.memset.p0i8.i64{{.*}} @a
  ; CHECK: store i32 3
  ; CHECK: load i32, i32* getelementptr {{.*}} @a
  ; CHECK: icmp eq i32
  ; CHECK: br i1

  define i32 @main() {
- entry:
-   %retval = alloca i32, align 4
-   %c = alloca [1 x i32], align 4
-   store i32 0, i32* %retval, align 4
-   %0 = bitcast [1 x i32]* %c to i8*
-   call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 4, i32 4, i1 false)
-   store i32 1, i32* getelementptr inbounds ([3 x i32], [3 x i32]* @a, i64 0, i64 2), align 4
-   store i32 0, i32* @b, align 4
-   br label %for.cond
-
- for.cond: ; preds = %for.inc, %entry
-   %1 = load i32, i32* @b, align 4
-   %cmp = icmp slt i32 %1, 3
-   br i1 %cmp, label %for.body, label %for.end
-
- for.body: ; preds = %for.cond
-   %2 = load i32, i32* @b, align 4
-   %idxprom = sext i32 %2 to i64
-   %arrayidx = getelementptr inbounds [3 x i32], [3 x i32]* @a, i64 0, i64 %idxprom
-   store i32 0, i32* %arrayidx, align 4
-   br label %for.inc
-
- for.inc: ; preds = %for.body
-   %3 = load i32, i32* @b, align 4
-   %inc = add nsw i32 %3, 1
-   store i32 %inc, i32* @b, align 4
-   br label %for.cond
-
- for.end: ; preds = %for.cond
-   %4 = load i32, i32* getelementptr inbounds ([3 x i32], [3 x i32]* @a, i64 0, i64 2), align 4
-   %cmp1 = icmp ne i32 %4, 0
-   br i1 %cmp1, label %if.then, label %if.end
+ for.end:
+   call void @llvm.memset.p0i8.i64(i8* bitcast ([3 x i32]* @a to i8*), i8 0, i64 12, i32 4, i1 false)
+   store i32 3, i32* @b, align 4
+   %0 = load i32, i32* getelementptr inbounds ([3 x i32], [3 x i32]* @a, i64 0, i64 2), align 4
+   %cmp1 = icmp eq i32 %0, 0
+   br i1 %cmp1, label %if.end, label %if.then

- if.then: ; preds = %for.end
-   call void @abort() #3
+ if.then:
+   call void @abort()
    unreachable

- if.end: ; preds = %for.end
+ if.end:
    ret i32 0
  }

  ; Function Attrs: nounwind argmemonly
  declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) nounwind argmemonly

  ; Function Attrs: noreturn nounwind
  declare void @abort() noreturn nounwind

test/Other/pass-pipelines.ll

  ; CHECK-O2-NEXT: Call Graph SCC Pass Manager
  ; CHECK-O2-NEXT: Remove unused exception handling info
  ; CHECK-O2-NEXT: Function Integration/Inlining
  ; CHECK-O2-NEXT: Deduce function attributes
  ; Next up is the main function pass pipeline. It shouldn't be split up and
  ; should contain the main loop pass pipeline as well.
  ; CHECK-O2-NEXT: FunctionPass Manager
  ; CHECK-O2-NOT: Manager
- ; CHECK-O2: Loop Pass Manager
- ; CHECK-O2-NOT: Manager
- ; FIXME: We shouldn't be pulling out to simplify-cfg and instcombine and
- ; causing new loop pass managers.
- ; CHECK-O2: Simplify the CFG
- ; CHECK-O2-NOT: Manager
- ; CHECK-O2: Combine redundant instructions
- ; CHECK-O2-NOT: Manager
+ ; Now the main loop pass pipeline.
  ; CHECK-O2: Loop Pass Manager
  ; CHECK-O2-NOT: Manager
  ; FIXME: It isn't clear that we need yet another loop pass pipeline
  ; and run of LICM here.
- ; CHECK-O2-NOT: Manager
  ; CHECK-O2: Loop Pass Manager
  ; CHECK-O2-NEXT: Loop Invariant Code Motion
  ; CHECK-O2-NOT: Manager
  ; Next we break out of the main Function passes inside the CGSCC pipeline with
  ; a barrier pass.
  ; CHECK-O2: A No-Op Barrier Pass
  ; CHECK-O2-NOT: Manager
  ; Next is the late function pass pipeline.