This is an archive of the discontinued LLVM Phabricator instance.

[LPM] Remove the last worrisome split in the primary loop pass pipeline, allowing LICM and friends to always run over the outer loop after unrolling has a chance to remove the inner loop.
AcceptedPublic

Authored by chandlerc on Feb 19 2016, 2:51 AM.

Details

Summary

This fully fixes PR24804 and should return some serious sanity to the
loop pass pipeline. I've added comments to try to prevent folks from
accidentially breaking this and there is a test to catch changes here as
well now.

However, this is a very substantial change. We now rely on
loop-simplifycfg and loop-instsimplify to do the mid-loop-pass cleanup.
These actually work with the loop pass pipeline but are no where near as
powerful as simplifycfg and instcombine. I've added a direct run of
those two immediately after the loop pipeline to try to make sure we
adequately clean up any cruft produced, but this isn't the same as
running them in the middle.

Overall, I *strongly* suspect this is a net win. It is the model we
want. If there are regressions, the correct fix will just be to enhance
loop-simplifycfg and loop-instsimplify (potentially making
a loop-instcombine variant if needed) until they catch the necessary
cases.

However, I don't actually have a good loop-heavy benchmark suite. So I'd
really appreciate it if folks who do could benchmark this change and see
what happens. If there are serious regressions, I can even take a pretty
naive stab at enhancing the two passes to be more powerful. But I'd like
to see if this is actually already enough to handle the real cases we
have today.

There is one test I had to update because it used -O1 and after this
change we actually avoid forming memset in the awkward position that it
wanted to test for (instead we will pretty much completely nuke all the
code, so I'm happier with the end result personally). But to keep the
test relevant, I've taken the particualr memset pattern formed
previously and directly put it into the test case to make sure we don't
somehow perturb it with GMR.

Diff Detail

Event Timeline

chandlerc updated this revision to Diff 48470.Feb 19 2016, 2:51 AM
chandlerc retitled this revision from to [LPM] Remove the last worrisome split in the primary loop pass pipeline, allowing LICM and friends to always run over the outer loop after unrolling has a chance to remove the inner loop..
chandlerc updated this object.
chandlerc added a subscriber: llvm-commits.
jmolloy accepted this revision.Feb 19 2016, 2:55 AM
jmolloy edited edge metadata.

LGTM! One loop pass pipeline is a lot better than 7!

This revision is now accepted and ready to land.Feb 19 2016, 2:55 AM
mehdi_amini added inline comments.Feb 19 2016, 8:40 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
248

I'd add a reference to the PR between parentheses at the end, because "bad things can happen" is typically the kind of comments that will stay there forever with no-one remembering what it refers to.

mzolotukhin edited edge metadata.Feb 19 2016, 10:03 AM

Hm, when I tried something like this I got a number of failures/huge regressions because LoopSimplifyCFG currently is much weaker than SimplifyCFG, so all later passes were basically crippled. Have you seen any of these?

Michael

escha added a subscriber: escha.Feb 19 2016, 10:08 AM

@mzolotukhin I'm curious which sort of CFG simplifications *need* to be done within the loop pass pipeline (other than the extremely trivial one I added). I would hope Chandler's adding of regular SimplifyCFG afterwards would be enough in the short-term, but cases where it's not would in and of themselves be rather interesting.

It does feel like ideally we just want a LoopSimplifyCFG that does everything SimplifyCFG does, though, unless there's a good reason that would be less efficient.

@escha SimplifyCFG is actually pretty complicated pass (it's 5K LOC), so it does a lot of stuff. The problem is that some of the transformations will not preserve LCSSA and might not be located in the current loop, so they can't be applied as-is. In my experiments we were missing, like, almost everything SimplifyCFG does - e.g. the first simplification ConstFoldTerminator.

Yeah, I understand that it's way more complex (and breaks LCSSA); I was more curious about whether there were cases where:

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

in other words, cases where having the primary SimplifyCFG after "loop passes B" caused us to miss things in "loop passes B" that we care about.

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

Yeah, in my experiments we did have a lot of cases like you described. Basically, LoopSimplifyCFG in the first option wasn't able to cleanup the code enough for loop passes B to be effective. I also tried adding loop deletion/etc. but with no luck either.
That said, I didn't spend much time on it, so I don't have deep details. I think it's worth investigating further, especially since this is an active are now.

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

Yeah, in my experiments we did have a lot of cases like you described. Basically, LoopSimplifyCFG in the first option wasn't able to cleanup the code enough for loop passes B to be effective. I also tried adding loop deletion/etc. but with no luck either.
That said, I didn't spend much time on it, so I don't have deep details. I think it's worth investigating further, especially since this is an active are now.

The question is, what should we do *now*.

Note that today, the simplify-cfg run comes before the thing that I think needs the most cleanup: unroll. So this change won't regress *any* cleanup taking place prior to loop unrolling. So if "unroll" is the "loop passes A" in the above example, it already doesn't work, and this patch doesn't make it worse.

Today, loop unrolling definitely creates invariant instructions that we then fail to hoist with LICM. This patch will fix that. Similarly, this patch will fix cases where we need to re-rotate an outer loop after unrolling an inner loop in order to unroll the outer loop.

So that's why I suspect the best path is to make this change, and then incrementally improve the loop simplification passes until they are sufficient to handle the other cases.

Thoughts?

Just to make it clear: I absolutely support any steps in this direction, and I'm just concerned that if we do it right now we might be hit by huge regressions. If testing proves me wrong, I'm totally happy with it:)

As for the current situation - we might need a clean-up after loop-unswitch and loop-rotate. Also, the issue is that if any of this transformations fails due to lack of cleanup, it'll create a chain reaction: others will likely also fail.

So, I think we need to test this change, and if there are major problems, try to address them first (probably, by developing LoopSimplifyCFG).

LoopPassManager [ loop passes A -> LoopSimplifyCFG -> loop passes B ]
SimplifyCFG

does worse than

LoopPassManager [ loop passes A ]
SimplifyCFG
LoopPassManager [ loop passes B ]

Yeah, in my experiments we did have a lot of cases like you described. Basically, LoopSimplifyCFG in the first option wasn't able to cleanup the code enough for loop passes B to be effective. I also tried adding loop deletion/etc. but with no luck either.
That said, I didn't spend much time on it, so I don't have deep details. I think it's worth investigating further, especially since this is an active are now.

The question is, what should we do *now*.

Note that today, the simplify-cfg run comes before the thing that I think needs the most cleanup: unroll. So this change won't regress *any* cleanup taking place prior to loop unrolling. So if "unroll" is the "loop passes A" in the above example, it already doesn't work, and this patch doesn't make it worse.

Today, loop unrolling definitely creates invariant instructions that we then fail to hoist with LICM. This patch will fix that. Similarly, this patch will fix cases where we need to re-rotate an outer loop after unrolling an inner loop in order to unroll the outer loop.

So that's why I suspect the best path is to make this change, and then incrementally improve the loop simplification passes until they are sufficient to handle the other cases.

Thoughts?

Just to make it clear: I absolutely support any steps in this direction, and I'm just concerned that if we do it right now we might be hit by huge regressions. If testing proves me wrong, I'm totally happy with it:)

As for the current situation - we might need a clean-up after loop-unswitch and loop-rotate. Also, the issue is that if any of this transformations fails due to lack of cleanup, it'll create a chain reaction: others will likely also fail.

So, I think we need to test this change, and if there are major problems, try to address them first (probably, by developing LoopSimplifyCFG).

Cool, could you help with benchmarking here? As I said, I don't think I have any interesting benchmarks for this kind of loop heavy code to really have confidence in anything. =/ I mean, I can run the test suite, but I'm really not sure how much data that really gives anyone.

I believe I ran into issues with just llvm-testsuite, but yeah, I can test it + SPECs.

Michael

The testing just finished, i got a huge number of stability failures though. They're caused by two assertions:

  1. Assertion failed: (hasDedicatedExits() && "getUniqueExitBlocks assumes the loop has canonical form exits!"), function getUniqueExitBlocks, file /Users/buildslave/devel/llvm.git/lib/Analysis/LoopInfo.cpp, line 333.

Reproducer:

; RUN: opt -loop-instsimplify < %s
define void @main() {
entry:
  br label %L1
L1:
  br label %L2
L2:
  indirectbr i8* undef, [label %L1, label %L2]
}

The issue here is probably that we're not cleaning up enough.

  1. Assertion failed: (InnerAST && "Where is my AST?"), function collectAliasInfoFromSubLoops, file /Users/buildslave/devel/llvm.git/lib/Transforms/Scalar/LICM.cpp, line 1048.

My bugpoint-fu failed me here, so I don't have a small testcase. But if you run testsuite, you should definitely see it.

There are several gains and regressions in both compile and execution times, but I suggest remeasuring it when the assertion failures are fixed.

haicheng added a comment.EditedFeb 19 2016, 9:01 PM
  1. Assertion failed: (InnerAST && "Where is my AST?"), function collectAliasInfoFromSubLoops, file /Users/buildslave/devel/llvm.git/lib/Transforms/Scalar/LICM.cpp, line 1048.

I ran into this issue before. Once LICM is done processing the inner loop, it assumes no other pass can modify it before LICM starts processing the outer loop. This assumption is true when LICM is the only one in the loop pipe. However, changing the loop pass manager can break the assumption.

Before this patch, I have a work-around fix in D17370 which uses the original loop pass manager. It is supposed to work with this patch at the cost of longer compilation time.

Adding Zino as we were talking about this over lunch.

sanjoy resigned from this revision.Jan 29 2022, 5:25 PM