
[LSR] Add a cap for reassociation of AllFixupsOutsideLoop type LSRUse to protect compile time
ClosedPublic

Authored by wmi on Feb 24 2017, 1:41 PM.

Details

Summary

In PR32043, we saw a testcase containing an AllFixupsOutsideLoop type LSRUse with a huge SCEVAddExpr. LSRInstance::GenerateReassociations generates lots of new formulae for the LSRUse because of the huge AddExpr, and causes compilation to hang.

Since AllFixupsOutsideLoop type LSRUses are outside of the current loop, reassociation for them should have much less impact compared with that for normal LSRUses. The fix is to add a cap on reassociation when the LSRUse is of AllFixupsOutsideLoop type. I admit this is a workaround; AllFixupsOutsideLoop LSRUses need to be handled in a better way to reduce compile time and improve LSR results.
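A minimal sketch of what such a cap could look like, with the predicate factored out into a hypothetical standalone helper. The real patch checks this inside LSRInstance::GenerateReassociations; the threshold value and all names here are illustrative, not the actual code under review:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical predicate for the cap described above: skip reassociation
// entirely when every fixup of the LSRUse is outside the loop and the add
// expression is large. The threshold of 5 mirrors the value discussed in
// this review but is otherwise an assumption.
bool skipReassociation(bool AllFixupsOutsideLoop, std::size_t NumAddOps) {
  const std::size_t ReassocAddOpsCap = 5;
  return AllFixupsOutsideLoop && NumAddOps > ReassocAddOpsCap;
}
```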

No test, because I am not sure the potentially hanging test from PR32043 is appropriate to add.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Feb 24 2017, 1:41 PM
davide edited edge metadata.Feb 24 2017, 1:55 PM

This is probably OK as a stopgap solution, but I'm generally very wary of adding cutoffs to the code, as they add technical debt, etc.

davide added inline comments.Feb 25 2017, 6:04 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
3429 ↗(On Diff #89714)

How did you choose this cap, BTW?
Maybe we should do some measurements and pick a less arbitrary value?

Also, I think the test from the PR (after some polishing, maybe) is good to add. If somebody removes these lines, at least all (or a subset of) the bots will time out and we'll notice the regression.

wmi added a comment.Feb 27 2017, 9:53 AM

I don't like having a hanging test that is killed after a timeout. I will try to create another testcase by checking the LSR trace.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3429 ↗(On Diff #89714)

On one hand, for AddOps.size() == 5, even without the workaround, the compile time will not blow up too much, so I think the cap can cover the cases suffering from the compile-time issue. On the other hand, not many cases have very large AddOps numbers. I chose 5 because it is not too small a number, so the cap will not kick in frequently. Since the current model handling AllFixupsOutsideLoop is very imprecise, I think it is better to just fix the compile-time issue and not change the existing code.

I found no regressions using some internal benchmarks.

@qcolombet what do you think?

wmi updated this revision to Diff 89938.Feb 27 2017, 2:24 PM

Add a testcase by checking the LSR debug output.

wmi updated this revision to Diff 89939.Feb 27 2017, 2:26 PM

Fixed a typo in the testcase.

sanjoy added a subscriber: sanjoy.Mar 3 2017, 10:45 PM
sanjoy added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
3430 ↗(On Diff #89939)

I'm not too familiar with LSR, but this looks pretty ad-hoc -- why can't the same compile-time blowup happen for an LSRUse with AllFixupsOutsideLoop = false?

Wei, any progress on this? We should either commit a workaround or revert the patchset to unblock folks.

atrick edited edge metadata.Mar 8 2017, 1:48 PM

LSR is fundamentally combinatorial. It relies on arbitrary pruning. I don't like it, but that's the way it is, and we need to guard against pathological cases. I just wonder why you completely bypass the reassociation code instead of simply limiting the number of expressions that participate.

wmi added a comment.Mar 8 2017, 2:43 PM

Andy,

Thanks for the comment.

I just wonder why you completely bypass the reassociation code instead of simply limiting the number of expressions that participate.

Because I feel it brings less uncertainty to the LSR result. If I limit the number of expressions that participate, then suppose there is a candidate formula that matters for the final LSR result: whether it will be used depends on when it is seen, which brings some uncertainty to the LSR result.

Another reason is that, for an LSRUse with AllFixupsOutsideLoop = true, I don't feel reassociation of its formulae matters much for performance, since the uses are all outside of the loop. From my understanding, an LSRUse with AllFixupsOutsideLoop = true shouldn't have the same weight as other LSRUses with fixups inside the loop. It is OK for an LSRFixup user outside of the loop to use many registers. That is why I simply chose to bypass the reassociation. If it were an LSRUse with AllFixupsOutsideLoop = false, I feel I would need to be more careful.

Thanks,
Wei.

qcolombet edited edge metadata.Mar 24 2017, 11:58 PM

Hi Wei,

Sorry for the delay; I thought Andy's comment was clear enough for you to explore other directions, so I didn't pay attention to the review.

Anyhow, at this point you have probably guessed it: I agree with Andy. I think this patch is a workaround for a problem we don't yet understand.

I wanted to provide a patch to demonstrate Andy's point, and what I found when playing around is interesting. (Patch attached nonetheless.)

TL;DR: I believe the problem is in the SCEVExpander code and not in LSR per se. I would suggest recreating a profile using opt -loop-reduce and looking at where the time is spent.

Basically, I wanted to limit the number of reassociations per recursive call to 5. The max depth of the call is 3. The call happens for each base register (typically fewer than 3) of all formulae (typically a couple) for each LSRUse.
So that gives us a total of 5 * 3 * 3 * 2 = 90 reassociation formulae per LSRUse, and given that there are typically 4 uses per loop, the grand total is less than 400, which does not sound too bad to explore.
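The back-of-the-envelope count above, written out as arithmetic. Every factor is an assumption stated in the comment (proposed limits and "typical" counts), not a measured value:

```cpp
#include <cassert>

// Quentin's estimate of the reassociation search-space size.
constexpr int ReassocPerCall = 5;  // proposed limit per recursive call
constexpr int MaxDepth = 3;        // max depth of the recursive call
constexpr int BaseRegs = 3;        // base registers per formula (typical)
constexpr int FormulaePerUse = 2;  // formulae per LSRUse (typical)
constexpr int FormulaePerLSRUse =
    ReassocPerCall * MaxDepth * BaseRegs * FormulaePerUse; // 90
constexpr int UsesPerLoop = 4;     // typical uses per loop
constexpr int GrandTotal = FormulaePerLSRUse * UsesPerLoop; // 360 < 400
```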

However, with such a limit, the compile time on the example from PR32043 was already blowing up (I killed it after 4 min).

So I tried with a limit of 2 reassociations per call, i.e., about 36 formulae per LSRUse (to be exact, the example generates 25 formulae), and opt still didn't finish after 4 min.

In the end, the only way to get a reasonable compile time was to not allow any reassociation. I started to suspect that the problem is not the size of the search space, and after doing a quick profile, I found that at least 80% of the compile time is spent materializing the solution in SCEVExpander. The problem seems to be related to the nesting level of the expression of the formula that the solver picks.
Apply my patch and compare the output of lsr-num-reassoc=1 and lsr-num-reassoc=2 to see how the nesting level materializes.

Admittedly my profile may be bogus (my Instruments is quite old and the trace looked weird), but I would nonetheless recommend looking into creating a proper profile and investigating what is going on instead of bypassing the whole reassociation process.

That being said, I also understand that this patch unblocks some people, so I would be okay with you pushing it now as long as you commit to looking at the actual problem in a reasonable time. In other words, if this patch is really needed to unblock people, go for it, but I want a different solution in the near future.

Cheers,
-Quentin

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3430 ↗(On Diff #89939)

I agree with Sanjoy that there is nothing preventing this from happening when AllFixupsOutsideLoop is false.

3429 ↗(On Diff #89714)

What is the highest number you get from the LLVM test suite?
Same for clang self-host?

@qcolombet, thanks for the analysis. It would be even better if we had a fix to the SCEVExpander's pathological behavior. Now I'm afraid this patch just hides the problem for this particular category of test case.

wmi added a comment.Mar 25 2017, 5:53 PM

Quentin, this is a really good finding, thanks a lot! I was misled by the large number of reassociation candidates, and I have verified that the non-linear increase in compile time is indeed because of SCEVExpander!

I dug into it a little and found a problem in SCEV:
For an SCEVAddRecExpr like this:
{{{{{{{{{{{{2002,+,-1}<nsw><%bb5>,+,-1}<nsw><%bb23>,+,-1}<nsw><%bb41>,+,-1}<nsw><%bb59>,+,-1}<nsw><%bb77>,+,-1}<nsw><%bb95>,+,-1}<nsw><%bb113>,+,-1}<nsw><%bb131>,+,-1}<nsw><%bb149>,+,-1}<nsw><%bb167>,+,-1}<nsw><%bb185>,+,-1}<nw><%bb203>

When ScalarEvolution::getZeroExtendExpr is called for a SCEVAddRecExpr, it will try to prove the wrap flag by computing whether sext(start + step * iterations) == sext(start) + sext(step * iterations). That means getZeroExtendExpr will be called at least twice for each level. If the SCEVAddRecExpr above has N levels of nested SCEVAddRecExpr, the complexity is O(2^N). However, if we cache the wrap flag, many of these getZeroExtendExpr calls can be saved.
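The blowup pattern can be modeled with a toy recursion (this is not LLVM code): each nesting level triggers two recursive extension queries of the level below, which is exponential without a cache and linear with one:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model of the recursion described above: two recursive calls per
// level, as when getZeroExtendExpr re-proves the wrap flag. Calls counts
// the total number of invocations.
uint64_t extendUncached(int Level, uint64_t &Calls) {
  ++Calls;
  if (Level == 0)
    return 1;
  return extendUncached(Level - 1, Calls) + extendUncached(Level - 1, Calls);
}

// Same recursion, but results per level are memoized, so the second query
// at each level is a cache hit: 2 * Level + 1 calls instead of
// 2^(Level + 1) - 1.
uint64_t extendCached(int Level, uint64_t &Calls,
                      std::map<int, uint64_t> &Cache) {
  ++Calls;
  auto It = Cache.find(Level);
  if (It != Cache.end())
    return It->second;
  uint64_t R = (Level == 0) ? 1
                            : extendCached(Level - 1, Calls, Cache) +
                                  extendCached(Level - 1, Calls, Cache);
  Cache[Level] = R;
  return R;
}
```

For a 12-level nest (like the SCEVAddRecExpr quoted above), the uncached version makes 8191 calls while the cached one makes 25.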

If the result of the wrap-flag analysis for a SCEVAddRecExpr is FlagNSW/FlagNUW, it will be recorded. If the result is FlagAnyWrap, it will be computed again next time. A problem with the current implementation is that FlagAnyWrap cannot tell us whether this is the first time the wrap flag of the SCEV has been analyzed.

I tried a hack: adding a FlagMayWrap to SCEV::NoWrapFlags. When we have done the analysis for a SCEVAddRecExpr and still have no idea whether it will wrap, we set it to FlagMayWrap. Then we can skip the wrap analysis for any SCEV with a flag other than FlagAnyWrap. I find the hack solves the compile-time problem, but it definitely needs to be improved.

It would be even better if we had a fix to the SCEVExpander's pathological behavior.

Completely agree! The patch was a means to demonstrate the problem :).

sanjoy edited edge metadata.Mar 25 2017, 10:36 PM

Hi Wei,

If this is indeed a problem with ScalarEvolution::getZeroExtendExpr, then you should be able to write a .ll file that triggers the exponential behavior with -analyze -scalar-evolution. Do you mind giving such a test case a shot (as a first step)?

wmi updated this revision to Diff 94115.Apr 4 2017, 2:08 PM

I extended the test, and it now takes more than one hour on my Sandy Bridge machine when built with clang in release mode.
I added early returns in getZeroExtendExpr/getSignExtendExpr for SCEVAddRecExpr with the NW flag. As the test shows, the compile-time explosion can only happen when the step of the SCEVAddRecExpr is negative and the NW flag can be marked. With the change, the test now takes less than one second.

Sanjoy, could you take a look?

sanjoy requested changes to this revision.Apr 9 2017, 11:55 PM

Hi Wei,

I may be wrong, but I'm not convinced that your approach is correct (the reason is inline).

Is it possible to solve this by adding a cache that maps SCEV expressions to their zero- (and sign-) extended variants? We wouldn't want this cache to be permanent, since it could lock SCEV into a pessimistic state that it won't get out of even after (say) proving a value to be nuw. However, we can create one such cache in the lexical scope of the top-level getZeroExtendExpr (and maybe split out a getZeroExtendImpl that takes a pointer to said cache).

lib/Analysis/ScalarEvolution.cpp
1790 ↗(On Diff #94115)

I'm not sure this is correct. You're saying that, e.g., sext({A,+,B}<nw>) is {sext A,+,zext B}, but what if A is INT_MAX, B is 1, and the backedge-taken count is 1?
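This counterexample can be made concrete at 8 bits with a pair of hypothetical helpers (not SCEV code, just the arithmetic): with Start = 127 (INT8_MAX) and Step = 1, the 8-bit recurrence wraps to -128 after one backedge, so sext of the recurrence value and the rewritten {sext A,+,zext B} form disagree:

```cpp
#include <cassert>
#include <cstdint>

// sext of the actual 8-bit recurrence value {Start,+,Step} after It
// backedges; the 8-bit addition is allowed to wrap (modulo 256).
int16_t sextOfRecurrence(int8_t Start, int8_t Step, uint8_t It) {
  uint8_t V = (uint8_t)(Start + Step * It);
  return (int16_t)(int8_t)V;
}

// The rewritten form {sext Start,+,zext Step}: operands are extended
// first, so the 16-bit arithmetic never wraps.
int16_t rewrittenForm(int8_t Start, int8_t Step, uint8_t It) {
  return (int16_t)((int16_t)Start + (int16_t)(uint8_t)Step * It);
}
```

After one backedge the two sides are -128 and 128 respectively, so the rewrite is invalid without a no-signed-wrap guarantee.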

unittests/Analysis/ScalarEvolutionTest.cpp
625 ↗(On Diff #94115)

Did you consider generating this IR programmatically? That would be easier to modify.

This revision now requires changes to proceed.Apr 9 2017, 11:55 PM
wmi added a comment.Apr 10 2017, 2:11 PM

Sanjoy, thanks for the comments.

Is it possible to solve this by adding a cache that maps SCEV expressions to their zero (and sign) extended variants? We'd not want this cache to be permanent, since it can lock SCEV into a pessimistic state that it won't get out of even after (say) proving a value to be nuw. However, we can create one such cache in the lexical scope of the top level getZeroExtendExpr (and maybe split out a getZeroExtendImpl that takes a pointer to said cache).

It sounds like a good solution. I will try it if my simple fix is still not enough after adding the constraint.

lib/Analysis/ScalarEvolution.cpp
1790 ↗(On Diff #94115)

Is adding one more constraint isKnownNegative(Step) enough?

unittests/Analysis/ScalarEvolutionTest.cpp
625 ↗(On Diff #94115)

Ok, I will do it.

sanjoy added inline comments.Apr 10 2017, 2:17 PM
lib/Analysis/ScalarEvolution.cpp
1790 ↗(On Diff #94115)

I don't think so (btw, the same issue applies to the getSignExtendExpr case as well): {0,+,-1} is <nw> for a loop that runs (say) 10 times, but it unsigned-overflows the first time the loop's backedge is taken.

wmi updated this revision to Diff 95000.Apr 12 2017, 11:09 AM
wmi edited edge metadata.

Address Sanjoy's comments.

  • Implement local caches for getZeroExtendExpr and getSignExtendExpr.
  • Generate the IR programmatically for the testcase.
sanjoy requested changes to this revision.Apr 12 2017, 2:15 PM
sanjoy added inline comments.
lib/Analysis/ScalarEvolution.cpp
1509 ↗(On Diff #95000)

DenseMap is probably too heavyweight here -- can you change this to SmallDenseMap<8>?

2002 ↗(On Diff #95000)

Can you create a lambda for this and use return InsertResult(SExt); instead of goto RET;?

This revision now requires changes to proceed.Apr 12 2017, 2:15 PM
wmi updated this revision to Diff 95071.Apr 12 2017, 6:12 PM
wmi edited edge metadata.

Address Sanjoy's comments: Use SmallDenseMap and lambda.

sanjoy requested changes to this revision.Apr 12 2017, 6:32 PM

I think the code can be structured to make things more robust against future changes. PTAL and see if you agree:

lib/Analysis/ScalarEvolution.cpp
1520 ↗(On Diff #95071)

s/use/Use/
s/func/function/

1547 ↗(On Diff #95071)

No need to move this comment?

1734 ↗(On Diff #95071)

Can you please make SmallDenseMap<std::pair<const SCEV *, Type *>, const SCEV *, 8> a typedef?

1739 ↗(On Diff #95071)

s/use/Use/
s/func/function/

Also CacheResult may be a better name (sorry for not suggesting it before! -- hopefully this should be a quick find-replace).

Also, what do you think about an RAII object (under #ifndef NDEBUG) that checks that {Op, Ty} is in the cache on destruction? I can imagine someone adding a new return path and forgetting to call CacheResult.

Actually, a better solution would be to make the caching guarantee true by construction:

getZeroExtendExpr creates a cache and calls getZeroExtendExprCached with the cache. getZeroExtendExprCached calls getZeroExtendExprImpl to do the heavy lifting (which recursively calls getZeroExtendExprCached when needed); it also does the pre-check in the cache and the insertion into the cache (that way getZeroExtendExprImpl does not have to remember to insert into the cache at each return site).
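A sketch of that shape with toy stand-ins (std::string plays the role of const SCEV *, and the "extension" is just string building; only the structure is intended, not real SCEV logic):

```cpp
#include <cassert>
#include <map>
#include <string>

using Expr = std::string;
using ExtendCacheTy = std::map<Expr, Expr>;

static int ImplCalls = 0; // counts how often the heavy lifting runs

Expr getZeroExtendExprImpl(const Expr &Op, ExtendCacheTy &Cache);

// The wrapper owns both the cache pre-check and the insertion, so the
// Impl function cannot forget to cache a result at any return site.
Expr getZeroExtendExprCached(const Expr &Op, ExtendCacheTy &Cache) {
  auto It = Cache.find(Op);
  if (It != Cache.end())
    return It->second;
  Expr Res = getZeroExtendExprImpl(Op, Cache);
  Cache.insert({Op, Res});
  return Res;
}

Expr getZeroExtendExprImpl(const Expr &Op, ExtendCacheTy &Cache) {
  (void)Cache; // a real Impl would recurse via getZeroExtendExprCached
  ++ImplCalls;
  return "zext(" + Op + ")";
}

// Public entry point: the cache lives only for this top-level query, so
// it cannot lock the analysis into a stale pessimistic state.
Expr getZeroExtendExpr(const Expr &Op) {
  ExtendCacheTy Cache;
  return getZeroExtendExprCached(Op, Cache);
}
```

Querying the same expression twice through the wrapper with one cache invokes the Impl only once, which is the memoization the suggestion is after.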

This revision now requires changes to proceed.Apr 12 2017, 6:32 PM
wmi updated this revision to Diff 95151.Apr 13 2017, 10:00 AM
wmi edited edge metadata.

Address Sanjoy's comments: Add another two proxy functions: getZeroExtendExprCached and getSignExtendExprCached.

sanjoy accepted this revision.Apr 16 2017, 10:26 PM

lgtm with nits

include/llvm/Analysis/ScalarEvolution.h
1163 ↗(On Diff #95151)

I think the usual naming scheme here is ExtendCacheTy?

lib/Analysis/ScalarEvolution.cpp
1522 ↗(On Diff #95151)

If possible, you should

auto InsertResult = Cache.insert({{Op, Ty}, ZExt});
assert(InsertResult.second);

otherwise state that it is possible (how?) that the key is already present in the cache.

This revision is now accepted and ready to land.Apr 16 2017, 10:26 PM
wmi added a comment.Apr 17 2017, 10:52 AM

Sanjoy, thanks for all the helpful comments.

include/llvm/Analysis/ScalarEvolution.h
1163 ↗(On Diff #95151)

Fixed.

lib/Analysis/ScalarEvolution.cpp
1522 ↗(On Diff #95151)

I added the assertion, and at least the unittests and the LLVM test suite didn't trigger it.

This revision was automatically updated to reflect the committed changes.