This is an archive of the discontinued LLVM Phabricator instance.

Differential D8816

Reimplement heuristic for estimating complete-unroll optimization effects.
ClosedPublic

Authored by mzolotukhin on Apr 2 2015, 7:26 PM.

Details

Reviewers

chandlerc
hfinkel

Commits

rG8c68171fef79: Reimplement heuristic for estimating complete-unroll optimization effects.
rL237156: Reimplement heuristic for estimating complete-unroll optimization effects.

Summary

This patch reimplements the heuristic that tries to estimate the optimization benefits
from complete loop unrolling.

In this patch I kept the changes minimal, e.g. I removed the code handling
branches and folding compares. That's a promising area, but for now there
are too many questions to discuss before we can enable it.

Diff Detail

Repository: rL LLVM

Event Timeline

mzolotukhin updated this revision to Diff 23200. Apr 2 2015, 7:26 PM

mzolotukhin retitled this revision to "Reimplement heuristic for estimating complete-unroll optimization effects."

mzolotukhin updated this object.

mzolotukhin edited the test plan for this revision. Apr 2 2015, 7:26 PM

mzolotukhin added reviewers: hfinkel, chandlerc.

mzolotukhin added a subscriber: Unknown Object (MLST).

mzolotukhin mentioned this in D7837: Reimplement heuristic for estimating complete-unroll optimization effects. Apr 2 2015, 7:28 PM

Ping!

Ok, full pass of comments below.

Generally, this is definitely looking better to me. I think there are still a number of things that could be simplified or refactored, but that can come later. The stuff below is just to try and get this iteration ready to go.

lib/Transforms/Scalar/LoopUnrollPass.cpp
254–255 ↗	(On Diff #23200)	Indentation seems off here..
257–261 ↗	(On Diff #23200)	No need to repeat the variable name. I'd also call the variable "HaveSeenAR" for consistency with LLVM's naming conventions.
272–274 ↗	(On Diff #23200)	While you're here, can you make the argument be 'L' instead of 'loop'?
284–285 ↗	(On Diff #23200)	I somewhat dislike an assignment in a return expression. It's really hard to see when reading the code. Could you instead set the variable and then return? Or maybe use a function that aborts the traversal?
287 ↗	(On Diff #23200)	This is the only raw false here. Everything else that returns false sets IndexIsConstant to false first. If this is correct, could you comment on why?
363 ↗	(On Diff #23200)	SmallDenseMap?
413–415 ↗	(On Diff #23200)	Please don't use DenseMap like this. You're inserting a null value for every AddrOp you visit. You want to use exactly the logic used above for the LHS and RHS expressions:
    Value *AddrOp = I.getPointerOperand();
    if (!isa<Constant>(AddrOp))
      if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
        AddrOp = SimplifiedAddrOp;
Better yet, just hoist this entire pattern into a helper called 'simplifyValue' or some such?
416–418 ↗	(On Diff #23200)	Don't do two map lookups. Lookup the key once (using find because you don't just get a null pointer) and then test the iterator and use the iterator.
420 ↗	(On Diff #23200)	In what case is the base address null? It doesn't really make sense to add null base addr records to the cache to me?
429–430 ↗	(On Diff #23200)	What ensures that this is always safe? If something does ensure that this is always safe, I must assume it is when populating the structure. If that's the case, why can't we only store the unsigned variant?
433 ↗	(On Diff #23200)	What happens when "Step * Iteration" overflows? What about when "(Start + Step * Iteration)" overflows?
437–441 ↗	(On Diff #23200)	Under what circumstances is this not a constant? If this is genuinely not a constant, should we really be considering the load "optimized"? (Also, the cast<Value> shouldn't be needed?)
449–454 ↗	(On Diff #23200)	Would this second comment paragraph go better on the SCEV evaluation tool above? Or sunk into the implementation below? It seems kind of out of place here.
456 ↗	(On Diff #23200)	Shouldn't you rinse this through the simplification map?
457–458 ↗	(On Diff #23200)	You should probably comment specifically that we expect to re-visit the same instructions repeatedly (once per iteration) and so we only want to do iteration-independent SCEV queries and computations once. I'd also probably extract all of the SCEV computation stuff below into a separate member function that you can comment as being iteration independent, etc. Then you can structure the visit somewhat more naturally.
489 ↗	(On Diff #23200)	If the base address is a constant the GEP will also fold away though, so we should be able to mark it as optimized? (and we should be able to do this on each iteration)
498–499 ↗	(On Diff #23200)	I find it really weird to count optimized instructions rather than counting instructions that will remain after optimizations.
504–510 ↗	(On Diff #23200)	Rather than setting UnrolledLoopSize to UINT_MAX below to signal some inability to reasonably compute the unrolled size estimation, why not return true or false here? If this returns false, we have no useful data about the loop. Move on. If this returns true, then you can query for the detailed numbers.
515–520 ↗	(On Diff #23200)	I think you should have a FIXME to eventually remove the max iteration count to analyze. Once we shake the bugs out of the algorithm, it shouldn't be necessary. We should be willing to analyze any number of iterations as long as the un-optimized resulting instruction count is below a threshold.
562–566 ↗	(On Diff #23200)	I would handle all of this below where you're actually dealing with percentages. Handling it here seems really surprising and hard to understand.
690–703 ↗	(On Diff #23200)	Can you explain some of your motivations for having the double threshold and percentage query? It seems really awkward to implement, and so I'm curious what the need is here. If we could get around with just having a flat threshold, it'd make me happy. =]
832–842 ↗	(On Diff #23200)	I may just be forgetting where this is handled, but do we somewhere short-circuit the case where the total size of the loop body * the trip count is already below the threshold? Because we should. We shouldn't go and do the expensive analysis unless we at least have a large enough loop to be on the fence.
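
The two map-lookup remarks above (413–415 and 416–418) can be illustrated with a small standalone sketch. It assumes std::unordered_map as a stand-in for llvm::DenseMap so that it compiles without LLVM headers; the names SimplifiedMap, lookupWithIndexing, lookupWithFind and demoLookups are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for DenseMap<Value *, Constant *>; strings and int pointers are
// used purely for illustration.
using SimplifiedMap = std::unordered_map<std::string, const int *>;

// Anti-pattern: operator[] default-constructs a null entry for every key that
// is queried, so the map grows even when the key was never simplified.
inline const int *lookupWithIndexing(SimplifiedMap &Map,
                                     const std::string &Key) {
  return Map[Key];
}

// Preferred: one find() call; the iterator is tested and then reused, so
// there is a single lookup and no insertion on a miss.
inline const int *lookupWithFind(const SimplifiedMap &Map,
                                 const std::string &Key) {
  auto It = Map.find(Key);
  return It == Map.end() ? nullptr : It->second;
}

// Usage sketch: a miss through find() leaves the map untouched, while a miss
// through operator[] silently adds a null record.
inline void demoLookups() {
  static const int Seven = 7;
  SimplifiedMap Map{{"a", &Seven}};
  assert(lookupWithFind(Map, "b") == nullptr && Map.size() == 1);
  assert(lookupWithIndexing(Map, "b") == nullptr && Map.size() == 2);
}
```

DenseMap's own lookup() member, as used in the reviewer's snippet, gives the same no-insertion behaviour for pointer values by returning a null pointer on a miss.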

mzolotukhin added inline comments. Apr 9 2015, 6:15 PM

lib/Transforms/Scalar/LoopUnrollPass.cpp
690–703 ↗	(On Diff #23200)	The idea is the following: currently we have a threshold for unrolling small loops (~200 instructions). What I want to add is a possibility to go beyond this threshold, but only if that gives a performance benefit. E.g. if the unrolled loop would be 500 instructions, but it would be 30% faster than the original loop, then we want to unroll it. But we do not want to unroll this loop if it would become only 5% faster (in terms of cost of executed instructions).

On the other hand, we don't want to unroll loops with huge trip counts, even if the resultant code seems to be faster. I.e. if unrolling would help to eliminate 50% of instructions, but the trip count is 10^9, we definitely don't want to unroll it.

And several examples to illustrate the idea:

a)
    int b[] = {0,0,0...0,1}; // most of the values are 0
    for (i = 0; i < 500; i++) {
      t = b[i] * c[i];
      a[i] = t * d[i];
    }

If we completely unroll the loop, we'll get something like:

    t = b[0] * c[0];
    a[0] = t * d[0];
    t = b[1] * c[1];
    a[1] = t * d[1];
    ...
    t = b[499] * c[499];
    a[499] = t * d[499];

which would be simplified to:

    a[0] = 0; // b[0] == 0
    a[1] = 0; // b[1] == 0
    ...
    a[498] = 0; // b[498] == 0
    a[499] = c[499] * d[499]; // b[499] == 1

That is, unrolling helps to remove ~50% of the instructions in this case - and that's not about code size, it's about execution time, because in the original loop we have to execute every MUL instruction, since we don't know the exact value of b[i].

b)
    /* The same example as before, but with a huge trip count. */
    int b[] = {0,0,0...0,1}; // most of the values are 0
    for (i = 0; i < 500000; i++) {
      t = b[i] * c[i];
      a[i] = t * d[i];
    }

We want to give up on this loop, because the unrolled version would be way too big. We might have some problems compiling it, and even if we compile it successfully, we might be hit hard by cache/memory effects.

c)
    /* The same example as (a), but unrolling doesn't help to simplify anything. */
    int b[] = {6,2,3...4,7}; // no 0 or 1 values
    for (i = 0; i < 500; i++) {
      t = b[i] * c[i];
      a[i] = t * d[i];
    }

We don't want to just start unrolling any loop with a higher trip count than we unrolled before, if that doesn't promise any performance benefits.

So, to distinguish (a) and (b), we use 'AbsoluteThreshold'. To distinguish (a) and (c) we use percentage.
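
The interplay of the two thresholds and the percentage described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code from the patch: the parameter names follow the canUnrollCompletely declaration shown in the diff, while the function name canUnrollCompletelySketch and the concrete numbers in the usage note (200, 2000, 20%) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical free-function version of the decision described above.
inline bool canUnrollCompletelySketch(unsigned Threshold,
                                      unsigned AbsoluteThreshold,
                                      uint64_t UnrolledSize,
                                      unsigned NumberOfOptimizedInstructions,
                                      unsigned MinPercentOfOptimized) {
  // Small loops are unrolled regardless of any estimated benefit.
  if (UnrolledSize <= Threshold)
    return true;
  // Case (b): never go past the absolute cap, however good the estimate is.
  if (UnrolledSize > AbsoluteThreshold)
    return false;
  // UnrolledSize > Threshold >= 0 at this point, so the division is safe.
  assert(UnrolledSize != 0);
  // Widen before multiplying so that the product cannot wrap in 32 bits.
  uint64_t Percent =
      static_cast<uint64_t>(NumberOfOptimizedInstructions) * 100 /
      UnrolledSize;
  // Cases (a) versus (c): require a minimum share of removed instructions.
  return Percent >= MinPercentOfOptimized;
}
```

With a threshold of 200, an absolute threshold of 2000 and a minimum of 20%, a 500-instruction unrolled body with 150 optimized instructions (30%) is accepted, the same body with 25 optimized instructions (5%) is rejected, and a 500000-instruction body is rejected no matter how many instructions fold away.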

Move check for possible NumberOfOptimizedInstructions*100 overflow into canUnrollCompletely.
Hoist computing SCEV expressions out of the main traversal loop. This step is semantically different from visiting instructions to check if they could become redundant after loop unrolling. It's different because we need to do it only once, while other visitors need to be run for every iteration of the loop.
Make analyzeLoop return bool.
Don't run the analysis if we can unroll the loop even without it.
Make SCEVGEPDescriptor's fields Start and Step uint64_t (previously, they were APInt).
Compute Index in uint64_t and make sure operands fit into 32-bit int.
Other small changes.
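
The split between work done once (computing and caching the address description) and work done on every simulated iteration can be sketched as follows. This is a simplified model, assuming a plain std::vector in place of a constant global initializer; AddressDescriptor and countFoldableMultiplies are hypothetical names, and the real patch works on SCEV expressions rather than on raw indices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a cached description of an address of the form
// ConstAddress + (Start + Iteration * Step).
struct AddressDescriptor {
  const std::vector<int> *Base; // constant data the address points into
  uint64_t Start;               // index used on iteration 0
  uint64_t Step;                // index increment per iteration
};

// For a loop body like "v += b[i] * a[i]" with a known-constant a[], counts
// the simulated iterations whose multiply folds away (a[i] is 0 or 1).
// The descriptor is computed once by the caller; only the cheap index
// arithmetic below is repeated per iteration. Start and Step are assumed to
// be small enough that the index computation does not wrap.
inline unsigned countFoldableMultiplies(const AddressDescriptor &Desc,
                                        unsigned TripCount) {
  assert(Desc.Base != nullptr);
  unsigned Folded = 0;
  for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
    uint64_t Index = Desc.Start + uint64_t(Iteration) * Desc.Step;
    if (Index >= Desc.Base->size())
      break; // out-of-bounds access: this sketch simply stops estimating
    int Loaded = (*Desc.Base)[Index];
    if (Loaded == 0 || Loaded == 1)
      ++Folded;
  }
  return Folded;
}
```

For a[] = {0, 1, 0} with Start 0 and Step 1, all three multiplies fold; for data without 0 or 1 values, none do.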

Hi Chandler,
Please find my answers inline. The patch is updated accordingly. Is it OK to commit it?

lib/Transforms/Scalar/LoopUnrollPass.cpp
254–255 ↗	(On Diff #23200)	Fixed.
257–261 ↗	(On Diff #23200)	Fixed.
272–274 ↗	(On Diff #23200)	Fixed.
284–285 ↗	(On Diff #23200)	Fixed.
287 ↗	(On Diff #23200)	I rewrote return statements in the function, now they use raw true/false values. I also return `false` instead of `true` in `if (isa<SCEVConstant>(S))` - that doesn't matter since we can't "follow" into the SCEVConstant anyway, but `false` is more consistent here.
363 ↗	(On Diff #23200)	Makes sense, fixed.
413–415 ↗	(On Diff #23200)	Thanks, fixed! Though I didn't add a separate function for it.
416–418 ↗	(On Diff #23200)	Fixed.
420 ↗	(On Diff #23200)	You are right, it can't be null (it's checked when we prepare a new entry to SCEVCache). Fixed.
429–430 ↗	(On Diff #23200)	Thanks, I added constraints on the operands. Also, we now store the unsigned variant instead of APInt, as you suggested.
433 ↗	(On Diff #23200)	I made Index, Step, and Start uint64_t, while the values of Start, Step, and Iteration can't exceed the 32-bit maximum. That should prevent overflows.
437–441 ↗	(On Diff #23200)	Thanks, fixed!
457–458 ↗	(On Diff #23200)	I think that we want to return to the original `cacheSCEVResults` approach. With that, we'd explicitly distinguish actions we want to do once (compute SCEVs and store interesting ones) from actions we want to perform on every simulated loop iteration (like trying to optimize LoadInst/BinaryOp/etc.). With that, added `cacheSCEVResults` and removed `visitGetElementPtr`.
489 ↗	(On Diff #23200)	We don't support such optimization for GEPS (for now). We can add it later, and it'll naturally go to `visitGetElementPtr`, which is currently removed.
498–499 ↗	(On Diff #23200)	I can change it, but in fact it doesn't sound so weird to me:)
504–510 ↗	(On Diff #23200)	That's a good idea, fixed according to your suggestion!
515–520 ↗	(On Diff #23200)	Is it possible that we can optimize all instructions in the loop body, and thus don't reach the threshold? I think it isn't, because we would have at least one instruction (branch) in the loop body, but I'm not confident here - maybe there could be some weird cases (i.e. cost of the branch is 0). If it's guaranteed that cost of the loop body is always > 0, then we can remove this limit.
562–566 ↗	(On Diff #23200)	Fixed!
832–842 ↗	(On Diff #23200)	Good point, fixed!
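
The overflow argument in the reply to line 433 can be checked with a small sketch: if Start, Step and Iteration each fit in 32 bits, then Start + Step * Iteration fits in 64 bits. The helper names computeIndex and demoComputeIndex are hypothetical, and the bounds check is an assumption about how such a constraint might be enforced, not the patch's code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: writes Start + Step * Iteration to Index and returns
// true, or returns false if any operand exceeds the 32-bit range.
inline bool computeIndex(uint64_t Start, uint64_t Step, uint64_t Iteration,
                         uint64_t &Index) {
  constexpr uint64_t Max32 = std::numeric_limits<uint32_t>::max();
  if (Start > Max32 || Step > Max32 || Iteration > Max32)
    return false;
  // Worst case: Max32 + Max32 * Max32 = 2^64 - 2^32, which fits in 64 bits.
  Index = Start + Step * Iteration;
  return true;
}

// Usage sketch.
inline void demoComputeIndex() {
  uint64_t Index = 0;
  assert(computeIndex(4, 8, 3, Index) && Index == 28);
  assert(!computeIndex(uint64_t(1) << 32, 1, 1, Index));
}
```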

Ping!

Rebase to trunk:

Reimplement heuristic for estimating complete-unroll optimization effects.
Address Chandler's remarks.
Address Chandler's remarks.
Address Chandler's remarks.
Fix merge fail.
Address Chandler's remarks.
Address Chandler's remarks.

Rebase on trunk.

Add a helper-function lookupSimplifiedValue.

Ping ^3.

Rebase on recent trunk.
Prevent precision loss in UnrolledSize.
Remove unused VisitedGEPs.
Fix a typo in comment.

Ping ^4.

Whew! Back to this at last. Sorry for the huge delay.

Lots of comments below, but I've marked some as good candidates for follow-up patches.

Can you let me know if it makes sense for me to take a look at the DCE stuff or if we should focus on getting this one landed first?

lib/Transforms/Scalar/LoopUnrollPass.cpp
277 ↗	(On Diff #25389)	Please add a comment to this function explaining what it's trying to do.
333–334 ↗	(On Diff #25389)	Rather than all of this, you can just use a SmallSetVector<BasicBlock *, 16>. I would prefer this somewhat rather than the typedef...
339 ↗	(On Diff #25389)	Just use a named inner struct -- typedef-ing structs is only necessary in C modes.
341–342 ↗	(On Diff #25389)	While 64-bits should be enough for any common cases, if the SCEV code has APInts I would just continue to use them here so that we have full fidelity.
377–384 ↗	(On Diff #25389)	The spacing and mixture of doxygen style and non-doxygen style seems really messy here.
419 ↗	(On Diff #25389)	'd' isn't a very helpful name (aside from not matching the variable naming conventions).
421–423 ↗	(On Diff #25389)	Comment here to remind the reader that you're checking for the specific types of SCEVGEP loads that can be folded completely to a constant.
432–433 ↗	(On Diff #25389)	We at least need a comment or FIXME here as we shouldn't return false here. A load past the end of sequential constant data is an error, and so we should be free to fold it to nothing for the purpose of loop unroll cost estimation.
450 ↗	(On Diff #25389)	Why not a range based loop here?
454–455 ↗	(On Diff #25389)	It feels like we could probably hoist some of this out of the loop? Feel free to just leave a FIXME and not deal with it in this patch.
463 ↗	(On Diff #25389)	Again, 'd' is a bad name here. Actually, I don't know why you create the descriptor this early? It seems like this region of code could just use the base addr from the visitor.
480 ↗	(On Diff #25389)	Is this to prevent overflow of the offset computations later? If so, comment that please. If not, what is it for?
483–486 ↗	(On Diff #25389)	I feel like this could just be an insert call? Or if you'd rather, something like: SCEVCache[V] = {Visitor.BaseAddress, StartAP.getLimitedValue(), StepAP.getLimitedValue()};
504–511 ↗	(On Diff #25389)	I feel like this should probably be a doxygen comment.
514–515 ↗	(On Diff #25389)	Why do we zero these here rather than in the constructor?
537–538 ↗	(On Diff #25389)	This seems vacuous due to the requirement of a terminator...
544–549 ↗	(On Diff #25389)	Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO. I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the unoptimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.
687–691 ↗	(On Diff #25389)	I would find it much more clear to just write the "percentage" check below in a way that wouldn't overflow: uint64_t PercentOfOptimizedInstructions = (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;
695 ↗	(On Diff #25389)	I don't think the comment here is helping. I would just add an assert about it above, after the test.
848 ↗	(On Diff #25389)	Why do you need this? I'm surprised this isn't just directly using the UA's values?
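
The suggestion for lines 687–691 can be demonstrated in isolation. The sketch below contrasts a 32-bit percentage computation, which wraps for large counts, with the widened form proposed in the comment; percentNaive and percentWidened are hypothetical names, and the example assumes a platform where unsigned int is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Naive form: the product Optimized * 100 is computed in 32 bits and wraps
// once Optimized exceeds UINT32_MAX / 100 (about 42.9 million).
inline uint32_t percentNaive(uint32_t Optimized, uint32_t UnrolledSize) {
  assert(UnrolledSize != 0);
  return Optimized * 100u / UnrolledSize;
}

// Widened form: the operand is converted to 64 bits before multiplying, so
// the product cannot wrap for any 32-bit count.
inline uint32_t percentWidened(uint32_t Optimized, uint64_t UnrolledSize) {
  assert(UnrolledSize != 0);
  return static_cast<uint32_t>(static_cast<uint64_t>(Optimized) * 100ull /
                               UnrolledSize);
}
```

For 50,000,000 optimized instructions out of 100,000,000, the widened form yields 50, while the naive form yields 7 because the intermediate product 5,000,000,000 wraps modulo 2^32.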

This revision now requires changes to proceed. May 11 2015, 3:28 PM

Add a comment before follow().
Replace BBSetVector with SmallSetVector<BasicBlock *, 16>.
Doxygen-ify some comments.
Remove unnecessary variable 'SCEVGEPDescriptor d'.
Use a range-based loop.
Replace 'd' with a meaningful name.
Add some comments and FIXMEs.
Initialize NumberOfOptimizedInstructions and UnrolledLoopSize in constructor.
Rewrite expression to avoid overflow.
Remove no longer needed overflow check.
Replace a comment with an assert.
Use UA.UnrolledLoopSize instead of min(UA.UnrolledLoopSize, UnrolledSize).
Add FIXME for out-of-bound access handling.
Remove redundant if(BB->empty()) check.
Replace typedef with a named struct.
Use APInt instead of uint64_t.
Add sanity checks before accessing SCEV.

Hi Chandler,

Thanks for the comments! I believe I've addressed all of them in the source, except this one:

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

I kept counting the cost inside the visit function, because we might have three cases there:

(1) the instruction was simplified to a constant (e.g. x = a[0] * y = 0 * y = 0)
(2) the instruction was simplified, but not to a constant (e.g. x = a[0] + y = 0 + y = y)
(3) the instruction wasn't simplified

In case we want to distinguish (1) and (2) outside the visit function, a bool won't be enough. Currently we wouldn't lose much by merging them, but I didn't want to limit ourselves here from the very beginning.
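
The trade-off between a three-way result and a plain bool can be sketched as below. The enum SimplifyResult and the functions classify, isOptimizedAway and demoClassify are hypothetical names introduced only for illustration; the patch itself counts costs inside the visitor rather than returning such a value.

```cpp
#include <cassert>

// Hypothetical three-way result, mirroring the three cases listed above.
enum class SimplifyResult {
  FoldedToConstant, // e.g. x = a[0] * y with a[0] == 0
  FoldedToOperand,  // e.g. x = a[0] + y with a[0] == 0
  NotSimplified
};

// Classifies "x = C op y", where C is a constant loaded from a[] and y is an
// unknown value. Only '*' and '+' are modelled in this sketch.
inline SimplifyResult classify(char Op, int LoadedConstant) {
  if (Op == '*' && LoadedConstant == 0)
    return SimplifyResult::FoldedToConstant;
  if ((Op == '*' && LoadedConstant == 1) || (Op == '+' && LoadedConstant == 0))
    return SimplifyResult::FoldedToOperand;
  return SimplifyResult::NotSimplified;
}

// What a bool-returning visit() would report: both folded cases collapse
// into "this instruction disappears".
inline bool isOptimizedAway(SimplifyResult R) {
  return R != SimplifyResult::NotSimplified;
}

// Usage sketch.
inline void demoClassify() {
  assert(classify('*', 0) == SimplifyResult::FoldedToConstant);
  assert(classify('+', 0) == SimplifyResult::FoldedToOperand);
  assert(!isOptimizedAway(classify('+', 5)));
}
```

If the cost model only needs to know whether x itself goes away, the bool is sufficient; the enum keeps the extra distinction available at the price of a wider interface.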

Generally, I think this can go in. There are a bunch of things I think should be cleaned up here, but they're fairly minor and I'm happy to just fix those and for a few that have more impact, send you patches.

I don't think we need to distinguish between them. The key to realize is that if 'y' above were inside the loop body, we would already have accounted for its cost. The critical thing is whether we can fold 'x' away.

Perhaps we'll want more advanced heuristics eventually, but I would rather assume not and simplify the code accordingly until a real use case arrives. (YAGNI, essentially.)

If you agree, I'm happy to make this change.

The only specific change I'd like to request you make in a follow-up is to ensure that some of the test cases you mentioned in email, exercising the percentage threshold etc., are actually checked in as test cases.

Thanks!

This revision is now accepted and ready to land. May 12 2015, 9:11 AM

Closed by commit rL237156: Reimplement heuristic for estimating complete-unroll optimization effects. (authored by mzolotukhin). May 12 2015, 10:23 AM

This revision was automatically updated to reflect the committed changes.

Thanks, Chandler!

I've committed the patch and will follow up with the tests today.

There are a bunch of things I think should be cleaned up here, but they're fairly minor and I'm happy to just fix those and for a few that have more impact, send you patches.

Sure, go ahead!

Michael

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp	556 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-bad-geps.ll	34 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics.ll	4 lines

Diff 25600

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

(180 lines not shown)
 public:

   // Select threshold values used to limit unrolling based on a
   // total unrolled size. Parameters Threshold and PartialThreshold
   // are set to the maximum unrolled size for fully and partially
   // unrolled loops respectively.
   void selectThresholds(const Loop *L, bool HasPragma,
                         const TargetTransformInfo::UnrollingPreferences &UP,
                         unsigned &Threshold, unsigned &PartialThreshold,
-                        unsigned NumberOfOptimizedInstructions) {
+                        unsigned &AbsoluteThreshold,
+                        unsigned &PercentOfOptimizedForCompleteUnroll) {
     // Determine the current unrolling threshold. While this is
     // normally set from UnrollThreshold, it is overridden to a
     // smaller value if the current function is marked as
     // optimize-for-size, and the unroll threshold was not user
     // specified.
     Threshold = UserThreshold ? CurrentThreshold : UP.Threshold;
-    // If we are allowed to completely unroll if we can remove M% of
-    // instructions, and we know that with complete unrolling we'll be able
-    // to kill N instructions, then we can afford to completely unroll loops
-    // with unrolled size up to N*100/M.
-    // Adjust the threshold according to that:
-    unsigned PercentOfOptimizedForCompleteUnroll =
-        UserPercentOfOptimized ? CurrentMinPercentOfOptimized
-                               : UP.MinPercentOfOptimized;
-    unsigned AbsoluteThreshold = UserAbsoluteThreshold
-                                     ? CurrentAbsoluteThreshold
+    PartialThreshold = UserThreshold ? CurrentThreshold : UP.PartialThreshold;
+    AbsoluteThreshold = UserAbsoluteThreshold ? CurrentAbsoluteThreshold
                                               : UP.AbsoluteThreshold;
-    if (PercentOfOptimizedForCompleteUnroll)
-      Threshold = std::max<unsigned>(Threshold,
-                                     NumberOfOptimizedInstructions * 100 /
-                                         PercentOfOptimizedForCompleteUnroll);
-    // But don't allow unrolling loops bigger than absolute threshold.
-    Threshold = std::min<unsigned>(Threshold, AbsoluteThreshold);
+    PercentOfOptimizedForCompleteUnroll = UserPercentOfOptimized
+                                              ? CurrentMinPercentOfOptimized
+                                              : UP.MinPercentOfOptimized;

-    PartialThreshold = UserThreshold ? CurrentThreshold : UP.PartialThreshold;
     if (!UserThreshold &&
         L->getHeader()->getParent()->hasFnAttribute(
             Attribute::OptimizeForSize)) {
       Threshold = UP.OptSizeThreshold;
       PartialThreshold = UP.PartialOptSizeThreshold;
     }
     if (HasPragma) {
       // If the loop has an unrolling pragma, we want to be more
       // aggressive with unrolling limits. Set thresholds to at
       // least the PragmaTheshold value which is larger than the
       // default limits.
       if (Threshold != NoThreshold)
         Threshold = std::max<unsigned>(Threshold, PragmaUnrollThreshold);
       if (PartialThreshold != NoThreshold)
         PartialThreshold =
             std::max<unsigned>(PartialThreshold, PragmaUnrollThreshold);
     }
   }
+  bool canUnrollCompletely(Loop *L, unsigned Threshold,
+                           unsigned AbsoluteThreshold, uint64_t UnrolledSize,
+                           unsigned NumberOfOptimizedInstructions,
+                           unsigned PercentOfOptimizedForCompleteUnroll);
 };
 }

 char LoopUnroll::ID = 0;
 INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)
 INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
 INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
 INITIALIZE_PASS_DEPENDENCY(LCSSA)
 INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)
 INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)

 Pass *llvm::createLoopUnrollPass(int Threshold, int Count, int AllowPartial,
                                  int Runtime) {
   return new LoopUnroll(Threshold, Count, AllowPartial, Runtime);
 }

 Pass *llvm::createSimpleLoopUnrollPass() {
   return llvm::createLoopUnrollPass(-1, -1, 0, 0);
 }

-static bool isLoadFromConstantInitializer(Value *V) {
-  if (GlobalVariable *GV = dyn_cast<GlobalVariable>(V))
-    if (GV->isConstant() && GV->hasDefinitiveInitializer())
-      return GV->getInitializer();
-  return false;
-}

 namespace {
+/// \brief SCEV expressions visitor used for finding expressions that would
+/// become constants if the loop L is unrolled.
 struct FindConstantPointers {
-  bool LoadCanBeConstantFolded;
+  /// \brief Shows whether the expression is ConstAddress+Constant or not.
   bool IndexIsConstant;
-  APInt Step;
-  APInt StartValue;

+  /// \brief Used for filtering out SCEV expressions with two or more AddRec
+  /// subexpressions.
+  ///
+  /// Used to filter out complicated SCEV expressions, having several AddRec
+  /// sub-expressions. We don't handle them, because unrolling one loop
+  /// would help to replace only one of these inductions with a constant, and
+  /// consequently, the expression would remain non-constant.
+  bool HaveSeenAR;

+  /// \brief If the SCEV expression becomes ConstAddress+Constant, this value
+  /// holds ConstAddress. Otherwise, it's nullptr.
   Value *BaseAddress;

+  /// \brief The loop, which we try to completely unroll.
   const Loop *L;

   ScalarEvolution &SE;
-  FindConstantPointers(const Loop *loop, ScalarEvolution &SE)
-      : LoadCanBeConstantFolded(true), IndexIsConstant(true), L(loop), SE(SE) {}

+  FindConstantPointers(const Loop *L, ScalarEvolution &SE)
+      : IndexIsConstant(true), HaveSeenAR(false), BaseAddress(nullptr),
+        L(L), SE(SE) {}

+  /// Examine the given expression S and figure out, if it can be a part of an
+  /// expression, that could become a constant after the loop is unrolled.
+  /// The routine sets IndexIsConstant and HaveSeenAR according to the analysis
+  /// results.
+  /// \returns true if we need to examine subexpressions, and false otherwise.
   bool follow(const SCEV *S) {
     if (const SCEVUnknown *SC = dyn_cast<SCEVUnknown>(S)) {
       // We've reached the leaf node of SCEV, it's most probably just a
-      // variable. Now it's time to see if it corresponds to a global constant
-      // global (in which case we can eliminate the load), or not.
+      // variable.
+      // If it's the only one SCEV-subexpression, then it might be a base
+      // address of an index expression.
+      // If we've already recorded base address, then just give up on this SCEV
+      // - it's too complicated.
+      if (BaseAddress) {
+        IndexIsConstant = false;
+        return false;
+      }
       BaseAddress = SC->getValue();
-      LoadCanBeConstantFolded =
-          IndexIsConstant && isLoadFromConstantInitializer(BaseAddress);
       return false;
     }
     if (isa<SCEVConstant>(S))
-      return true;
+      return false;
     if (const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(S)) {
       // If the current SCEV expression is AddRec, and its loop isn't the loop
       // we are about to unroll, then we won't get a constant address after
       // unrolling, and thus, won't be able to eliminate the load.
-      if (AR->getLoop() != L)
-        return IndexIsConstant = false;
-      // If the step isn't constant, we won't get constant addresses in unrolled
-      // version. Bail out.
-      if (const SCEVConstant *StepSE =
-              dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE)))
-        Step = StepSE->getValue()->getValue();
-      else
-        return IndexIsConstant = false;
-      return IndexIsConstant;
+      if (AR->getLoop() != L) {
+        IndexIsConstant = false;
+        return false;
+      }
+      // We don't handle multiple AddRecs here, so give up in this case.
+      if (HaveSeenAR) {
+        IndexIsConstant = false;
+        return false;
+      }
+      HaveSeenAR = true;
     }
-    // If Result is true, continue traversal.
-    // Otherwise, we have found something that prevents us from (possible) load
-    // elimination.
-    return IndexIsConstant;
+    // Continue traversal.
+    return true;
   }
   bool isDone() const { return !IndexIsConstant; }
 };

// This class is used to get an estimate of the optimization effects that we		// This class is used to get an estimate of the optimization effects that we
// could get from complete loop unrolling. It comes from the fact that some		// could get from complete loop unrolling. It comes from the fact that some
// loads might be replaced with concrete constant values and that could trigger		// loads might be replaced with concrete constant values and that could trigger
// a chain of instruction simplifications.		// a chain of instruction simplifications.
//		//
// E.g. we might have:		// E.g. we might have:
// int a[] = {0, 1, 0};		// int a[] = {0, 1, 0};
// v = 0;		// v = 0;
// for (i = 0; i < 3; i ++)		// for (i = 0; i < 3; i ++)
// v += b[i]*a[i];		// v += b[i]*a[i];
// If we completely unroll the loop, we would get:		// If we completely unroll the loop, we would get:
// v = b[0]a[0] + b[1]a[1] + b[2]*a[2]		// v = b[0]a[0] + b[1]a[1] + b[2]*a[2]
// Which then will be simplified to:		// Which then will be simplified to:
// v = b[0]* 0 + b[1]* 1 + b[2]* 0		// v = b[0]* 0 + b[1]* 1 + b[2]* 0
// And finally:		// And finally:
// v = b[1]		// v = b[1]
class UnrollAnalyzer : public InstVisitor<UnrollAnalyzer, bool> {		class UnrollAnalyzer : public InstVisitor<UnrollAnalyzer, bool> {
typedef InstVisitor<UnrollAnalyzer, bool> Base;		typedef InstVisitor<UnrollAnalyzer, bool> Base;
friend class InstVisitor<UnrollAnalyzer, bool>;		friend class InstVisitor<UnrollAnalyzer, bool>;

		struct SCEVGEPDescriptor {
		Value *BaseAddr;
		APInt Start;
		APInt Step;
		};

		/// \brief The loop we're going to analyze.
const Loop *L;		const Loop *L;

		/// \brief TripCount of the given loop.
unsigned TripCount;		unsigned TripCount;

ScalarEvolution &SE;		ScalarEvolution &SE;

const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;

		// While we walk the loop instructions, we build up and maintain a mapping
		// of simplified values specific to this iteration. The idea is to propagate
		// any special information we have about loads that can be replaced with
		// constants after complete unrolling, and account for likely simplifications
		// post-unrolling.
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
DenseMap<LoadInst *, Value *> LoadBaseAddresses;
SmallPtrSet<Instruction *, 32> CountedInstructions;

/// \brief Count the number of optimized instructions.		// To avoid requesting SCEV info on every iteration, request it once, and
unsigned NumberOfOptimizedInstructions;		// for each value that would become ConstAddress+Constant after loop
		// unrolling, save the corresponding data.
		SmallDenseMap<Value *, SCEVGEPDescriptor> SCEVCache;

		/// \brief Number of currently simulated iteration.
		///
		/// If an expression is ConstAddress+Constant, then the Constant is
		/// Start + Iteration*Step, where Start and Step could be obtained from
		/// SCEVCache.
		unsigned Iteration;

// Provide base case for our instruction visit.		/// \brief Upper threshold for complete unrolling.
		unsigned MaxUnrolledLoopSize;

		/// Base case for the instruction visitor.
bool visitInstruction(Instruction &I) { return false; };		bool visitInstruction(Instruction &I) { return false; };
// TODO: We should also visit ICmp, FCmp, GetElementPtr, Trunc, ZExt, SExt,
// FPTrunc, FPExt, FPToUI, FPToSI, UIToFP, SIToFP, BitCast, Select,		/// TODO: Add visitors for other instruction types, e.g. ZExt, SExt.
// ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue.
//		/// Try to simplify binary operator I.
// Probably it's worth to hoist the code for estimating the simplifications		///
// effects to a separate class, since we have a very similar code in		/// TODO: Probably it's worth to hoist the code for estimating the
// InlineCost already.		/// simplifications effects to a separate class, since we have a very similar
		/// code in InlineCost already.
bool visitBinaryOperator(BinaryOperator &I) {		bool visitBinaryOperator(BinaryOperator &I) {
Value *LHS = I.getOperand(0), *RHS = I.getOperand(1);		Value *LHS = I.getOperand(0), *RHS = I.getOperand(1);
if (!isa<Constant>(LHS))		if (!isa<Constant>(LHS))
if (Constant *SimpleLHS = SimplifiedValues.lookup(LHS))		if (Constant *SimpleLHS = SimplifiedValues.lookup(LHS))
LHS = SimpleLHS;		LHS = SimpleLHS;
if (!isa<Constant>(RHS))		if (!isa<Constant>(RHS))
if (Constant *SimpleRHS = SimplifiedValues.lookup(RHS))		if (Constant *SimpleRHS = SimplifiedValues.lookup(RHS))
RHS = SimpleRHS;		RHS = SimpleRHS;
Value *SimpleV = nullptr;		Value *SimpleV = nullptr;
const DataLayout &DL = I.getModule()->getDataLayout();		const DataLayout &DL = I.getModule()->getDataLayout();
if (auto FI = dyn_cast<FPMathOperator>(&I))		if (auto FI = dyn_cast<FPMathOperator>(&I))
SimpleV =		SimpleV =
SimplifyFPBinOp(I.getOpcode(), LHS, RHS, FI->getFastMathFlags(), DL);		SimplifyFPBinOp(I.getOpcode(), LHS, RHS, FI->getFastMathFlags(), DL);
else		else
SimpleV = SimplifyBinOp(I.getOpcode(), LHS, RHS, DL);		SimpleV = SimplifyBinOp(I.getOpcode(), LHS, RHS, DL);

if (SimpleV && CountedInstructions.insert(&I).second)		if (SimpleV)
NumberOfOptimizedInstructions += TTI.getUserCost(&I);		NumberOfOptimizedInstructions += TTI.getUserCost(&I);

if (Constant *C = dyn_cast_or_null<Constant>(SimpleV)) {		if (Constant *C = dyn_cast_or_null<Constant>(SimpleV)) {
SimplifiedValues[&I] = C;		SimplifiedValues[&I] = C;
return true;		return true;
}		}
return false;		return false;
}		}

Constant *computeLoadValue(LoadInst *LI, unsigned Iteration) {		/// Try to fold load I.
if (!LI)		bool visitLoad(LoadInst &I) {
return nullptr;		Value *AddrOp = I.getPointerOperand();
Value *BaseAddr = LoadBaseAddresses[LI];		if (!isa<Constant>(AddrOp))
if (!BaseAddr)		if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
return nullptr;		AddrOp = SimplifiedAddrOp;

auto GV = dyn_cast<GlobalVariable>(BaseAddr);		auto It = SCEVCache.find(AddrOp);
if (!GV)		if (It == SCEVCache.end())
return nullptr;		return false;
		SCEVGEPDescriptor GEPDesc = It->second;

		auto GV = dyn_cast<GlobalVariable>(GEPDesc.BaseAddr);
		// We're only interested in loads that can be completely folded to a
		// constant.
		if (!GV || !GV->hasInitializer())
		return false;

ConstantDataSequential *CDS =		ConstantDataSequential *CDS =
dyn_cast<ConstantDataSequential>(GV->getInitializer());		dyn_cast<ConstantDataSequential>(GV->getInitializer());
if (!CDS)		if (!CDS)
return nullptr;		return false;

const SCEV *BaseAddrSE = SE.getSCEV(BaseAddr);
const SCEV *S = SE.getSCEV(LI->getPointerOperand());
const SCEV *OffSE = SE.getMinusSCEV(S, BaseAddrSE);

APInt StepC, StartC;
const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(OffSE);
if (!AR)
return nullptr;

if (const SCEVConstant *StepSE =
dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE)))
StepC = StepSE->getValue()->getValue();
else
return nullptr;

if (const SCEVConstant *StartSE = dyn_cast<SCEVConstant>(AR->getStart()))
StartC = StartSE->getValue()->getValue();
else
return nullptr;

		// Check possible overflow.
		if (GEPDesc.Start.getActiveBits() > 32 || GEPDesc.Step.getActiveBits() > 32)
		return false;
unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;		unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
unsigned Start = StartC.getLimitedValue();		uint64_t Index = (GEPDesc.Start.getLimitedValue() +
unsigned Step = StepC.getLimitedValue();		GEPDesc.Step.getLimitedValue() * Iteration) /
		ElemSize;
unsigned Index = (Start + Step * Iteration) / ElemSize;		if (Index >= CDS->getNumElements()) {
if (Index >= CDS->getNumElements())		// FIXME: For now we conservatively ignore out of bound accesses, but
return nullptr;		// we're allowed to perform the optimization in this case.
		return false;
		}

Constant *CV = CDS->getElementAsConstant(Index);		Constant *CV = CDS->getElementAsConstant(Index);
		assert(CV && "Constant expected.");
		SimplifiedValues[&I] = CV;

return CV;		NumberOfOptimizedInstructions += TTI.getUserCost(&I);
		return true;
}		}

public:		/// Visit all GEPs in the loop and find those which after complete loop
UnrollAnalyzer(const Loop *L, unsigned TripCount, ScalarEvolution &SE,		/// unrolling would become a constant, or BaseAddress+Constant.
const TargetTransformInfo &TTI)		///
: L(L), TripCount(TripCount), SE(SE), TTI(TTI),		/// Such GEPs could allow to evaluate a load to a constant later - for now we
NumberOfOptimizedInstructions(0) {}		/// just store the corresponding BaseAddress and StartValue with StepValue in
		/// the SCEVCache.
// Visit all loads the loop L, and for those that, after complete loop		void cacheSCEVResults() {
// unrolling, would have a constant address and it will point to a known
// constant initializer, record its base address for future use. It is used
// when we estimate number of potentially simplified instructions.
void findConstFoldableLoads() {
for (auto BB : L->getBlocks()) {		for (auto BB : L->getBlocks()) {
for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I) {		for (Instruction &I : *BB) {
if (LoadInst *LI = dyn_cast<LoadInst>(I)) {		if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(&I)) {
if (!LI->isSimple())		Value *V = cast<Value>(GEP);
		if (!SE.isSCEVable(V->getType()))
continue;		continue;
Value *AddrOp = LI->getPointerOperand();		const SCEV *S = SE.getSCEV(V);
const SCEV *S = SE.getSCEV(AddrOp);		// FIXME: Hoist the initialization out of the loop.
FindConstantPointers Visitor(L, SE);		FindConstantPointers Visitor(L, SE);
SCEVTraversal<FindConstantPointers> T(Visitor);		SCEVTraversal<FindConstantPointers> T(Visitor);
		// Try to find (BaseAddress+Step+Offset) tuple.
		// If succeeded, save it to the cache - it might help in folding
		// loads.
T.visitAll(S);		T.visitAll(S);
if (Visitor.IndexIsConstant && Visitor.LoadCanBeConstantFolded) {		if (!Visitor.IndexIsConstant || !Visitor.BaseAddress)
LoadBaseAddresses[LI] = Visitor.BaseAddress;
}
}
}
}
}

// Given a list of loads that could be constant-folded (LoadBaseAddresses),
// estimate number of optimized instructions after substituting the concrete
// values for the given Iteration. Also track how many instructions become
// dead through this process.
unsigned estimateNumberOfOptimizedInstructions(unsigned Iteration) {
// We keep a set vector for the worklist so that we don't waste space in the
// worklist queuing up the same instruction repeatedly. This can happen due
// to multiple operands being the same instruction or due to the same
// instruction being an operand of lots of things that end up dead or
// simplified.
SmallSetVector<Instruction *, 8> Worklist;

// Clear the simplified values and counts for this iteration.
SimplifiedValues.clear();
CountedInstructions.clear();
NumberOfOptimizedInstructions = 0;

// We start by adding all loads to the worklist.
for (auto &LoadDescr : LoadBaseAddresses) {
LoadInst *LI = LoadDescr.first;
SimplifiedValues[LI] = computeLoadValue(LI, Iteration);
if (CountedInstructions.insert(LI).second)
NumberOfOptimizedInstructions += TTI.getUserCost(LI);

for (User *U : LI->users())
Worklist.insert(cast<Instruction>(U));
}

// And then we try to simplify every user of every instruction from the
// worklist. If we do simplify a user, add it to the worklist to process
// its users as well.
while (!Worklist.empty()) {
Instruction *I = Worklist.pop_back_val();
if (!L->contains(I))
continue;		continue;
if (!visit(I))
continue;
for (User *U : I->users())
Worklist.insert(cast<Instruction>(U));
}

// Now that we know the potentially simplified instructions, estimate number		const SCEV *BaseAddrSE = SE.getSCEV(Visitor.BaseAddress);
// of instructions that would become dead if we do perform the		if (BaseAddrSE->getType() != S->getType())
// simplification.		continue;
		const SCEV *OffSE = SE.getMinusSCEV(S, BaseAddrSE);
// The dead instructions are held in a separate set. This is used to		const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(OffSE);
// prevent us from re-examining instructions and make sure we only count
// the benefit once. The worklist's internal set handles insertion
// deduplication.
SmallPtrSet<Instruction *, 16> DeadInstructions;

// Lambda to enqueue operands onto the worklist.
auto EnqueueOperands = [&](Instruction &I) {
for (auto *Op : I.operand_values())
if (auto *OpI = dyn_cast<Instruction>(Op))
if (!OpI->use_empty())
Worklist.insert(OpI);
};

// Start by initializing worklist with simplified instructions.		if (!AR)
for (auto &FoldedKeyValue : SimplifiedValues)
if (auto *FoldedInst = dyn_cast<Instruction>(FoldedKeyValue.first)) {
DeadInstructions.insert(FoldedInst);

// Add each instruction operand of this dead instruction to the
// worklist.
EnqueueOperands(*FoldedInst);
}

// If a definition of an insn is only used by simplified or dead
// instructions, it's also dead. Check defs of all instructions from the
// worklist.
while (!Worklist.empty()) {
Instruction *I = Worklist.pop_back_val();
if (!L->contains(I))
continue;		continue;
if (DeadInstructions.count(I))
		const SCEVConstant *StepSE =
		dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE));
		const SCEVConstant *StartSE = dyn_cast<SCEVConstant>(AR->getStart());
		if (!StepSE || !StartSE)
continue;		continue;

if (std::all_of(I->user_begin(), I->user_end(), [&](User *U) {		SCEVCache[V] = {Visitor.BaseAddress, StartSE->getValue()->getValue(),
return DeadInstructions.count(cast<Instruction>(U));		StepSE->getValue()->getValue()};
})) {		}
NumberOfOptimizedInstructions += TTI.getUserCost(I);
DeadInstructions.insert(I);
EnqueueOperands(*I);
}		}
}		}
return NumberOfOptimizedInstructions;
}		}
};
} // namespace

// Complete loop unrolling can make some loads constant, and we need to know if		public:
// that would expose any further optimization opportunities.		UnrollAnalyzer(const Loop *L, unsigned TripCount, ScalarEvolution &SE,
// This routine estimates this optimization effect and returns the number of		const TargetTransformInfo &TTI, unsigned MaxUnrolledLoopSize)
// instructions, that potentially might be optimized away.		: L(L), TripCount(TripCount), SE(SE), TTI(TTI),
static unsigned		MaxUnrolledLoopSize(MaxUnrolledLoopSize),
approximateNumberOfOptimizedInstructions(const Loop *L, ScalarEvolution &SE,		NumberOfOptimizedInstructions(0), UnrolledLoopSize(0) {}
unsigned TripCount,
const TargetTransformInfo &TTI) {
if (!TripCount || !UnrollMaxIterationsCountToAnalyze)
return 0;

UnrollAnalyzer UA(L, TripCount, SE, TTI);		/// \brief Count the number of optimized instructions.
UA.findConstFoldableLoads();		unsigned NumberOfOptimizedInstructions;

// Estimate number of instructions, that could be simplified if we replace a		/// \brief Count the total number of instructions.
// load with the corresponding constant. Since the same load will take		unsigned UnrolledLoopSize;
// different values on different iterations, we have to go through all loop's
// iterations here. To limit ourselves here, we check only first N
// iterations, and then scale the found number, if necessary.
unsigned IterationsNumberForEstimate =
std::min<unsigned>(UnrollMaxIterationsCountToAnalyze, TripCount);
unsigned NumberOfOptimizedInstructions = 0;
for (unsigned i = 0; i < IterationsNumberForEstimate; ++i)
NumberOfOptimizedInstructions +=
UA.estimateNumberOfOptimizedInstructions(i);

NumberOfOptimizedInstructions *= TripCount / IterationsNumberForEstimate;		/// \brief Figure out if the loop is worth full unrolling.
		///
		/// Complete loop unrolling can make some loads constant, and we need to know
		/// if that would expose any further optimization opportunities. This routine
		/// estimates this optimization. It assigns computed number of instructions,
		/// that potentially might be optimized away, to
		/// NumberOfOptimizedInstructions, and total number of instructions to
		/// UnrolledLoopSize (not counting blocks that won't be reached, if we were
		/// able to compute the condition).
		/// \returns false if we can't analyze the loop, or if we discovered that
		/// unrolling won't give anything. Otherwise, returns true.
		bool analyzeLoop() {
		SmallSetVector<BasicBlock *, 16> BBWorklist;

		// Don't simulate loops with a big or unknown tripcount
		if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
		TripCount > UnrollMaxIterationsCountToAnalyze)
		return false;

		// To avoid computing SCEV expressions on every iteration, compute them once
		// and store the ones of interest in SCEVCache.
		cacheSCEVResults();

		// Simulate execution of each iteration of the loop counting instructions,
		// which would be simplified.
		// Since the same load will take different values on different iterations,
		// we literally have to go through all loop's iterations.
		for (Iteration = 0; Iteration < TripCount; ++Iteration) {
		SimplifiedValues.clear();
		BBWorklist.clear();
		BBWorklist.insert(L->getHeader());
		// Note that we must not cache the size, this loop grows the worklist.
		for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
		BasicBlock *BB = BBWorklist[Idx];

		// Visit all instructions in the given basic block and try to simplify
		// it. We don't change the actual IR, just count optimization
		// opportunities.
		for (Instruction &I : *BB) {
		UnrolledLoopSize += TTI.getUserCost(&I);
		Base::visit(I);
		// If unrolled body turns out to be too big, bail out.
		if (UnrolledLoopSize - NumberOfOptimizedInstructions >
		MaxUnrolledLoopSize)
		return false;
		}

return NumberOfOptimizedInstructions;		// Add BB's successors to the worklist.
		for (BasicBlock *Succ : successors(BB))
		if (L->contains(Succ))
		BBWorklist.insert(Succ);
		}

		// If we found no optimization opportunities on the first iteration, we
		// won't find them on later ones either.
		if (!NumberOfOptimizedInstructions)
		return false;
		}
		return true;
}		}
		};
		} // namespace
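
The per-iteration simulation in analyzeLoop can be sketched without any LLVM dependencies. The block below is only an illustration under stated assumptions: the names GEPDescSketch, EstimateSketch, foldLoad and analyzeLoopSketch are hypothetical, every instruction is given a unit cost (the patch asks TTI.getUserCost instead), and the loop body is hard-coded to the `v += b[i]*a[i]` example from the class comment, with `a` being the constant array.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for SCEVGEPDescriptor: byte offsets into a constant
// array, relative to its base address.
struct GEPDescSketch {
  uint64_t Start; // byte offset on iteration 0
  uint64_t Step;  // bytes added on every iteration
};

// Hypothetical result holder, mirroring UnrolledLoopSize and
// NumberOfOptimizedInstructions.
struct EstimateSketch {
  unsigned UnrolledSize = 0;
  unsigned Optimized = 0;
};

// Same index arithmetic as visitLoad: (Start + Step * Iteration) / ElemSize.
// Returns false for an out-of-bounds index, which is conservatively ignored.
bool foldLoad(const std::vector<int> &Init, GEPDescSketch D,
              unsigned Iteration, int &Out) {
  uint64_t Index = (D.Start + D.Step * Iteration) / sizeof(int);
  if (Index >= Init.size())
    return false;
  Out = Init[Index];
  return true;
}

// Simulates `v += b[i] * a[i]` for TripCount iterations, assuming a unit cost
// per instruction. Returns false if the loop is not worth unrolling.
bool analyzeLoopSketch(const std::vector<int> &A, unsigned TripCount,
                       unsigned MaxUnrolledSize, EstimateSketch &E) {
  if (!TripCount)
    return false;
  GEPDescSketch D{0, sizeof(int)};
  for (unsigned It = 0; It < TripCount; ++It) {
    E.UnrolledSize += 4; // load a[i], load b[i], mul, add
    int AVal = 0;
    if (foldLoad(A, D, It, AVal)) {
      E.Optimized += 1;   // load a[i] folds to a constant
      if (AVal == 0)
        E.Optimized += 2; // b[i]*0 simplifies to 0, then v+0 simplifies to v
      else if (AVal == 1)
        E.Optimized += 1; // b[i]*1 simplifies to b[i]
    }
    // Bail out once the estimated unrolled body gets too big.
    if (E.UnrolledSize - E.Optimized > MaxUnrolledSize)
      return false;
    // Nothing simplified so far: give up early, as analyzeLoop does.
    if (E.Optimized == 0)
      return false;
  }
  return true;
}
```

For `a = {0, 1, 0}` and a trip count of 3, this sketch counts 12 instructions in the unrolled body, 8 of which are simplified. Unlike the patch, it checks the size limit once per iteration rather than after every instruction.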

/// ApproximateLoopSize - Approximate the size of the loop.		/// ApproximateLoopSize - Approximate the size of the loop.
static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,		static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,
bool &NotDuplicatable,		bool &NotDuplicatable,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
AssumptionCache *AC) {		AssumptionCache *AC) {
SmallPtrSet<const Value *, 32> EphValues;		SmallPtrSet<const Value *, 32> EphValues;
CodeMetrics::collectEphemeralValues(L, AC, EphValues);		CodeMetrics::collectEphemeralValues(L, AC, EphValues);
(88 lines not shown)	static void SetLoopAlreadyUnrolled(Loop *L) {
MDs.push_back(DisableNode);		MDs.push_back(DisableNode);

MDNode *NewLoopID = MDNode::get(Context, MDs);		MDNode *NewLoopID = MDNode::get(Context, MDs);
// Set operand 0 to refer to the loop id itself.		// Set operand 0 to refer to the loop id itself.
NewLoopID->replaceOperandWith(0, NewLoopID);		NewLoopID->replaceOperandWith(0, NewLoopID);
L->setLoopID(NewLoopID);		L->setLoopID(NewLoopID);
}		}

		bool LoopUnroll::canUnrollCompletely(
		Loop *L, unsigned Threshold, unsigned AbsoluteThreshold,
		uint64_t UnrolledSize, unsigned NumberOfOptimizedInstructions,
		unsigned PercentOfOptimizedForCompleteUnroll) {

		if (Threshold == NoThreshold) {
		DEBUG(dbgs() << " Can fully unroll, because no threshold is set.\n");
		return true;
		}

		if (UnrolledSize <= Threshold) {
		DEBUG(dbgs() << " Can fully unroll, because unrolled size: "
		<< UnrolledSize << "<" << Threshold << "\n");
		return true;
		}

		assert(UnrolledSize && "UnrolledSize can't be 0 at this point.");
		unsigned PercentOfOptimizedInstructions =
		(uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;

		if (UnrolledSize <= AbsoluteThreshold &&
		PercentOfOptimizedInstructions >= PercentOfOptimizedForCompleteUnroll) {
		DEBUG(dbgs() << " Can fully unroll, because unrolling will help removing "
		<< PercentOfOptimizedInstructions
		<< "% instructions (threshold: "
		<< PercentOfOptimizedForCompleteUnroll << "%)\n");
		DEBUG(dbgs() << " Unrolled size (" << UnrolledSize
		<< ") is less than the threshold (" << AbsoluteThreshold
		<< ").\n");
		return true;
		}

		DEBUG(dbgs() << " Too large to fully unroll:\n");
		DEBUG(dbgs() << " Unrolled size: " << UnrolledSize << "\n");
		DEBUG(dbgs() << " Estimated number of optimized instructions: "
		<< NumberOfOptimizedInstructions << "\n");
		DEBUG(dbgs() << " Absolute threshold: " << AbsoluteThreshold << "\n");
		DEBUG(dbgs() << " Minimum percent of removed instructions: "
		<< PercentOfOptimizedForCompleteUnroll << "\n");
		DEBUG(dbgs() << " Threshold for small loops: " << Threshold << "\n");
		return false;
		}
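
The decision made by canUnrollCompletely reduces to three checks, which the following standalone sketch restates. It is an illustration, not the patch itself: canUnrollCompletelySketch and kNoThreshold are hypothetical names (the latter stands in for the pass's NoThreshold constant), and the debug output is left out.

```cpp
#include <cstdint>

// Hypothetical sentinel standing in for the pass's NoThreshold constant.
constexpr unsigned kNoThreshold = UINT32_MAX;

// Returns true if complete unrolling is allowed:
//  1. no threshold is set, or
//  2. the unrolled size fits under the small-loop threshold, or
//  3. the unrolled size fits under the absolute threshold and a large enough
//     share of the instructions is expected to be optimized away.
bool canUnrollCompletelySketch(unsigned Threshold, unsigned AbsoluteThreshold,
                               uint64_t UnrolledSize, unsigned NumOptimized,
                               unsigned MinPercentOptimized) {
  if (Threshold == kNoThreshold)
    return true;
  if (UnrolledSize <= Threshold)
    return true;
  // UnrolledSize > Threshold here, so it cannot be zero.
  uint64_t Percent = uint64_t(NumOptimized) * 100 / UnrolledSize;
  return UnrolledSize <= AbsoluteThreshold && Percent >= MinPercentOptimized;
}
```

As a worked case with assumed numbers: an unrolled size of 65 with 36 optimized instructions gives 36 * 100 / 65 = 55 percent. With a threshold of 10, this passes for an absolute threshold of 100 and a minimum of 20 percent, and fails if the absolute threshold drops to 10 or the minimum rises to 80 percent.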

unsigned LoopUnroll::selectUnrollCount(		unsigned LoopUnroll::selectUnrollCount(
const Loop *L, unsigned TripCount, bool PragmaFullUnroll,		const Loop *L, unsigned TripCount, bool PragmaFullUnroll,
unsigned PragmaCount, const TargetTransformInfo::UnrollingPreferences &UP,		unsigned PragmaCount, const TargetTransformInfo::UnrollingPreferences &UP,
bool &SetExplicitly) {		bool &SetExplicitly) {
SetExplicitly = true;		SetExplicitly = true;

// User-specified count (either as a command-line option or		// User-specified count (either as a command-line option or
// constructor parameter) has highest precedence.		// constructor parameter) has highest precedence.
(90 lines not shown)	DEBUG(dbgs() << " Not unrolling loop which contains non-duplicatable"
<< " instructions.\n");		<< " instructions.\n");
return false;		return false;
}		}
if (NumInlineCandidates != 0) {		if (NumInlineCandidates != 0) {
DEBUG(dbgs() << " Not unrolling loop with inlinable calls.\n");		DEBUG(dbgs() << " Not unrolling loop with inlinable calls.\n");
return false;		return false;
}		}

unsigned NumberOfOptimizedInstructions =
approximateNumberOfOptimizedInstructions(L, *SE, TripCount, TTI);
DEBUG(dbgs() << " Complete unrolling could save: "
<< NumberOfOptimizedInstructions << "\n");

unsigned Threshold, PartialThreshold;		unsigned Threshold, PartialThreshold;
		unsigned AbsoluteThreshold, PercentOfOptimizedForCompleteUnroll;
selectThresholds(L, HasPragma, UP, Threshold, PartialThreshold,		selectThresholds(L, HasPragma, UP, Threshold, PartialThreshold,
NumberOfOptimizedInstructions);		AbsoluteThreshold, PercentOfOptimizedForCompleteUnroll);

// Given Count, TripCount and thresholds determine the type of		// Given Count, TripCount and thresholds determine the type of
// unrolling which is to be performed.		// unrolling which is to be performed.
enum { Full = 0, Partial = 1, Runtime = 2 };		enum { Full = 0, Partial = 1, Runtime = 2 };
int Unrolling;		int Unrolling;
if (TripCount && Count == TripCount) {		if (TripCount && Count == TripCount) {
if (Threshold != NoThreshold && UnrolledSize > Threshold) {
DEBUG(dbgs() << " Too large to fully unroll with count: " << Count
<< " because size: " << UnrolledSize << ">" << Threshold
<< "\n");
Unrolling = Partial;		Unrolling = Partial;
		// If the loop is really small, we don't need to run an expensive analysis.
		if (canUnrollCompletely(
		L, Threshold, AbsoluteThreshold,
		UnrolledSize, 0, 100)) {
		Unrolling = Full;
} else {		} else {
		// The loop isn't that small, but we still can fully unroll it if that
		// helps to remove a significant number of instructions.
		// To check that, run additional analysis on the loop.
		UnrollAnalyzer UA(L, TripCount, *SE, TTI, AbsoluteThreshold);
		if (UA.analyzeLoop() &&
		canUnrollCompletely(L, Threshold, AbsoluteThreshold,
		UA.UnrolledLoopSize,
		UA.NumberOfOptimizedInstructions,
		PercentOfOptimizedForCompleteUnroll)) {
Unrolling = Full;		Unrolling = Full;
}		}
		}
} else if (TripCount && Count < TripCount) {		} else if (TripCount && Count < TripCount) {
Unrolling = Partial;		Unrolling = Partial;
} else {		} else {
Unrolling = Runtime;		Unrolling = Runtime;
}		}

// Reduce count based on the type of unrolling and the threshold values.		// Reduce count based on the type of unrolling and the threshold values.
unsigned OriginalCount = Count;		unsigned OriginalCount = Count;
(78 lines not shown)

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-bad-geps.ll

				; Check that we don't crash on corner cases.
				; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=10 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=20 -o /dev/null
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				define void @foo1() {
				entry:
				br label %for.body

				for.body:
				%phi = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%idx = zext i32 undef to i64
				%add.ptr = getelementptr inbounds i64, i64* null, i64 %idx
				%inc = add nuw nsw i64 %phi, 1
				%cmp = icmp ult i64 %inc, 999
				br i1 %cmp, label %for.body, label %for.exit

				for.exit:
				ret void
				}

				define void @foo2() {
				entry:
				br label %for.body

				for.body:
				%phi = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%x = getelementptr i32, <4 x i32*> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
				%inc = add nuw nsw i64 %phi, 1
				%cmp = icmp ult i64 %inc, 999
				br i1 %cmp, label %for.body, label %for.exit

				for.exit:
				ret void
				}

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics.ll

	(11 lines not shown)
	; * If a loop size is between these two thresholds, we only completely unroll			; * If a loop size is between these two thresholds, we only completely unroll
	; it if the estimated number of potentially optimized instructions is high (we			; it if the estimated number of potentially optimized instructions is high (we
	; specify the minimal percent of such instructions).			; specify the minimal percent of such instructions).

	; In this particular test-case, complete unrolling will allow later			; In this particular test-case, complete unrolling will allow later
	; optimizations to remove ~55% of the instructions, the loop body size is 9,			; optimizations to remove ~55% of the instructions, the loop body size is 9,
	; and unrolled size is 65.			; and unrolled size is 65.

	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=10 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=30 | FileCheck %s -check-prefix=TEST1			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=10 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=20 | FileCheck %s -check-prefix=TEST1
	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=30 | FileCheck %s -check-prefix=TEST2			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=20 | FileCheck %s -check-prefix=TEST2
	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=80 | FileCheck %s -check-prefix=TEST3			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=10 -unroll-percent-of-optimized-for-complete-unroll=80 | FileCheck %s -check-prefix=TEST3
	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=100 -unroll-percent-of-optimized-for-complete-unroll=80 | FileCheck %s -check-prefix=TEST4			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-absolute-threshold=100 -unroll-threshold=100 -unroll-percent-of-optimized-for-complete-unroll=80 | FileCheck %s -check-prefix=TEST4

	; If the absolute threshold is too low, or if we can't optimize away requested			; If the absolute threshold is too low, or if we can't optimize away requested
	; percent of instructions, we shouldn't unroll:			; percent of instructions, we shouldn't unroll:
	; TEST1: %array_const_idx = getelementptr inbounds [9 x i32], [9 x i32]* @known_constant, i64 0, i64 %iv			; TEST1: %array_const_idx = getelementptr inbounds [9 x i32], [9 x i32]* @known_constant, i64 0, i64 %iv
	; TEST3: %array_const_idx = getelementptr inbounds [9 x i32], [9 x i32]* @known_constant, i64 0, i64 %iv			; TEST3: %array_const_idx = getelementptr inbounds [9 x i32], [9 x i32]* @known_constant, i64 0, i64 %iv

	(33 lines not shown)