This is an archive of the discontinued LLVM Phabricator instance.

Reimplement heuristic for estimating complete-unroll optimization effects.
ClosedPublic

Authored by mzolotukhin on Apr 2 2015, 7:26 PM.

Details

Summary

This patch reimplements the heuristic that tries to estimate the optimization
benefits of complete loop unrolling.

In this patch I kept the changes minimal - e.g. I removed the code handling
branches and folding compares. That's a promising area, but there are
currently too many open questions to discuss before we can enable it.

Diff Detail

Event Timeline

mzolotukhin updated this revision to Diff 23200.Apr 2 2015, 7:26 PM
mzolotukhin retitled this revision from to Reimplement heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)Apr 2 2015, 7:26 PM
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).
chandlerc edited edge metadata.Apr 9 2015, 1:53 PM

Ok, full pass of comments below.

Generally, this is definitely looking better to me. I think there are still a number of things that could be simplified or refactored, but that can come later. The stuff below is just to try and get this iteration ready to go.

lib/Transforms/Scalar/LoopUnrollPass.cpp
255–256

Indentation seems off here.

258–262

No need to repeat the variable name.

I'd also call the variable "HaveSeenAR" for consistency with LLVM's naming conventions.

271–276

While you're here, can you make the argument be 'L' instead of 'loop'?

285–286

I somewhat dislike an assignment in a return expression. It's really hard to see when reading the code. Could you instead set the variable and then return? Or maybe use a function that aborts the traversal?

290

This is the only raw false here. Everything else that returns false sets IndexIsConstant to false first. If this is correct, could you comment on why?

365

SmallDenseMap?

411–413

Please don't use DenseMap like this. You're inserting a null value for every AddrOp you visit.

You want to use exactly the logic used above for the LHS and RHS expressions:

Value *AddrOp = I.getPointerOperand();
if (!isa<Constant>(AddrOp))
  if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
    AddrOp = SimplifiedAddrOp;

Better yet, just hoist this entire pattern into a helper called 'simplifyValue' or some such?
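
E.g., a minimal sketch (just to illustrate; the map name follows the snippet above):

Value *simplifyValue(Value *V) {
  // Return the simplified form of V if we have one cached, otherwise V.
  if (!isa<Constant>(V))
    if (Value *Simplified = SimplifiedValues.lookup(V))
      return Simplified;
  return V;
}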

413–420

In what case is the base address null? Adding null base addr records to the cache doesn't really make sense to me.

414–416

Don't do two map lookups. Look up the key once (using find(), so a miss doesn't just give you a null value) and then test the iterator and use the iterator.
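
E.g. (illustrative sketch; SCEVGEPDescriptor is the struct from this patch, the other names are assumed):

auto CacheIt = SCEVCache.find(AddrOp);
if (CacheIt == SCEVCache.end())
  return false; // Nothing cached for this address.
// Reuse the iterator instead of doing a second lookup.
const SCEVGEPDescriptor &Desc = CacheIt->second;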

425–426

What ensures that this is always safe?

If something does ensure that this is always safe, I assume it's established when populating the structure. If that's the case, why can't we store only the unsigned variant?

428

What happens when "Step * Iteration" overflows? What about when "(Start + Step * Iteration)" overflows?

432–435

Under what circumstances is this not a constant?

If this is genuinely not a constant, should we really be considering the load "optimized"?

(Also, the cast<Value> shouldn't be needed?)

444–449

Would this second comment paragraph go better on the SCEV evaluation tool above? Or sunk into the implementation below? It seems kind of out of place here.

449

If the base address is a constant, the GEP will also fold away, so we should be able to mark it as optimized? (And we should be able to do this on each iteration.)

451

Shouldn't you rinse this through the simplification map?

452–453

You should probably comment specifically that we expect to re-visit the same instructions repeatedly (once per iteration) and so we only want to do iteration-independent SCEV queries and computations once. I'd also probably extract all of the SCEV computation stuff below into a separate member function that you can comment as being iteration independent, etc. Then you can structure the visit somewhat more naturally.

489–498

I find it really weird to count optimized instructions rather than counting instructions that will remain *after* optimizations.

502–508

Rather than setting UnrolledLoopSize to UINT_MAX below to signal some inability to reasonably compute the unrolled size estimation, why not return true or false here? If this returns false, we have no useful data about the loop. Move on. If this returns true, then you can query for the detailed numbers.

504–506

I think you should have a FIXME to eventually remove the max iteration count to analyze. Once we shake the bugs out of the algorithm, it shouldn't be necessary. We should be willing to analyze any number of iterations as long as the un-optimized resulting instruction count is below a threshold.

571–575

I would handle all of this below where you're actually dealing with percentages. Handling it here seems really surprising and hard to understand.

694–707

Can you explain some of your motivation for having the double threshold and percentage query? It seems really awkward to implement, so I'm curious what the need is here. If we could get away with just a flat threshold, it'd make me happy. =]

842–860

I may just be forgetting where this is handled, but do we somewhere short-circuit the case where the total size of the loop body * the trip count is already below the threshold? Because we should. We shouldn't go and do the expensive analysis unless we at least have a large enough loop to be on the fence.

mzolotukhin added inline comments.Apr 9 2015, 6:15 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
694–707

The idea is the following: currently we have a threshold for unrolling small loops (~200 instructions). What I want to add is the possibility to go beyond this threshold, but only if that gives a performance benefit. E.g. if the unrolled loop would be 500 instructions but 30% faster than the original loop, then we want to unroll it. But we do not want to unroll this loop if it would become only 5% faster (in terms of the cost of executed instructions). On the other hand, we don't want to unroll loops with huge trip counts, even if the resulting code seems to be faster. I.e. if unrolling would help to eliminate 50% of the instructions, but the trip count is 10^9, we definitely don't want to unroll it.

And several examples to illustrate the idea:
a)

int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

If we completely unroll the loop, we'll get something like:

t = b[0]*c[0];
a[0] = t * d[0];
t = b[1]*c[1];
a[1] = t * d[1];
...
t = b[499]*c[499];
a[499] = t * d[499];

which would be simplified to:

a[0] = 0; // b[0] == 0
a[1] = 0; // b[1] == 0
...
a[498] = 0;  // b[498] == 0
a[499] = c[499]*d[499]; //b[499] == 1

That is, unrolling helps to remove ~50% of the instructions in this case - and that's not about code size, it's about execution time, because in the original loop we have to execute every MUL instruction, since we don't know the exact value of b[i].

b)

/* The same example as before, but with a huge trip count. */
int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500000; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

We want to give up on this loop, because the unrolled version would be way too big. We might have some problems compiling it, and even if we compile it successfully, we might be hit hard by cache/memory effects.

c)

/* The same example as (a), but unrolling doesn't help to simplify anything. */
int b[] = {6,2,3...4,7}; // no 0 or 1 values
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

We don't want to just start unrolling any loop with a higher trip count than we unrolled before if that doesn't promise any performance benefit.

So, to distinguish (a) and (b), we use 'AbsoluteThreshold'. To distinguish (a) and (c) we use percentage.
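
For concreteness, here is a sketch of how the two checks could combine (canUnrollCompletely and AbsoluteThreshold appear in this patch; the other names and the exact structure are illustrative):

bool canUnrollCompletely(unsigned UnrolledSize,
                         unsigned NumberOfOptimizedInstructions,
                         unsigned Threshold, unsigned AbsoluteThreshold,
                         unsigned MinPercentOfOptimized) {
  if (UnrolledSize <= Threshold)
    return true;  // Small loop: unroll as we always did.
  if (UnrolledSize > AbsoluteThreshold)
    return false; // Rules out huge trip counts, i.e. case (b).
  // In between, require a big enough win, which rules out case (c).
  uint64_t PercentOfOptimizedInstructions =
      (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;
  return PercentOfOptimizedInstructions >= MinPercentOfOptimized;
}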

mzolotukhin edited edge metadata.
  • Move check for possible NumberOfOptimizedInstructions*100 overflow into canUnrollCompletely.
  • Hoist computing SCEV expressions out of the main traversal loop. This step is semantically different from visiting instructions to check whether they could become redundant after unrolling: we need to do it only once, while the other visitors must run for every simulated iteration of the loop.
  • Make analyzeLoop return bool.
  • Don't run the analysis if we can unroll the loop even without it.
  • Make SCEVGEPDescriptor's fields Start and Step uint64_t (previously, they were APInt).
  • Compute Index in uint64_t and make sure operands fit into 32-bit int.
  • Other small changes.

Hi Chandler,
Please find my answers inline. The patch is updated correspondingly; is it OK to commit it?

lib/Transforms/Scalar/LoopUnrollPass.cpp
255–256

Fixed.

258–262

Fixed.

271–276

Fixed.

285–286

Fixed.

290

I rewrote the return statements in the function; now they use raw true/false values. I also return false instead of true in if (isa<SCEVConstant>(S)) - that doesn't matter, since we can't "follow" into the SCEVConstant anyway, but false is more consistent here.

365

Makes sense, fixed.

411–413

Thanks, fixed! Though I didn't add a separate function for it.

413–420

You are right, it can't be null (it's checked when we prepare a new entry for SCEVCache). Fixed.

414–416

Fixed.

425–426

Thanks, I added constraints on the operands. Also, we now store the unsigned variant instead of APInt, as you suggested.

428

I made Index, Step, and Start uint64_t, while the values in Start, Step, and Iteration can't exceed the 32-bit maximum. That should prevent overflows.
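
A sketch of the reasoning (assuming all three values are bounded by 2^32 - 1):

// Start, Step, Iteration < 2^32, so:
//   Step * Iteration         <= (2^32 - 1)^2               < 2^64
//   Start + Step * Iteration <= (2^32 - 1) + (2^32 - 1)^2  < 2^64
// Hence the uint64_t computation cannot wrap.
uint64_t Index = Start + Step * Iteration;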

432–435

Thanks, fixed!

449

We don't support such an optimization for GEPs (for now). We can add it later, and it'll naturally go into visitGetElementPtr, which is currently removed.

452–453

I think that we want to return to the original cacheSCEVResults approach. With that, we explicitly distinguish actions we want to do once (compute SCEVs and store the interesting ones) from actions we want to perform on every simulated loop iteration (like trying to optimize LoadInst/BinaryOp/etc.). So, I added cacheSCEVResults and removed visitGetElementPtr.
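
Roughly, the one-time step could look like this (a sketch, not the committed code; BaseAddress is assumed to be found by a separate visitor, and the 32-bit range checks on Start/Step are omitted for brevity):

void cacheSCEVResults(Value *Addr, Value *BaseAddress) {
  // Runs once per address: record only iteration-independent facts.
  const SCEV *S = SE.getSCEV(Addr);
  if (auto *AR = dyn_cast<SCEVAddRecExpr>(S))
    if (auto *StartC = dyn_cast<SCEVConstant>(AR->getStart()))
      if (auto *StepC = dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE)))
        SCEVCache[Addr] = {BaseAddress,
                           StartC->getValue()->getZExtValue(),
                           StepC->getValue()->getZExtValue()};
}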

489–498

I can change it, but in fact it doesn't sound so weird to me :)

502–508

That's a good idea, fixed according to your suggestion!

504–506

Is it possible that we can optimize all instructions in the loop body, and thus never reach the threshold? I think not, because we would have at least one instruction (the branch) in the loop body, but I'm not confident here - maybe there are some weird cases (e.g. the cost of the branch is 0). If it's guaranteed that the cost of the loop body is always > 0, then we can remove this limit.

571–575

Fixed!

842–860

Good point, fixed!

Rebase to trunk:

  • Reimplement heuristic for estimating complete-unroll optimization effects.
  • Address Chandler's remarks.
  • Address Chandler's remarks.
  • Address Chandler's remarks.
  • Fix merge fail.
  • Address Chandler's remarks.
  • Address Chandler's remarks.

Rebase on trunk.

  • Add a helper function lookupSimplifiedValue.
mzolotukhin updated this revision to Diff 25389.May 8 2015, 4:37 PM
  • Rebase on recent trunk.
  • Prevent precision loss in UnrolledSize.
  • Remove unused VisitedGEPs.
  • Fix a typo in comment.
chandlerc requested changes to this revision.May 11 2015, 3:28 PM
chandlerc edited edge metadata.

Whew! Back to this at last. Sorry for the huge delay.

Lots of comments below, but I've marked some as good candidates for follow-up patches.

Can you let me know if it makes sense for me to take a look at the DCE stuff or if we should focus on getting this one landed first?

lib/Transforms/Scalar/LoopUnrollPass.cpp
277

Please add a comment to this function explaining what it's trying to do.

333–334

Rather than all of this, you can just use a SmallSetVector<BasicBlock *, 16>. I would somewhat prefer that to the typedef...

339

Just use a named inner struct -- typedef-ing structs is only necessary in C.

341–342

While 64 bits should be enough for any common case, if the SCEV code has APInts I would just continue to use them here so that we have full fidelity.
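
I.e. something like (a sketch; the field names follow this thread):

struct SCEVGEPDescriptor {
  Value *BaseAddr;
  APInt Start; // Keep SCEV's full-width constants instead of uint64_t.
  APInt Step;
};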

381–389

The spacing and the mixture of doxygen and non-doxygen styles seem really messy here.

416

'd' isn't a very helpful name (aside from not matching the variable naming conventions).

418–420

Comment here to remind the reader that you're checking for the specific types of SCEVGEP loads that can be folded completely to a constant.

429–430

We at least need a comment or FIXME, as we shouldn't return false here. A load past the end of sequential constant data is an error, and so we should be free to fold it to nothing for the purpose of loop unroll cost estimation.

447–448

Why not a range-based loop here?

451–455

It feels like we could probably hoist some of this out of the loop? Feel free to just leave a FIXME and not deal with it in this patch.

460

Again, 'd' is a bad name here.

Actually, I don't know why you create the descriptor this early? It seems like this region of code could just use the base addr from the visitor.

477

Is this to prevent overflow of the offset computations later? If so, comment that please. If not, what is it for?

480–483

I feel like this could just be an insert call? Or if you'd rather, something like:

SCEVCache[V] = {Visitor.BaseAddress, StartAP.getLimitedValue(),
                StepAP.getLimitedValue()};

508–515

I feel like this should probably be a doxygen comment.

518–519

Why do we zero these here rather than in the constructor?

539–540

This seems vacuous due to the requirement of a terminator...

546–551

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

694–698

I would find it much clearer to just write the "percentage" check below in a way that wouldn't overflow:

uint64_t PercentOfOptimizedInstructions =
    (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;

702

I don't think the comment here is helping. I would just add an assert about it above, after the test.

855

Why do you need this? I'm surprised this isn't just directly using the UA's values?

This revision now requires changes to proceed.May 11 2015, 3:28 PM
mzolotukhin edited edge metadata.
  • Add a comment before follow().
  • Replace BBSetVector with SmallSetVector<BasicBlock *, 16>.
  • Doxygen-ify some comments.
  • Remove unnecessary variable 'SCEVGEPDescriptor d'.
  • Use a range-based loop.
  • Replace 'd' with a meaningful name.
  • Add some comments and FIXMEs.
  • Initialize NumberOfOptimizedInstructions and UnrolledLoopSize in constructor.
  • Rewrite expression to avoid overflow.
  • Remove no longer needed overflow check.
  • Replace a comment with an assert.
  • Use UA.UnrolledLoopSize instead of min(UA.UnrolledLoopSize, UnrolledSize).
  • Add FIXME for out-of-bound access handling.
  • Remove redundant if(BB->empty()) check.
  • Replace typedef with a named struct.
  • Use APInt instead of uint64_t.
  • Add sanity checks before accessing SCEV.

Hi Chandler,

Thanks for the comments! I believe I've addressed all of them in the source, except this one:

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

I kept counting the cost inside the visit function, because we might have three cases there:

  1. the instruction was simplified to a constant (e.g. x = a[0] * y becomes x = 0 * y = 0)
  2. the instruction was simplified, but not to a constant (e.g. x = a[0] + y becomes x = 0 + y = y)
  3. the instruction wasn't simplified

In case we want to distinguish (1) and (2) outside the visit function, a bool won't be enough. Currently we wouldn't lose much by merging them, but I didn't want to limit ourselves here from the very beginning.
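
For concreteness, a rough sketch of the three cases for a binary operator (illustrative only; simplifyValue is a hypothetical helper, and the cost accounting is schematic):

Value *LHS = simplifyValue(I.getOperand(0));
Value *RHS = simplifyValue(I.getOperand(1));
Value *SimpleV = SimplifyBinOp(I.getOpcode(), LHS, RHS, DL);
if (!SimpleV)
  return; // Case 3: not simplified; the instruction survives unrolling.
if (Constant *C = dyn_cast<Constant>(SimpleV))
  SimplifiedValues[&I] = C; // Case 1: folded to a constant.
// Cases 1 and 2: the instruction itself goes away either way.
NumberOfOptimizedInstructions += TTI.getUserCost(&I);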

chandlerc accepted this revision.May 12 2015, 9:11 AM
chandlerc edited edge metadata.

Generally, I think this can go in. There are a bunch of things I think should be cleaned up here, but they're fairly minor; I'm happy to just fix those myself and, for a few that have more impact, send you patches.

Hi Chandler,

Thanks for the comments! I believe I've addressed all of them in the source, except this one:

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

I kept counting the cost inside the visit function, because we might have three cases there:

  1. the instruction was simplified to a constant (e.g. x = a[0] * y becomes x = 0 * y = 0)
  2. the instruction was simplified, but not to a constant (e.g. x = a[0] + y becomes x = 0 + y = y)
  3. the instruction wasn't simplified

In case we want to distinguish (1) and (2) outside the visit function, a bool won't be enough. Currently we wouldn't lose much by merging them, but I didn't want to limit ourselves here from the very beginning.

I don't think we need to distinguish between them. The key thing to realize is that if 'y' above were inside the loop body, we would already have accounted for its cost. The critical thing is whether we can fold 'x' away.

While perhaps we'll want more advanced heuristics, I would rather assume not and simplify the code accordingly until a real use case arrives. (YAGNI, essentially.)

If you agree, I'm happy to make this change.

The only specific change I'd like to request in a follow-up is to ensure that some of the test cases you mentioned in email exercising the percentage threshold etc. are actually checked in as test cases.

Thanks!

This revision is now accepted and ready to land.May 12 2015, 9:11 AM
This revision was automatically updated to reflect the committed changes.

Thanks, Chandler!

I've committed the patch and will follow up with the tests today.

There are a bunch of things I think should be cleaned up here, but they're fairly minor and I'm happy to just fix those and for a few that have more impact, send you patches.

Sure, go ahead!

Michael