This is an archive of the discontinued LLVM Phabricator instance.

I haven't looked into the details of the patch yet, but (at least from the function name) it looks like this is very similar to what LoopUnrollAnalyzer does (see lib/Analysis/LoopUnrollAnalyzer.cpp). Have you considered adding this functionality there (and what's missing there to catch this)?

Michael

Hi Michael,

The new function is more about deciding whether we need to unroll or not. lib/Transforms/Scalar/LoopUnrollPass.cpp looks like a good place for it.

Thanks,
Evgeny

PING.

Hi Evgeny,

Why isn't the functionality from lib/Analysis/LoopUnrollAnalyzer.cpp enough for your case? It should be able to predict that phis would be removed after unrolling (possibly along with other instructions).

Michael

Add recurrent xor case.
Move calculations to cost estimation function.

Set "disable unroll" for loops unrolled by "force unroll". The reason is that with LTO the unroll pass can be called multiple times, and each time we would try to unroll again until the threshold is exceeded. This is not critical, but it makes unroll behavior dependent on LTO.
Add missing unit tests.
Fix calculations for phis used outside of the loop only.

PING.

PING 2.

PING 3.

mzolotukhin added inline comments.Dec 14 2016, 4:45 PM

lib/Transforms/Scalar/LoopUnrollPass.cpp
505–509	This looks very hacky. What is the motivation of this change? How often do we see such loops? If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free. There are more cases than just xor - we can multiply by -1, and we should be able to handle in a similar way. In the current form the code is not easily extensible to handle new cases.

evstupac added inline comments.Dec 14 2016, 6:08 PM

lib/Transforms/Scalar/LoopUnrollPass.cpp
505–509	What is the motivation of this change? How often do we see such loops?

Not each test looks like this, but there are a couple where we switch states: state = st[s^=1]; which becomes invariant with some other calculations after unroll.

If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free.

The same is valid for the code above, where complete unroll simplifies phi. Why is phi simplified here? I just did the same for XOR. And yes, we can multiply by -1, do i&1, i/2,... but we need a start point which depends on unroll factor and iteration. The other solution is to pass the unroll factor in addition to the iteration number and move simplification there. Do you think this is better?

What is the motivation of this change? How often do we see such loops?

Not each test looks like this, but there are couple where we switch states:

state = st[s^=1];
which becomes invariant with some other calculations after unroll.

I realize that it's possible to write such a test. My question was how frequent are such cases in real programs. Can you provide statistics from SPEC/llvm-testsuite that would justify the change?

If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free.

The same is valid for the code above where complete unroll simplify phi. Why phi is simplified here?

PHIs are present in every loop, in contrast to XORs.

The other solution is to pass, unroll factor in addition to iteration number and move simplification there.
Do you think this is better?

I haven't put much thought to this yet, because I'm still not convinced in the need of this change. It's possible though that if we decide to go this way, we'll need to do some prep-work first to make the code look nice. The main part of simplification code is already there (with PHI-nodes getting an extra attention due to their special and very important nature).

(was unable to answer in phabricator previously, so just put this from mail thread here)

I realize that it's possible to write such a test. My question was how frequent are such cases in real programs. Can you provide statistics from SPEC/llvm-testsuite that would justify the change?

I'll get the results for llvm test-suite. There is not much difference
on spec2000. There are gains on eembc benchmarks.

PHIs are present in every loop, in contrast to XORs.

The patch touches only recursive XORs which also go through PHI.
PHIs for optimization after complete unroll are in countable loops. My
patch deals with uncountable (a list loop for example).

I'm not sure I can share all the cases, but I've found good public example of "xor" in the loop from jpeg (h2v1_downsample and h2v2_downsample):
https://github.com/cloudflare/jpegtran/blob/master/jcsample.c

For the case above runtime unroll should work fine, but there are cases where the loop upper bound is a structure field load. If so, the loop becomes uncountable and we are unable to apply runtime unroll.
Generally there are multiple cases where runtime unroll can fail, so why not force unroll if we can prove profitability?

Add phi cycle function to explain phi simplification.

Regarding performance data.
Initially I wanted to unroll only loops where the previous iteration's value is reused. Just with this change I get the following:
197.parser up to 5% depending on CPU, and 10% to 30% on a set of internal benchmarks.
With xor and sub enabled, most benchmarks either build the same or show no performance change. However, I still believe this fits the current functionality.

One thing I noticed when adding cost to regular unroll (not complete): all loops which count something got unrolled:

while(smth)
  n++;
Use(n);

And unroll saves at least 1 instruction for such loops.

evstupac added reviewers: hfinkel, mkuper.Mar 2 2017, 2:47 PM

evstupac added subscribers: hfinkel, mkuper, Farhana.

PING.
Let's come to a conclusion if this is acceptable or not.
The cases that are covered here:

1. Uncountable loops where the previous value is reused (saves one+ instruction for each value).
2. Uncountable loops that count smth:

while(smth) {
  body();
  n++;
}

to:

while(smth) {
  body();
  if (!smth) goto exit2;
  body();
  n+=2;
}
goto loop_exit;
exit2:
n++;
loop_exit:

Saves one+ "add" in the loop.

3. Uncountable loops switching states (not that frequent):

while(smth) {
  body();
  s ^= n;
}
while(smth) {
  body();
  s = -s;
}

Potentially saves a lot (as constant values could simplify several instructions).
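
As a rough illustration of the state-switching case, the sketch below hand-unrolls an `s = -s` loop by 2. It is not taken from the patch: the function names and the zero-terminated input (used so that the trip count depends on data) are assumptions made for the example.

```cpp
// Rolled form: the sign s flips on every iteration, so each step needs a
// multiply by s and a negation of s. The loop ends at the first zero
// element, so its trip count is only known by walking the data.
long alternatingSumRolled(const int *p) {
  long acc = 0;
  int s = 1;
  while (*p) {
    acc += s * *p;
    s = -s;
    ++p;
  }
  return acc;
}

// Hand-unrolled by 2: within one unrolled iteration s is +1 in the first
// copy and -1 in the second, so both the multiply and the negation fold
// away. The early exit between the copies keeps the semantics unchanged.
long alternatingSumUnrolled2(const int *p) {
  long acc = 0;
  while (*p) {
    acc += *p; // s == +1 in this copy
    ++p;
    if (!*p)
      break;
    acc -= *p; // s == -1 in this copy
    ++p;
  }
  return acc;
}
```

Both functions should return the same value for any zero-terminated input; the unrolled one simply no longer carries `s` around the loop.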

Hi,

I'm still not convinced that we need this functionality. The loops that you mentioned can easily be handled by using unroll pragma if users really care about their performance. The patch might introduce no performance/compile-time regressions, but we shouldn't forget about the increased source code complexity.

Michael

Hi Michael,

but we shouldn't forget about the increased source code complexity.

The patch mostly reuses existing code for full unroll.

Regarding PHI Cycle, it even improves the current heuristic, as full unroll considers
"s^=1;" to be constant "1" at all iterations >0. However, it is 0 on even and 1 on odd iterations.
And I'm ok to drop this part to make the code shorter.

Maybe we can add uncountable loops unroll under -O3?

Thanks,
Evgeny

In D21720#691221, @evstupac wrote:

Hi Michael,

but we shouldn't forget about the increased source code complexity.

The patch mostly reuse existing code for full unroll.

Regarding PHI Cycle it even improve current heuristic, as full unroll consider
"s^=1;" is constant "1" at all iterations >0. However it is 0 on even and 1 on odd.
And I'm ok to drop this part to make code shorter.

Maybe we can add uncountable loops unroll under -O3?

I agree. -O3 would be the right place for this.

Thanks,
Evgeny

In D21720#691193, @mzolotukhin wrote:

Hi,

I'm still not convinced that we need this functionality. The loops that you mentioned can easily be handled by using unroll pragma if users really care about their performance. The patch might introduce no performance/compile-time regressions, but we shouldn't forget about the increased source code complexity.

Michael

I don't think that "easily be handled by using unroll pragma" is the right standard here. The same argument can be made for anything. We should only really require a pragma when we cannot design a reasonable heuristic (at this stage in the pipeline). This patch does not seem that large. I can review next week.

full unroll consider "s^=1;" is constant "1" at all iterations >0. However it is 0 on even and 1 on odd.

If that's the case, I'd be happy to review a separate patch that just fixes that.

I don't think that "easily be handled by using unroll pragma" is the right standard here. The same argument can be made for anything. We should only really require a pragma when we cannot design a reasonable heuristic (at this stage in the pipeline). This patch does not seem that large. I can review next week.

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s. While it looks kind of generic, I can see no way how it can be extended in future. E.g. it doesn't seem to be able to support another way to write the same idiom: s = i%2 or s = i&1, and if we'd like to support it, we'll need yet another approach.

Your point about using pragma is generally valid, but here is my view on it. While in general we do need to try to cover as many cases without pragmas as possible, we need to provide a general implementation that potentially can help other cases as well. Patching the compiler here and there just to cover cases one by one would lead to unmaintainable and inefficient code in the end. Also, unlike many other loops in a loop with s ^=1 the user would more likely correctly guess a favorable unroll factor (2), just by the nature of the computation. In more complicated loops a help from the compiler is needed much more, because it might be hard to guess the best unroll factor.

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i. Honestly, I don't know how hard it would be to implement this, and maybe the efforts and complexity won't be worth the gain, but that's what I'd call a general approach, rather than targeting a single case.
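
The "analyze N sequential iterations" idea can be approximated outside the compiler by composing the per-iteration update with itself and checking when it becomes the identity. The sketch below is only a sample-based check, not an algebraic proof and not how SCEV works; `findCycleLength` and its parameters are hypothetical names.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Returns the smallest N in [1, maxN] such that applying `step` N times maps
// every sample value back to itself, or 0 if no such N is found.
// This only tests the given samples, so a result is evidence of a cycle,
// not a proof that the update is periodic for all inputs.
unsigned findCycleLength(const std::function<int32_t(int32_t)> &step,
                         const std::vector<int32_t> &samples, unsigned maxN) {
  std::vector<int32_t> cur = samples;
  for (unsigned n = 1; n <= maxN; ++n) {
    for (auto &v : cur)
      v = step(v);
    if (cur == samples)
      return n;
  }
  return 0;
}
```

With samples such as {-5, 0, 1, 7}, the updates `x ^ 1` and `-x` both give a cycle length of 2, which matches the unroll factor discussed in the thread, while `x + 1` gives 0 (no cycle found within the bound).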

Michael

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s.

Why is PhiCycle not general? That could be InstCycle instead (for i&3, i&7, i%N,....).
"i%2", "i&1" are convertible to "s^=1".

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i.

The current approach does this, but starting from iteration 0 (taking into account that some instructions are cyclic).

Suppose we skip this, what about:

1. Uncountable loops where the previous value is reused (saves one+ instruction for each value)?
2. Uncountable loops that count smth?

In D21720#696992, @evstupac wrote:

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s.

Why PhiCycle is not general? That could be InstCycle instead (for i&3, i&7, i%N,....).
"i%2", "i&1" are convertible to "s^=1".

Sure, but "are convertible to" is not helpful if they're not. Whether that's a useful canonical form might be a separate discussion (maybe it is if that's the only use of the induction variable). In any case, I agree that we should have something that is insensitive to how this is written.

However, we still need to make sure we're doing this in a way that is profitable. The point of making sure to unroll by an even factor for these cycle-2 recurrences is that it allows us to completely remove the PHI. It is not clear to me that the more-general case shares this property without additional work (by which I mean canonicalization work).

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i.

But you need more than this. You need to know how the intermediate iterations are related so that you can eliminate the PHI. This is especially true for the remainder case (because remainders are expensive compared to &, for example).

I suspect that SCEV would have a hard time doing this because the relations are not algebraic (although it might be able to do something interesting with the remainder case).

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?
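
To make the "repeats in a fixed pattern" property concrete, here is a hand-unrolled sketch for `i % 3`. The weight table, the function names, and the zero-terminated input are assumptions chosen for the example, not code from the patch.

```cpp
// Rolled form: i % 3 is recomputed on every iteration to pick a weight.
// The loop stops at the first zero element, so the trip count is data
// dependent.
long weightedSumRolled(const int *p) {
  static const int w[3] = {1, 2, 4};
  long acc = 0;
  for (unsigned i = 0; *p; ++i, ++p)
    acc += w[i % 3] * *p;
  return acc;
}

// Hand-unrolled by 3 (the length of the pattern): i % 3 is 0, 1, 2 in the
// three copies, so the remainder, the table lookup, and the counter i all
// disappear. Early exits between the copies preserve the semantics.
long weightedSumUnrolled3(const int *p) {
  long acc = 0;
  while (*p) {
    acc += 1 * *p; // i % 3 == 0
    ++p;
    if (!*p)
      break;
    acc += 2 * *p; // i % 3 == 1
    ++p;
    if (!*p)
      break;
    acc += 4 * *p; // i % 3 == 2
    ++p;
  }
  return acc;
}
```

The simplification only works because `i` starts at 0 and the unroll factor equals the pattern length; with a different factor the remainder would not be constant in each copy.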

Current approach do this, but starting from 0 iteration (taking in account that some instruction are cycle).

Suppose we skip this, what about:

Uncountable loops where previous value is reused (save one+ instruction for each value)?

Uncountable loops, that counts smth?

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?

"i/C" could also be optimized when "i" starts from "k*C". However we'll need to lower it (replace with new IV) in later optimizations (for example LSR).

What about cases 1. and 2. ?
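
The "i/C" remark can be sketched the same way. Assuming `i` starts at 0 (a multiple of C) and C is 2, the quotient is the same in both copies of an unroll-by-2 body and can be replaced by a new induction variable that advances once per unrolled iteration. All names below are hypothetical.

```cpp
// Rolled form: a division by 2 on every iteration.
long quotientSumRolled(const int *p) {
  long acc = 0;
  for (unsigned i = 0; *p; ++i, ++p)
    acc += static_cast<long>(i / 2) * *p;
  return acc;
}

// Hand-unrolled by 2: i / 2 has the same value q in both copies, so q
// replaces the division and is bumped once at the end of the unrolled body.
long quotientSumUnrolled2(const int *p) {
  long acc = 0;
  long q = 0; // new induction variable standing in for i / 2
  while (*p) {
    acc += q * *p;
    ++p;
    if (!*p)
      break;
    acc += q * *p;
    ++p;
    ++q;
  }
  return acc;
}
```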

In D21720#697188, @evstupac wrote:

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?

"i/C" could also be optimized when "i" starts from "k*C". However we'll need to lower it (replace with new IV) in later optimizations (for example LSR).

What about cases 1. and 2. ?

Case 1 sounds good, but maybe our existing heuristic should get it? I don't understand what you mean in case 2.

Case 1 sounds good, but maybe our existing heuristic should get it? I don't understand what you mean in case 2.

Actually case 1 is the most simple and profitable. Maybe we shall start with it only?
Regarding case 2:

It covers loops which count how long some condition is true. For example (how many elements are in the list):

while (curr) {
  len++;
  curr = curr->next;
}

After unroll by 2:

while (curr) {
  curr1 = curr->next;
  if (!curr1) goto exit1;
  curr = curr1->next;
  len += 2;
}
goto loop_exit;
exit1:
len++;
loop_exit:

but maybe our existing heuristic should get it?

No. We don't do unroll of uncountable loops at all.
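
For reference, the list-length example can be written as compilable C++. This is a sketch of the transformation described above, written by hand; `Node`, `lengthRolled`, and `lengthUnrolled2` are illustrative names.

```cpp
#include <cstddef>

struct Node {
  Node *next = nullptr;
};

// Rolled form: one pointer load, one compare, and one add per element.
std::size_t lengthRolled(const Node *curr) {
  std::size_t len = 0;
  while (curr) {
    len++;
    curr = curr->next;
  }
  return len;
}

// Hand-unrolled by 2: the two increments merge into a single "len += 2".
// The early exit handles lists with an odd number of elements.
std::size_t lengthUnrolled2(const Node *curr) {
  std::size_t len = 0;
  while (curr) {
    const Node *curr1 = curr->next;
    if (!curr1)
      return len + 1; // odd tail: count the last element and leave
    curr = curr1->next;
    len += 2;
  }
  return len;
}
```

The saving is one add per pair of elements, while both conditional branches remain, which is why the thread debates whether this alone justifies unrolling.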

zzheng added a subscriber: zzheng.Mar 10 2017, 10:27 AM

zzheng added inline comments.

lib/Transforms/Scalar/LoopUnrollPass.cpp
94	Is this description copied from above flag? allow-force sounds like forced unrolling... how about name it UnrollAllowUncountable (unroll-allow-uncountable)?
test/Transforms/LoopUnroll/unroll-force.ll
1 ↗	(On Diff #88080)	Maybe name this file 'unroll-uncountable.ll'?

What is the case 1?

For loops with unknown trip count (that's the case 2, right?) I don't see why we need to add any additional analyzers (analyzeLoopUnrollCost). Removing one instruction by unrolling by 2 is not better than the worst case for partial unroll of a loop with a known trip count:

for (i = 0; i < N; i++) {
  ...
}

But since we sometimes partially unroll such loops, I don't currently see a reason why we can't do this for loops with unknown trip-count as well. However, we don't need additional thresholds/knobs for it, thresholds for partial unroll should do it already.

What is the case 1?

Uncountable loops where previous value is reused (save one+ instruction for each value)
The example is in lit test (motivation comes from list loops).

For loops with unknown trip count (that's the case 2, right?)

No. All cases are about uncountable loops.

I don't see why we need to add any additional analyzers (analyzeLoopUnrollCost). Removing one instruction by unrolling by 2 is not better than the worst case for partial unroll of a loop with a known trip count:

We can skip this and do as in my initial patch (just run a function that returns the unroll count if it is profitable to unroll). That way cases 2 and 3 are skipped. I changed the patch to call analyzeLoopUnrollCost to make it more general.

But since we sometimes partially unroll such loops, I don't currently see a reason why we can't do this for loops with unknown trip-count as well. However, we don't need additional thresholds/knobs for it, thresholds for partial unroll should do it already.

The patch does not introduce a new threshold. If we enable unroll for all uncountable loops it will result in a dramatic code size increase (in most cases without any performance benefit).

evstupac added inline comments.Mar 10 2017, 12:02 PM

lib/Transforms/Scalar/LoopUnrollPass.cpp
94	Right. Thanks for catching this. Will fix. "unroll-allow-uncountable" looks good.
test/Transforms/LoopUnroll/unroll-force.ll
1 ↗	(On Diff #88080)	Yes. Makes sense.

Uncountable loops where previous value is reused (save one+ instruction for each value)

What do you mean by uncountable? How is it different from a loop with unknown trip count?
What is a previous value? Is it a value from a previous iteration? If that's the case, any induction variable satisfies that condition.

No. All cases are about uncountable loops.

What's the difference between these two cases? I've gotten totally confused now.

We can skip this and do like in my initial patch (just run a function that return unroll count if it is profitable to unroll). That way cases 2 and 3 are skipped. I change the patch to call `analyzeLoopUnrollCost' to make more general.

What is the case 3?

The patch do not introduce a new threshold. If we enable unroll for all uncountable loops it will result in dramatic code size increase (for most cases without performance value).

Exactly. And unroll of any loop that has at least one IV will save at least one instruction, because we'll have a phi for the IV and we can fold this phi in between unrolled iterations. So, my point was that "saving one instruction" should not be enough to unroll a loop.

What do you mean by uncountable? How is it different from a loop with unknown trip count?

L is uncountable <=> isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is true.
For a loop with unknown trip count, isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is false.
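
At the source level the distinction looks roughly like the two loops below. The comments state the classification one would expect from the definition above; they are an assumption about typical behavior, not the output of an actual ScalarEvolution run, and the names `Cell`, `sumArray`, and `sumList` are made up for the example.

```cpp
struct Cell {
  int value;
  const Cell *next;
};

// Unknown trip count: n is not a compile-time constant, but the number of
// backedges taken can be expressed symbolically in terms of n, so the
// backedge-taken count is expected to be computable.
long sumArray(const int *a, unsigned n) {
  long acc = 0;
  for (unsigned i = 0; i < n; ++i)
    acc += a[i];
  return acc;
}

// Uncountable: the exit condition depends on a pointer loaded in the loop
// body, so no expression for the trip count is expected to exist before
// the loop runs.
long sumList(const Cell *c) {
  long acc = 0;
  while (c) {
    acc += c->value;
    c = c->next;
  }
  return acc;
}
```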

What's the difference between these two cases? I've gotten totally confused now.
What is the case 3?

The cases that are covered here:

1. Uncountable loops where the previous value is reused (saves one+ instruction for each value).
2. Uncountable loops that count smth:

while(smth) {
  body();
  n++;
}

to:

while(smth) {
  body();
  if (!smth) goto exit2;
  body();
  n+=2;
}
goto loop_exit;
exit2:
n++;
loop_exit:

Saves one+ "add" in the loop.

3. Uncountable loops switching states (not that frequent):

while(smth) {
  body();
  s ^= n;
}
while(smth) {
  body();
  s = -s;
}

Potentially saves a lot (as constant values could simplify several instructions).

Currently, there's a profitability check hidden in llvm::UnrollLoop: if the "Force" parameter is false, it will refuse to unroll any loop unless it can make the loop latch in the unrolled iterations an unconditional branch. (See the debug message "Wont unroll; remainder loop could not be generated" in lib/Transforms/Utils/LoopUnroll.cpp.) This prevents the unroll pass from going crazy unrolling every loop in the program, but it's also not a very good way to measure profitability.

Ideally, we want to tie unrolling to a dynamic cost discount. If the unrolled loop is substantially cheaper than the original, we should unroll, whether or not the savings come from eliminating the conditional branch in the latch. Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.

This patch is trying to solve the problem for certain cases... but the approach is kind of awkward: rather than actually moving the profitability check, it adds a special case to set the "Force" bit for certain loops.

Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.

That is bounded by thresholds.

This patch is trying to solve the problem for certain cases... but the approach is kind of awkward: rather than actually moving the profitability check,
it adds a special case to set the "Force" bit for certain loops.

Isn't it the same for Runtime and Partial? Setting count means almost the same. A count of 1 or 0 means we'll act in the same way if Runtime is false. I can rewrite this and set Force to true by default (but set count to 0 or 1 for uncountable loops that do not meet some conditions).
And yes, you are right: we need to move all profitability checks from Utils/LoopUnrollRuntime.cpp to Scalar/LoopUnrollPass.cpp (I plan to do this in another patch).

Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.

That is bounded by thresholds.

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)
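
The difference between a plain size threshold and one that is sensitive to dynamic savings can be modeled in a few lines. This is a simplified standalone model of the idea behind the boosting factor mentioned above, not the LLVM implementation; the struct, the function names, and the exact formula are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a cost estimate of one loop.
struct CostEstimate {
  uint64_t UnrolledCost;      // estimated cost after unrolling and simplifying
  uint64_t RolledDynamicCost; // estimated cost of running the rolled loop
};

// Percentage by which the size threshold is scaled. 100 means no boost.
// The larger the ratio of rolled to unrolled cost, the larger the boost,
// capped at MaxBoostPercent. (Assumes costs small enough that the
// multiplication by 100 does not overflow.)
unsigned boostPercent(const CostEstimate &C, unsigned MaxBoostPercent) {
  if (C.UnrolledCost == 0)
    return MaxBoostPercent;
  uint64_t Ratio = 100 * C.RolledDynamicCost / C.UnrolledCost;
  return static_cast<unsigned>(std::min<uint64_t>(Ratio, MaxBoostPercent));
}

// A fixed threshold ignores savings; the boosted one lets a loop that
// simplifies well after unrolling exceed the base size limit.
bool fitsBoostedThreshold(const CostEstimate &C, unsigned Threshold,
                          unsigned MaxBoostPercent) {
  return C.UnrolledCost <
         static_cast<uint64_t>(Threshold) * boostPercent(C, MaxBoostPercent) / 100;
}
```

In this model a loop whose unrolled cost equals its rolled dynamic cost gets no boost, so only the base threshold applies to it.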

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)

Right. But full unroll is the only exception.
And yes this threshold is not that accurate, but it bounds loop unrolling. And it will bound uncountable loops unroll as well.

In D21720#698373, @evstupac wrote:

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)

Right. But full unroll is the only exception.
And yes this threshold is not that accurate, but it bounds loop unrolling. And it will bound uncountable loops unroll as well.

Okay, so is the general plan here to?

1. Generally allow unrolling of uncountable loops
2. Adjust the profitability heuristic to account for the cost of the branches for the uncountable case
3. Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

Okay, so is the general plan here to?

Generally allow unrolling of uncountable loops

Adjust the profitability heuristic to account for the cost of the branches for the uncountable case

Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

I agree that we should start from (1). But I don't understand, why (2) (and (3) ) is considered special for the uncountable case. AFAIU, the current heuristic aims at simulating these savings (removed branches) already - even though we do not try to accurately predict cost of rolled and unrolled version of the loop. Am I missing something?

Michael

Okay, so is the general plan here to?

Generally allow unrolling of uncountable loops

That seems reasonable. I'll enable this at O3.

Adjust the profitability heuristic to account for the cost of the branches for the uncountable case

In most cases we can not remove branches in uncountable (isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is true) loops. And it is hard to detect cases when we can.
To be able to reduce IVs (if there are such) we'll need to count them and do some basic analysis. A lot of uncountable loops don't have IVs at all.
So the heuristic we have for Runtime unrolling (with unknown trip count) is not ok here.

Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

Right now this is the only way to see if unroll for uncountable loop is profitable or not.

I agree that we should start from (1). But I don't understand, why (2) (and (3) ) is considered special for the uncountable case. AFAIU, the current heuristic aims at simulating these savings (removed branches) already - even though we do not try to accurately predict cost of rolled and unrolled version of the loop. Am I missing something?

The current heuristic works for loops with unknown trip count (when we are able to calculate it, but only at run time). It is not applicable to uncountable loops.

Enabled uncountable (unable to calculate trip count) unroll by default at O3 and higher.
Renamed test (according to inline comments).

PING

The use of the "Force" bit here is still really confusing... we need a better way of expressing the profitability of unrolling.

include/llvm/CodeGen/BasicTTIImpl.h
332	What effect does this have on performance?

In D21720#766168, @efriedma wrote:

The use of the "Force" bit here is still really confusing... we need a better way of expressing the profitability of unrolling.

The naming was introduced earlier. Changing it in this patch would be considered an "unrelated change".
"Force" means we'll unroll the loop anyway. For example, if the user sets "pragma unroll(2)" we need to force unrolling (no matter whether the loop is countable or not, or whether it is expensive to calculate the trip count or not...).

include/llvm/CodeGen/BasicTTIImpl.h
332	spec performance is unchanged. However I see performance gain on small loops working with list structures (that is the main reason I'm trying to apply this). Generally I don't see why we are missing unroll of uncountable loops now. When we are able to prove profitability we should do this.

PING.

Changing it in this patch would be considered as "unrelated change".

It's still messy, and still needs to change before this lands. Changing it in a separate patch is fine, if that makes sense.

This change needs testcases, including a couple examples of loops where the unrolling is profitable.

lib/Transforms/Scalar/LoopUnrollPass.cpp
413	Indentation. And I'm not sure why you're adding an instruction to the worklist that might not even be in the loop.
466	Maybe this would be more clear if it returned a boolean? I'm not sure the rest of this code makes sense if PhiCycle isn't either 0 or 2.
608	What is this change doing? (Please put comments in the code.)
987	"Cost->UnrolledCost >= Cost->RolledDynamicCost" is the profitability check? Needs a comment to explain what that means. Do we care how large the improvement is vs. the size of the loop?
990	Maybe rearrange this? UP.Force is only true on one codepath out of the previous if statement.

sanjoy resigned from this revision.Jan 29 2022, 5:36 PM

Revision Contents

Path	Size
include/llvm/Analysis/LoopUnrollAnalyzer.h	6 lines
include/llvm/CodeGen/BasicTTIImpl.h	5 lines
lib/Analysis/LoopUnrollAnalyzer.cpp	6 lines
lib/Transforms/Scalar/LoopUnrollPass.cpp	105 lines
unittests/Analysis/UnrollAnalyzer.cpp	2 lines

Diff 98867

include/llvm/Analysis/LoopUnrollAnalyzer.h

[42 lines not shown; context: class UnrolledInstAnalyzer : private InstVisitor<UnrolledInstAnalyzer, bool> {]
   struct SimplifiedAddress {
     Value *Base = nullptr;
     ConstantInt *Offset = nullptr;
   };

 public:
   UnrolledInstAnalyzer(unsigned Iteration,
                        DenseMap<Value *, Constant *> &SimplifiedValues,
-                       ScalarEvolution &SE, const Loop *L)
-      : SimplifiedValues(SimplifiedValues), SE(SE), L(L) {
+                       ScalarEvolution &SE, const Loop *L, bool CompleteUnroll)
+      : SimplifiedValues(SimplifiedValues), SE(SE),
+        L(L), CompleteUnroll(CompleteUnroll) {
     IterationNumber = SE.getConstant(APInt(64, Iteration));
   }

   // Allow access to the initial visit method.
   using Base::visit;

 private:
   /// \brief A cache of pointer bases and constant-folded offsets corresponding
[15 lines not shown; context: private:]
   /// of simplified values specific to this iteration. The idea is to propagate
   /// any special information we have about loads that can be replaced with
   /// constants after complete unrolling, and account for likely simplifications
   /// post-unrolling.
   DenseMap<Value *, Constant *> &SimplifiedValues;

   ScalarEvolution &SE;
   const Loop *L;
+  bool CompleteUnroll;

   bool simplifyInstWithSCEV(Instruction *I);

   bool visitInstruction(Instruction &I) { return simplifyInstWithSCEV(&I); }
   bool visitBinaryOperator(BinaryOperator &I);
   bool visitLoad(LoadInst &I);
   bool visitCastInst(CastInst &I);
   bool visitCmpInst(CmpInst &I);
   bool visitPHINode(PHINode &PN);
 };
 }
 #endif

include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 320 Lines • ▼ Show 20 Lines	for (Loop::block_iterator I = L->block_begin(), E = L->block_end(); I != E;
}		}
}		}

// Enable runtime and partial unrolling up to the specified size.		// Enable runtime and partial unrolling up to the specified size.
// Enable using trip count upper bound to unroll loops.		// Enable using trip count upper bound to unroll loops.
UP.Partial = UP.Runtime = UP.UpperBound = true;		UP.Partial = UP.Runtime = UP.UpperBound = true;
UP.PartialThreshold = MaxOps;		UP.PartialThreshold = MaxOps;

		const TargetMachine &TM = getTLI()->getTargetMachine();
		// Unroll uncountable inner loops at O3 and higher.
		if (L->getSubLoops().size() == 0 && TM.getOptLevel() > CodeGenOpt::Default)
		UP.Force = true;
		efriedmaUnsubmitted Not Done Reply Inline Actions What effect does this have on performance? efriedma: What effect does this have on performance?
		evstupacAuthorUnsubmitted Not Done Reply Inline Actions spec performance is unchanged. However I see performance gain on small loops working with list structures (that is the main reason I'm trying to apply this). Generally I don't see why we are missing unroll of uncountable loops now. When we are able to prove profitability we should do this. evstupac: spec performance is unchanged. However I see performance gain on small loops working with list…

// Avoid unrolling when optimizing for size.		// Avoid unrolling when optimizing for size.
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
UP.PartialOptSizeThreshold = 0;		UP.PartialOptSizeThreshold = 0;

// Set number of instructions optimized when "back edge"		// Set number of instructions optimized when "back edge"
// becomes "fall through" to default value of 2.		// becomes "fall through" to default value of 2.
UP.BEInsns = 2;		UP.BEInsns = 2;
}		}
▲ Show 20 Lines • Show All 840 Lines • Show Last 20 Lines
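
For readers unfamiliar with the case discussed in the inline comments above, here is a minimal standalone sketch of what unrolling an uncountable list-traversal loop by 2 looks like at the source level. It does not use LLVM; the `Node` type and both function names are hypothetical and exist only for illustration. Because the trip count is unknown, the exit test has to stay in both copies of the body, and only the backedge is shared.

```cpp
#include <cassert>

// Hypothetical list node, used only for this illustration.
struct Node {
  int Value;
  Node *Next;
};

// Rolled form: one exit test and one backedge per element.
int sumRolled(const Node *N) {
  int Sum = 0;
  while (N) {
    Sum += N->Value;
    N = N->Next;
  }
  return Sum;
}

// Unrolled by 2: the trip count is unknown, so the exit test is kept in
// both copies of the body; the backedge is taken once per two elements.
int sumUnrolledBy2(const Node *N) {
  int Sum = 0;
  while (N) {
    Sum += N->Value;
    N = N->Next;
    if (!N)
      break;
    Sum += N->Value;
    N = N->Next;
  }
  return Sum;
}
```

Whether the second form is actually faster depends on the target and on what simplifies after unrolling, which is what the cost estimate in this revision tries to predict.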

lib/Analysis/LoopUnrollAnalyzer.cpp

Show All 29 Lines	bool UnrolledInstAnalyzer::simplifyInstWithSCEV(Instruction *I) {
if (!SE.isSCEVable(I->getType()))		if (!SE.isSCEVable(I->getType()))
return false;		return false;

const SCEV *S = SE.getSCEV(I);		const SCEV *S = SE.getSCEV(I);
if (auto *SC = dyn_cast<SCEVConstant>(S)) {		if (auto *SC = dyn_cast<SCEVConstant>(S)) {
SimplifiedValues[I] = SC->getValue();		SimplifiedValues[I] = SC->getValue();
return true;		return true;
}		}
		if (!CompleteUnroll)
		return false;

auto *AR = dyn_cast<SCEVAddRecExpr>(S);		auto *AR = dyn_cast<SCEVAddRecExpr>(S);
if (!AR || AR->getLoop() != L)		if (!AR || AR->getLoop() != L)
return false;		return false;

const SCEV *ValueAtIteration = AR->evaluateAtIteration(IterationNumber, SE);		const SCEV *ValueAtIteration = AR->evaluateAtIteration(IterationNumber, SE);
// Check if the AddRec expression becomes a constant.		// Check if the AddRec expression becomes a constant.
if (auto *SC = dyn_cast<SCEVConstant>(ValueAtIteration)) {		if (auto *SC = dyn_cast<SCEVConstant>(ValueAtIteration)) {
▲ Show 20 Lines • Show All 160 Lines • ▼ Show 20 Lines
}		}

bool UnrolledInstAnalyzer::visitPHINode(PHINode &PN) {		bool UnrolledInstAnalyzer::visitPHINode(PHINode &PN) {
// Run base visitor first. This way we can gather some useful for later		// Run base visitor first. This way we can gather some useful for later
// analysis information.		// analysis information.
if (Base::visitPHINode(PN))		if (Base::visitPHINode(PN))
return true;		return true;

		// Consider PHI is foldable only after complete unroll.
		if (!CompleteUnroll)
		return false;

// The loop induction PHI nodes are definitionally free.		// The loop induction PHI nodes are definitionally free.
return PN.getParent() == L->getHeader();		return PN.getParent() == L->getHeader();
}		}

lib/Transforms/Scalar/LoopUnrollPass.cpp

Show First 20 Lines • Show All 83 Lines • ▼ Show 20 Lines	UnrollAllowPartial("unroll-allow-partial", cl::Hidden,
cl::desc("Allows loops to be partially unrolled until "		cl::desc("Allows loops to be partially unrolled until "
"-unroll-threshold loop size is reached."));		"-unroll-threshold loop size is reached."));

static cl::opt<bool> UnrollAllowRemainder(		static cl::opt<bool> UnrollAllowRemainder(
"unroll-allow-remainder", cl::Hidden,		"unroll-allow-remainder", cl::Hidden,
cl::desc("Allow generation of a loop remainder (extra iterations) "		cl::desc("Allow generation of a loop remainder (extra iterations) "
"when unrolling a loop."));		"when unrolling a loop."));

		static cl::opt<bool> UnrollUncountable(
		"unroll-uncountable", cl::Hidden, cl::init(false),
		cl::desc("Allow unroll of uncountable (trip count is not countable) "
		zzheng: Is this description copied from the flag above? "allow-force" sounds like forced unrolling... How about naming it UnrollAllowUncountable (unroll-allow-uncountable)?
		evstupac (author): Right. Thanks for catching this. Will fix. "unroll-allow-uncountable" looks good.
		"loops when profitable."));

static cl::opt<bool>		static cl::opt<bool>
UnrollRuntime("unroll-runtime", cl::ZeroOrMore, cl::Hidden,		UnrollRuntime("unroll-runtime", cl::ZeroOrMore, cl::Hidden,
cl::desc("Unroll loops with run-time trip counts"));		cl::desc("Unroll loops with run-time trip counts"));

static cl::opt<unsigned> UnrollMaxUpperBound(		static cl::opt<unsigned> UnrollMaxUpperBound(
"unroll-max-upperbound", cl::init(8), cl::Hidden,		"unroll-max-upperbound", cl::init(8), cl::Hidden,
cl::desc(		cl::desc(
"The max of trip count upper bound that is considered in unrolling"));		"The max of trip count upper bound that is considered in unrolling"));
▲ Show 20 Lines • Show All 80 Lines • ▼ Show 20 Lines	static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
if (UnrollAllowPartial.getNumOccurrences() > 0)		if (UnrollAllowPartial.getNumOccurrences() > 0)
UP.Partial = UnrollAllowPartial;		UP.Partial = UnrollAllowPartial;
if (UnrollAllowRemainder.getNumOccurrences() > 0)		if (UnrollAllowRemainder.getNumOccurrences() > 0)
UP.AllowRemainder = UnrollAllowRemainder;		UP.AllowRemainder = UnrollAllowRemainder;
if (UnrollRuntime.getNumOccurrences() > 0)		if (UnrollRuntime.getNumOccurrences() > 0)
UP.Runtime = UnrollRuntime;		UP.Runtime = UnrollRuntime;
if (UnrollMaxUpperBound == 0)		if (UnrollMaxUpperBound == 0)
UP.UpperBound = false;		UP.UpperBound = false;
		if (UnrollUncountable.getNumOccurrences() > 0)
		UP.Force = true;
if (UnrollAllowPeeling.getNumOccurrences() > 0)		if (UnrollAllowPeeling.getNumOccurrences() > 0)
UP.AllowPeeling = UnrollAllowPeeling;		UP.AllowPeeling = UnrollAllowPeeling;

// Apply user values provided by argument		// Apply user values provided by argument
if (UserThreshold.hasValue()) {		if (UserThreshold.hasValue()) {
UP.Threshold = *UserThreshold;		UP.Threshold = *UserThreshold;
UP.PartialThreshold = *UserThreshold;		UP.PartialThreshold = *UserThreshold;
}		}
Show All 38 Lines	struct UnrolledInstStateKeyInfo {
}		}
static inline bool isEqual(const UnrolledInstState &LHS,		static inline bool isEqual(const UnrolledInstState &LHS,
const UnrolledInstState &RHS) {		const UnrolledInstState &RHS) {
return PairInfo::isEqual({LHS.I, LHS.Iteration}, {RHS.I, RHS.Iteration});		return PairInfo::isEqual({LHS.I, LHS.Iteration}, {RHS.I, RHS.Iteration});
}		}
};		};
}		}

		/// Some instructions take the same value again after a fixed number of
		/// loop iterations. This function finds the cycle length for such a PHI
		/// (0 if no cycle is recognized).
		static unsigned getPhiCycleLength(PHINode *PHI, const Loop *L) {
		Value *V = PHI->getIncomingValueForBlock(L->getLoopLatch());
		Instruction *I = dyn_cast<Instruction>(V);
		if (!I)
		return 0;
		switch (I->getOpcode()) {
		case Instruction::Sub:
		if (ConstantInt *CS = dyn_cast<ConstantInt>(I->getOperand(0)))
		if (CS->isZero() && I->getOperand(1) == PHI)
		return 2;
		case Instruction::Xor:
		if (I->getOperand(0) == PHI && L->isLoopInvariant(I->getOperand(1)))
		return 2;
		default:
		return 0;
		}
		return 0;
		}
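
The two patterns matched above (`0 - phi` and `phi ^ invariant`) are recurrences that return to their starting value after two steps. The following standalone sketch, which does not use LLVM, checks that property by brute force. The helper `cycleLengthFrom` and the three step functions are hypothetical names introduced only for this illustration; the patch itself recognizes the patterns structurally rather than by simulation.

```cpp
#include <cassert>

// Hypothetical helper: counts how many applications of Step it takes for
// the value to return to Start, giving up after MaxSteps.
// Returns 0 if no cycle was observed within MaxSteps.
template <typename StepFn>
unsigned cycleLengthFrom(int Start, StepFn Step, unsigned MaxSteps) {
  int V = Start;
  for (unsigned N = 1; N <= MaxSteps; ++N) {
    V = Step(V);
    if (V == Start)
      return N;
  }
  return 0;
}

// Source-level analogues of the latch values the diff looks for.
inline int negate(int V) { return 0 - V; }    // sub 0, phi
inline int toggle(int V) { return V ^ 1; }    // xor phi, invariant
inline int increment(int V) { return V + 1; } // ordinary induction, no short cycle
```

With a nonzero start value, `negate` and `toggle` both report a cycle of 2, while `increment` reports 0 for any small `MaxSteps`.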

namespace {		namespace {
struct EstimatedUnrollCost {		struct EstimatedUnrollCost {
/// \brief The estimated cost after unrolling.		/// \brief The estimated cost after unrolling.
unsigned UnrolledCost;		unsigned UnrolledCost;

/// \brief The estimated dynamic cost of executing the instructions in the		/// \brief The estimated dynamic cost of executing the instructions in the
/// rolled form.		/// rolled form.
unsigned RolledDynamicCost;		unsigned RolledDynamicCost;
Show All 9 Lines
/// dynamic cost we mean that we won't count costs of blocks that are known not		/// dynamic cost we mean that we won't count costs of blocks that are known not
/// to be executed (i.e. if we have a branch in the loop and we know that at the		/// to be executed (i.e. if we have a branch in the loop and we know that at the
/// given iteration its condition would be resolved to true, we won't add up the		/// given iteration its condition would be resolved to true, we won't add up the
/// cost of the 'false'-block).		/// cost of the 'false'-block).
/// \returns Optional value, holding the RolledDynamicCost and UnrolledCost. If		/// \returns Optional value, holding the RolledDynamicCost and UnrolledCost. If
/// the analysis failed (no benefits expected from the unrolling, or the loop is		/// the analysis failed (no benefits expected from the unrolling, or the loop is
/// too big to analyze), the returned value is None.		/// too big to analyze), the returned value is None.
static Optional<EstimatedUnrollCost>		static Optional<EstimatedUnrollCost>
analyzeLoopUnrollCost(const Loop *L, unsigned TripCount, DominatorTree &DT,		analyzeLoopUnrollCost(const Loop *L, unsigned UnrollCount, DominatorTree &DT,
ScalarEvolution &SE, const TargetTransformInfo &TTI,		ScalarEvolution &SE, const TargetTransformInfo &TTI,
unsigned MaxUnrolledLoopSize) {		unsigned MaxUnrolledLoopSize, bool CompleteUnroll = true) {
// We want to be able to scale offsets by the trip count and add more offsets		// We want to be able to scale offsets by the trip count and add more offsets
// to them without checking for overflows, and we already don't want to		// to them without checking for overflows, and we already don't want to
// analyze massive trip counts, so we force the max to be reasonably small.		// analyze massive trip counts, so we force the max to be reasonably small.
assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&		assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&
"The unroll iterations max is too large!");		"The unroll iterations max is too large!");

// Only analyze inner loops. We can't properly estimate cost of nested loops		// Only analyze inner loops. We can't properly estimate cost of nested loops
// and we won't visit inner loops again anyway.		// and we won't visit inner loops again anyway.
if (!L->empty())		if (!L->empty())
return None;		return None;

// Don't simulate loops with a big or unknown tripcount		// Don't simulate loops with a big or unknown UnrollCount
if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||		if (!UnrollMaxIterationsCountToAnalyze || !UnrollCount ||
TripCount > UnrollMaxIterationsCountToAnalyze)		UnrollCount > UnrollMaxIterationsCountToAnalyze)
return None;		return None;

SmallSetVector<BasicBlock *, 16> BBWorklist;		SmallSetVector<BasicBlock *, 16> BBWorklist;
SmallSetVector<std::pair<BasicBlock *, BasicBlock *>, 4> ExitWorklist;		SmallSetVector<std::pair<BasicBlock *, BasicBlock *>, 4> ExitWorklist;
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;		SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;

// The estimated cost of the unrolled form of the loop. We try to estimate		// The estimated cost of the unrolled form of the loop. We try to estimate
▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	for (;; --Iteration) {
continue;		continue;

// Mark that we are counting the cost of this instruction now.		// Mark that we are counting the cost of this instruction now.
Cost.IsCounted = true;		Cost.IsCounted = true;

// If this is a PHI node in the loop header, just add it to the PHI set.		// If this is a PHI node in the loop header, just add it to the PHI set.
if (auto *PhiI = dyn_cast<PHINode>(I))		if (auto *PhiI = dyn_cast<PHINode>(I))
if (PhiI->getParent() == L->getHeader()) {		if (PhiI->getParent() == L->getHeader()) {
		if (CompleteUnroll)
assert(Cost.IsFree && "Loop PHIs shouldn't be evaluated as they "		assert(Cost.IsFree && "Loop PHIs shouldn't be evaluated as they "
"inherently simplify during unrolling.");		"inherently simplify during complete "
		"unrolling.");
if (Iteration == 0)		if (Iteration == 0)
continue;		continue;

// Push the incoming value from the backedge into the PHI used list		// Push the incoming value from the backedge into the PHI used list
// if it is an in-loop instruction. We'll use this to populate the		// if it is an in-loop instruction. We'll use this to populate the
// cost worklist for the next iteration (as we count backwards).		// cost worklist for the next iteration (as we count backwards).
if (auto *OpI = dyn_cast<Instruction>(		if (auto *OpI = dyn_cast<Instruction>(
PhiI->getIncomingValueForBlock(L->getLoopLatch())))		PhiI->getIncomingValueForBlock(L->getLoopLatch())))
Show All 14 Lines	for (;; --Iteration) {
// recursively. If we reach a loop PHI node, simply add it to the set		// recursively. If we reach a loop PHI node, simply add it to the set
// to be considered on the next iteration (backwards!).		// to be considered on the next iteration (backwards!).
for (Value *Op : I->operands()) {		for (Value *Op : I->operands()) {
// Check whether this operand is free due to being a constant or		// Check whether this operand is free due to being a constant or
// outside the loop.		// outside the loop.
auto *OpI = dyn_cast<Instruction>(Op);		auto *OpI = dyn_cast<Instruction>(Op);
if (!OpI || !L->contains(OpI))		if (!OpI || !L->contains(OpI))
continue;		continue;
		// For regular unroll instruction that depends on PHI also costs.
		if (!CompleteUnroll && isa<PHINode>(OpI)) {
		auto *OpIP = dyn_cast<PHINode>(OpI);
		if(OpIP->getParent() == L->getHeader())
		if (auto *OpIPI =
		dyn_cast<Instruction>(
		OpIP->getIncomingValueForBlock(L->getLoopLatch())))
		CostWorklist.push_back(OpIPI);
		efriedma: Indentation. Also, I'm not sure why you're adding an instruction to the worklist that might not even be in the loop.
		}
// Otherwise accumulate its cost.		// Otherwise accumulate its cost.
CostWorklist.push_back(OpI);		CostWorklist.push_back(OpI);
}		}
} while (!CostWorklist.empty());		} while (!CostWorklist.empty());

if (PHIUsedList.empty())		if (PHIUsedList.empty())
// We've exhausted the search.		// We've exhausted the search.
break;		break;
Show All 12 Lines	assert(L->isLCSSAForm(DT) &&
"Must have loops in LCSSA form to track live-out values.");		"Must have loops in LCSSA form to track live-out values.");

DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");		DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");

// Simulate execution of each iteration of the loop counting instructions,		// Simulate execution of each iteration of the loop counting instructions,
// which would be simplified.		// which would be simplified.
// Since the same load will take different values on different iterations,		// Since the same load will take different values on different iterations,
// we literally have to go through all loop's iterations.		// we literally have to go through all loop's iterations.
for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {		for (unsigned Iteration = 0; Iteration < UnrollCount; ++Iteration) {
DEBUG(dbgs() << " Analyzing iteration " << Iteration << "\n");		DEBUG(dbgs() << " Analyzing iteration " << Iteration << "\n");

// Prepare for the iteration by collecting any simplified entry or backedge		// Prepare for the iteration by collecting any simplified entry or backedge
// inputs.		// inputs.
for (Instruction &I : *L->getHeader()) {		for (Instruction &I : *L->getHeader()) {
auto *PHI = dyn_cast<PHINode>(&I);		auto *PHI = dyn_cast<PHINode>(&I);
if (!PHI)		if (!PHI)
break;		break;

// The loop header PHI nodes must have exactly two input: one from the		// The loop header PHI nodes must have exactly two input: one from the
// loop preheader and one from the loop latch.		// loop preheader and one from the loop latch.
assert(		assert(
PHI->getNumIncomingValues() == 2 &&		PHI->getNumIncomingValues() == 2 &&
"Must have an incoming value only for the preheader and the latch.");		"Must have an incoming value only for the preheader and the latch.");

		// If incoming value for PHI is another loop PHI, that means we need to
		// store previous value (without unroll).
		if (PHINode *PHIL = dyn_cast<PHINode>(
		PHI->getIncomingValueForBlock(L->getLoopLatch())))
		if (Iteration && L->contains(PHIL))
		RolledDynamicCost++;

		unsigned PhiCycle = getPhiCycleLength (PHI, L);
		efriedma: Maybe this would be clearer if it returned a boolean? I'm not sure the rest of this code makes sense if PhiCycle isn't either 0 or 2.
		// For complete unroll if we were unable to get PHI cycle length
		// consider cycle length as full unroll count (TripCount).
		if (CompleteUnroll && PhiCycle == 0)
		PhiCycle = UnrollCount;
		// If PHI has no cycle we are unable to simplify it.
		if (PhiCycle == 0)
		continue;

Value *V = PHI->getIncomingValueForBlock(		Value *V = PHI->getIncomingValueForBlock(
Iteration == 0 ? L->getLoopPreheader() : L->getLoopLatch());		Iteration % PhiCycle ? L->getLoopLatch() : L->getLoopPreheader());
Constant *C = dyn_cast<Constant>(V);		Constant *C = dyn_cast<Constant>(V);
if (Iteration != 0 && !C)		if ((Iteration % PhiCycle) && !C)
C = SimplifiedValues.lookup(V);		C = SimplifiedValues.lookup(V);
if (C)		if (C)
SimplifiedInputValues.push_back({PHI, C});		SimplifiedInputValues.push_back({PHI, C});
}		}

// Now clear and re-populate the map for the next iteration.		// Now clear and re-populate the map for the next iteration.
SimplifiedValues.clear();		SimplifiedValues.clear();
while (!SimplifiedInputValues.empty())		while (!SimplifiedInputValues.empty())
SimplifiedValues.insert(SimplifiedInputValues.pop_back_val());		SimplifiedValues.insert(SimplifiedInputValues.pop_back_val());
		UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues,
		SE, L, CompleteUnroll);

UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues, SE, L);

BBWorklist.clear();		BBWorklist.clear();
BBWorklist.insert(L->getHeader());		BBWorklist.insert(L->getHeader());
// Note that we must not cache the size, this loop grows the worklist.		// Note that we must not cache the size, this loop grows the worklist.
for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {		for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
BasicBlock *BB = BBWorklist[Idx];		BasicBlock *BB = BBWorklist[Idx];

// Visit all instructions in the given basic block and try to simplify		// Visit all instructions in the given basic block and try to simplify
// it. We don't change the actual IR, just count optimization		// it. We don't change the actual IR, just count optimization
// opportunities.		// opportunities.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;

// Track this instruction's expected baseline cost when executing the		// Track this instruction's expected baseline cost when executing the
// rolled loop form.		// rolled loop form.
RolledDynamicCost += TTI.getUserCost(&I);		RolledDynamicCost += TTI.getUserCost(&I);

// Visit the instruction to analyze its loop cost after unrolling,		// Visit the instruction to analyze its loop cost after unrolling,
		mzolotukhin: This looks very hacky. What is the motivation of this change? How often do we see such loops? If we do want to handle such cases (which I'm not convinced of now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free. There are more cases than just xor: we can multiply by -1, and we should be able to handle that in a similar way. In the current form the code is not easily extensible to handle new cases.
		evstupac (author):
		> What is the motivation of this change? How often do we see such loops?
		Not every test looks like this, but there are a couple where we switch states: state = st[s^=1]; which becomes invariant, along with some other calculations, after unrolling.
		> If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free.
		The same is valid for the code above, where complete unrolling simplifies the PHI. Why is the PHI simplified here? I just did the same for XOR. And yes, we can multiply by -1, do i&1, i/2, ... but we need a start point, which depends on the unroll factor and the iteration. The other solution is to pass the unroll factor in addition to the iteration number and move the simplification there. Do you think this is better?
// and if the visitor returns true, mark the instruction as free after		// and if the visitor returns true, mark the instruction as free after
// unrolling and continue.		// unrolling and continue.
bool IsFree = Analyzer.visit(I);		bool IsFree = Analyzer.visit(I);
bool Inserted = InstCostMap.insert({&I, (int)Iteration,		bool Inserted = InstCostMap.insert({&I, (int)Iteration,
(unsigned)IsFree,		(unsigned)IsFree,
/*IsCounted*/ false}).second;		/*IsCounted*/ false}).second;
(void)Inserted;		(void)Inserted;
assert(Inserted && "Cannot have a state for an unvisited instruction!");		assert(Inserted && "Cannot have a state for an unvisited instruction!");
▲ Show 20 Lines • Show All 63 Lines • ▼ Show 20 Lines	for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
else		else
ExitWorklist.insert({BB, Succ});		ExitWorklist.insert({BB, Succ});
AddCostRecursively(*TI, Iteration);		AddCostRecursively(*TI, Iteration);
}		}

// If we found no optimization opportunities on the first iteration, we		// If we found no optimization opportunities on the first iteration, we
// won't find them on later ones too.		// won't find them on later ones too.
if (UnrolledCost == RolledDynamicCost) {		if (CompleteUnroll && UnrolledCost == RolledDynamicCost) {
DEBUG(dbgs() << " No opportunities found.. exiting.\n"		DEBUG(dbgs() << " No opportunities found.. exiting.\n"
<< " UnrolledCost: " << UnrolledCost << "\n");		<< " UnrolledCost: " << UnrolledCost << "\n");
return None;		return None;
}		}
}		}

while (!ExitWorklist.empty()) {		while (!ExitWorklist.empty()) {
BasicBlock *ExitingBB, *ExitBB;		BasicBlock *ExitingBB, *ExitBB;
std::tie(ExitingBB, ExitBB) = ExitWorklist.pop_back_val();		std::tie(ExitingBB, ExitBB) = ExitWorklist.pop_back_val();

for (Instruction &I : *ExitBB) {		for (Instruction &I : *ExitBB) {
auto *PN = dyn_cast<PHINode>(&I);		auto *PN = dyn_cast<PHINode>(&I);
if (!PN)		if (!PN)
break;		break;

Value *Op = PN->getIncomingValueForBlock(ExitingBB);		Value *Op = PN->getIncomingValueForBlock(ExitingBB);
		if (auto *OpPN = dyn_cast<PHINode>(Op))
		if (OpPN->getParent() == L->getHeader())
		Op = OpPN->getIncomingValueForBlock(L->getLoopLatch());
		efriedma: What is this change doing? (Please put comments in the code.)
if (auto *OpI = dyn_cast<Instruction>(Op))		if (auto *OpI = dyn_cast<Instruction>(Op))
if (L->contains(OpI))		if (L->contains(OpI))
AddCostRecursively(*OpI, TripCount - 1);		AddCostRecursively(*OpI, UnrollCount - 1);
}		}
}		}

DEBUG(dbgs() << "Analysis finished:\n"		DEBUG(dbgs() << "Analysis finished:\n"
<< "UnrolledCost: " << UnrolledCost << ", "		<< "UnrolledCost: " << UnrolledCost << ", "
<< "RolledDynamicCost: " << RolledDynamicCost << "\n");		<< "RolledDynamicCost: " << RolledDynamicCost << "\n");
return {{UnrolledCost, RolledDynamicCost}};		return {{UnrolledCost, RolledDynamicCost}};
}		}
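
The `state = st[s^=1]` case mentioned in the inline discussion above can be sketched at the source level as follows. This is a standalone illustration, not code from the patch: the function names, the two-entry table, and the string-terminated loop are all assumptions chosen to make the loop uncountable. After two toggles `S` is back at its entry value, so in the form unrolled by 2 each copy of the body uses an index that no longer changes from one iteration of the unrolled loop to the next.

```cpp
#include <cassert>

// Rolled form: the index S is toggled on every iteration, so the table
// lookup depends on a loop-carried value. The loop runs until the input
// string ends, so its trip count is not known up front.
// S is assumed to be 0 or 1 on entry.
int walkRolled(const int St[2], int S, const char *In) {
  int Acc = 0;
  while (*In) {
    Acc += St[S ^= 1];
    ++In;
  }
  return Acc;
}

// Unrolled by 2: the first copy always reads St[S ^ 1] and the second
// copy always reads St[S], so both lookups can be hoisted out of the loop.
int walkUnrolledBy2(const int St[2], int S, const char *In) {
  const int First = St[S ^ 1]; // value used by the first copy of the body
  const int Second = St[S];    // value used by the second copy of the body
  int Acc = 0;
  while (*In) {
    Acc += First;
    ++In;
    if (!*In)
      break;
    Acc += Second;
    ++In;
  }
  return Acc;
}
```

Both functions compute the same result for any input string and any entry value of `S` in {0, 1}; the second one has no loop-carried dependency on the index.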
▲ Show 20 Lines • Show All 350 Lines • ▼ Show 20 Lines	if (PragmaCount > 0 && !UP.AllowRemainder)
"unroll_count pragma because remainder loop is restricted "		"unroll_count pragma because remainder loop is restricted "
"(that could architecture specific or because the loop "		"(that could architecture specific or because the loop "
"contains a convergent instruction) and so must have an unroll "		"contains a convergent instruction) and so must have an unroll "
"count that divides the loop trip multiple of "		"count that divides the loop trip multiple of "
<< NV("TripMultiple", TripMultiple) << ". Unrolling instead "		<< NV("TripMultiple", TripMultiple) << ". Unrolling instead "
<< NV("UnrollCount", UP.Count) << " time(s).");		<< NV("UnrollCount", UP.Count) << " time(s).");
}		}

		// Estimate if Force unroll could be profitable.
		if (!isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) && UP.Count >= 2
		&& !ExplicitUnroll) {
		UP.Force = false;
		} else {
		Optional<EstimatedUnrollCost> Cost =
		analyzeLoopUnrollCost(L, 2, DT, *SE, TTI,
		UP.PartialThreshold + UP.BEInsns, false);
		if (!Cost || Cost->UnrolledCost >= Cost->RolledDynamicCost ||
		(unsigned)Cost->UnrolledCost > UP.PartialThreshold)
		efriedma: "Cost->UnrolledCost >= Cost->RolledDynamicCost" is the profitability check? This needs a comment to explain what it means. Do we care how large the improvement is relative to the size of the loop?
		UP.Force = false;
		}
		if (UP.Force) {
		efriedma: Maybe rearrange this? UP.Force is only true on one code path out of the previous if statement.
		// Currently force unroll only by 2.
		UP.Count = 2;
		DEBUG(dbgs() << " unrolling with count: " << UP.Count << "\n");
		// Unroll uncountable loops only once.
		return true;
		}
if (UP.Count > UP.MaxCount)		if (UP.Count > UP.MaxCount)
UP.Count = UP.MaxCount;		UP.Count = UP.MaxCount;
DEBUG(dbgs() << " partially unrolling with count: " << UP.Count << "\n");		DEBUG(dbgs() << " partially unrolling with count: " << UP.Count << "\n");
if (UP.Count < 2)		if (UP.Count < 2)
UP.Count = 0;		UP.Count = 0;
return ExplicitUnroll;		return ExplicitUnroll;
}		}
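
The condition questioned in the inline comments above can be read as a small predicate. The sketch below restates it outside of LLVM so the three rejection cases are easy to see. `isForceUnrollProfitable` is a hypothetical helper name, not something in the patch, and a plain pointer stands in for the `Optional<EstimatedUnrollCost>` used in the diff.

```cpp
#include <cassert>

// Mirrors the two fields of EstimatedUnrollCost shown in the diff.
struct EstimatedUnrollCost {
  unsigned UnrolledCost;
  unsigned RolledDynamicCost;
};

// Hypothetical helper (not part of the patch). Returns true only when:
//  - the analysis produced an estimate at all,
//  - the unroll-by-2 estimate is strictly cheaper than the rolled form,
//  - the unrolled body still fits under the partial-unroll threshold.
bool isForceUnrollProfitable(const EstimatedUnrollCost *Cost,
                             unsigned PartialThreshold) {
  if (!Cost)
    return false;
  if (Cost->UnrolledCost >= Cost->RolledDynamicCost)
    return false;
  return Cost->UnrolledCost <= PartialThreshold;
}
```

Note that this predicate only asks whether there is any improvement; it does not weigh the size of the improvement against the size of the loop, which is the open question raised in the review.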

▲ Show 20 Lines • Show All 296 Lines • Show Last 20 Lines

unittests/Analysis/UnrollAnalyzer.cpp

Show All 32 Lines	bool runOnFunction(Function &F) override {
BasicBlock *Header = &*FI++;		BasicBlock *Header = &*FI++;
Loop *L = LI->getLoopFor(Header);		Loop *L = LI->getLoopFor(Header);
BasicBlock *Exiting = L->getExitingBlock();		BasicBlock *Exiting = L->getExitingBlock();

SimplifiedValuesVector.clear();		SimplifiedValuesVector.clear();
TripCount = SE->getSmallConstantTripCount(L, Exiting);		TripCount = SE->getSmallConstantTripCount(L, Exiting);
for (unsigned Iteration = 0; Iteration < TripCount; Iteration++) {		for (unsigned Iteration = 0; Iteration < TripCount; Iteration++) {
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues, *SE, L);		UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues, *SE, L, true);
for (auto *BB : L->getBlocks())		for (auto *BB : L->getBlocks())
for (Instruction &I : *BB)		for (Instruction &I : *BB)
Analyzer.visit(I);		Analyzer.visit(I);
SimplifiedValuesVector.push_back(SimplifiedValues);		SimplifiedValuesVector.push_back(SimplifiedValues);
}		}
return false;		return false;
}		}
void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
▲ Show 20 Lines • Show All 281 Lines • Show Last 20 Lines


Unroll for uncountable loops
Needs Review (Public)

Revision Contents

Diff 98867

include/llvm/Analysis/LoopUnrollAnalyzer.h

include/llvm/CodeGen/BasicTTIImpl.h

lib/Analysis/LoopUnrollAnalyzer.cpp

lib/Transforms/Scalar/LoopUnrollPass.cpp

unittests/Analysis/UnrollAnalyzer.cpp
