Page MenuHomePhabricator

Unroll for uncountable loops
Needs ReviewPublic

Authored by evstupac on Jun 24 2016, 7:29 PM.

Details

Summary

The patch introduces 1st heuristic to unroll uncountable loops.
The heuristics targeted on loops working with list structures.

make check passed
spec2000 has unchanged performance
There are several improvements on applications working with list structures.

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac updated this revision to Diff 61880.Jun 24 2016, 7:29 PM
evstupac retitled this revision from to Unroll for uncountable loops.
evstupac updated this object.
evstupac added reviewers: mzolotukhin, sanjoy.
evstupac set the repository for this revision to rL LLVM.
evstupac added a subscriber: llvm-commits.
mzolotukhin edited edge metadata.Jun 27 2016, 8:04 AM

Hi Evgeny,

I haven't looked into the details of the patch yet, but (at least from the function name) it looks like this is very similar to what LoopUnrollAnalyzer does (see lib/Analysis/LoopUnrollAnalyzer.cpp). Have you considered adding this functionality there (and what's missing there to catch this)?

Michael

Hi Michael,

The new function is more about a decision if we need unroll or not. lib/Transforms/Scalar/LoopUnrollPass.cpp looks like a good place for it.

Thanks,
Evgeny

Hi Evgeny,

Why functionality from lib/Analysis/LoopUnrollAnalyzer.cpp isn't enough for your case? It should be able to predict that phis would be removed after unrolling (possibly along with other instructions).

Michael

evstupac updated this revision to Diff 77959.Nov 15 2016, 1:35 AM
evstupac edited edge metadata.

Add recurrent xor case.
Move calculations to cost estimation function.

evstupac updated this revision to Diff 78772.Nov 21 2016, 1:42 PM
  1. Set "disable unroll" for loops unrolled by "force unroll". The reason is that on LTO unroll pass could be called multiple times and we will try to unroll unless exceed threshold. This is not critical, but makes unroll behavior dependent on LTO.
  2. Add missed unit tests.
  3. Fix calculations for phis used outside of the loop only.
mzolotukhin added inline comments.Dec 14 2016, 4:45 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
505–509

This look very hacky. What is the motivation of this change? How often do we see such loops?

If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free. There are more cases than just xor - we can multiply by -1, and we should be able to handle in a similar way. In the current form the code is not easily extensible to handle new cases.

evstupac added inline comments.Dec 14 2016, 6:08 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
505–509

What is the motivation of this change? How often do we see such loops?

Not each test looks like this, but there are couple where we switch states:

state = st[s^=1];

which becomes invariant with some other calculations after unroll.

If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free.

The same is valid for the code above where complete unroll simplify phi. Why phi is simplified here?
I just did the same for XOR.
And yes we can multiply by -1, do i&1, i/2,... but we need a start point which depends on unroll factor and iteration.
The other solution is to pass, unroll factor in addition to iteration number and move simplification there.
Do you think this is better?

> What is the motivation of this change? How often do we see such loops?
Not each test looks like this, but there are couple where we switch states:

state = st[s^=1];
which becomes invariant with some other calculations after unroll.

I realize that it's possible to write such a test. My question was how frequent are such cases in real programs. Can you provide statistics from SPEC/llvm-testsuite that would justify the change?

> If we do want to handle such cases (which I'm not convinced now), then we should do it in a general way. That is, the logic should be in instruction visitors, and we should automagically deduce that these instructions are free.
The same is valid for the code above where complete unroll simplify phi. Why phi is simplified here?

PHIs are present in every loop, in contrast to XORs.

The other solution is to pass, unroll factor in addition to iteration number and move simplification there.
Do you think this is better?

I haven't put much thought to this yet, because I'm still not convinced in the need of this change. It's possible though that if we decide to go this way, we'll need to do some prep-work first to make the code look nice. The main part of simplification code is already there (with PHI-nodes getting an extra attention due to their special and very important nature).

(was unable to answer in phabricator previously, so just put this from mail thread here)

I realize that it's possible to write such a test. My question was how frequent are such cases in real programs. Can you provide statistics from SPEC/llvm-testsuite that would justify the change?

I'll get the results for llvm test-suite. There is not much difference
on spec2000. There are gains on eembc benhmarks.

PHIs are present in every loop, in contrast to XORs.

The patch touches only recursive XORs which also go through PHI.
PHIs for optimization after complete unroll are in countable loops. My
patch deals with uncountable (a list loop for example).

I'm not sure I can share all the cases, but I've found good public example of "xor" in the loop from jpeg (h2v1_downsample and h2v2_downsample):
https://github.com/cloudflare/jpegtran/blob/master/jcsample.c

For the case above runtime unroll should work fine, but there are cases where loop upper bound is structure field load. If so loop becomes uncountable and we are unable to apply runtime unroll.
Generally there multiple cases when runtime unroll can fail, so why not to force unroll if we can prove profitability?

evstupac updated this revision to Diff 88080.Feb 10 2017, 7:36 PM

Add phi cycle function to explain phi simplification.

Regarding performance data.
Initially I want unroll only loops where previous iteration value is reused. Just with this change I get the following.
197.parser up to 5% depending on CPU and 10% to 30% on set of internal benchmarks.
With xor and sub enabled most of benchmarks the results got buildsame or no performance change. However I still believe this fits to current functionality.

One thing I noticed, adding cost to regular unroll (not complete). All loops which counts something got unrolled:

while(smth)

n++;

Use(n);

And unroll safes at least 1 instruction for such loops.

evstupac added subscribers: hfinkel, mkuper, Farhana.

PING.
Let's come to a conclusion if this is acceptable or not.
The cases that covered here:

  1. Uncountable loops where previous value is reused (save one+ instruction for each value)
  2. uncountable loops, that counts smth:
while(smth) {
  body();
  n++;
}

to:

while(smth) {
  body();
  if (!smth) goto exit2;
  body();
  n+=2;
}
exit2:
n++;
loop exit:

Saves one+ "add" in the loop.

  1. Uncountable loops switching states (not that frequent):
while(smth) {
  body();
  s ^= n;
}
while(smth) {
  body();
  s = -s;
}

Potentially saves a lot (as constant values could simplify several instructions).

Hi,

I'm still not convinced that we need this functionality. The loops that you mentioned can easily be handled by using unroll pragma if users really care about their performance. The patch might introduce no performance/compile-time regressions, but we shouldn't forget about the increased source code complexity.

Michael

Hi Michael,

but we shouldn't forget about the increased source code complexity.

The patch mostly reuse existing code for full unroll.

Regarding PHI Cycle it even improve current heuristic, as full unroll consider
"s^=1;" is constant "1" at all iterations >0. However it is 0 on even and 1 on odd.
And I'm ok to drop this part to make code shorter.

Maybe we can add uncountable loops unroll under -O3?

Thanks,
Evgeny

hfinkel edited edge metadata.Mar 2 2017, 5:03 PM

Hi Michael,

but we shouldn't forget about the increased source code complexity.

The patch mostly reuse existing code for full unroll.

Regarding PHI Cycle it even improve current heuristic, as full unroll consider
"s^=1;" is constant "1" at all iterations >0. However it is 0 on even and 1 on odd.
And I'm ok to drop this part to make code shorter.

Maybe we can add uncountable loops unroll under -O3?

I agree. -O3 would be the right place for this.

Thanks,
Evgeny

Hi,

I'm still not convinced that we need this functionality. The loops that you mentioned can easily be handled by using unroll pragma if users really care about their performance. The patch might introduce no performance/compile-time regressions, but we shouldn't forget about the increased source code complexity.

Michael

I don't think that "easily be handled by using unroll pragma" is the right standard here. The same argument can be made for anything. We should only really require a pragma when we cannot design a reasonable heuristic (at this stage in the pipeline). This patch does not seem that large. I can review next week.

full unroll consider "s^=1;" is constant "1" at all iterations >0. However it is 0 on even and 1 on odd.

If that's the case, I'd be happy to review a separate patch that just fixes that.

I don't think that "easily be handled by using unroll pragma" is the right standard here. The same argument can be made for anything. We should only really require a pragma when we cannot design a reasonable heuristic (at this stage in the pipeline). This patch does not seem that large. I can review next week.

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s. While it looks kind of generic, I can see no way how it can be extended in future. E.g. it doesn't seem to be able to support another way to write the same idiom: s = i%2 or s = i&1, and if we'd like to support it, we'll need yet another approach.

Your point about using pragma is generally valid, but here is my view on it. While in general we do need to try to cover as many cases without pragmas as possible, we need to provide a general implementation that potentially can help other cases as well. Patching the compiler here and there just to cover cases one by one would lead to unmaintainable and inefficient code in the end. Also, unlike many other loops in a loop with s ^=1 the user would more likely correctly guess a favorable unroll factor (2), just by the nature of the computation. In more complicated loops a help from the compiler is needed much more, because it might be hard to guess the best unroll factor.

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i. Honestly, I don't know how hard it would be to implement this, and maybe the efforts and complexity won't be worth the gain, but that's what I'd call a general approach, rather than targeting a single case.

Michael

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s.

Why PhiCycle is not general? That could be InstCycle instead (for i&3, i&7, i%N,....).
"i%2", "i&1" are convertible to "s^=1".

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i.

Current approach do this, but starting from 0 iteration (taking in account that some instruction are cycle).

Suppose we skip this, what about:

  1. Uncountable loops where previous value is reused (save one+ instruction for each value)?
  2. Uncountable loops, that counts smth?

My biggest concern about this patch is that it doesn't solve the problem in a general way, but instead only catches two cases: s ^= 1 and s = -s.

Why PhiCycle is not general? That could be InstCycle instead (for i&3, i&7, i%N,....).
"i%2", "i&1" are convertible to "s^=1".

Sure, but "are convertible to" is not helpful if they're not. Whether that's a useful canonical form might be a separate discussion (maybe it is if that's the only use of the induction variable). In any case, I agree that we should have something that is insensitive to how this is written.

However, we still need to make sure we're doing this in a way that is profitable. The point of making sure to unroll by an even factor for these cycle-2 recurrences is that it allows us to completely remove the PHI. It is not clear to me that the more-general case shares this property without additional work (by which I mean canonicalization work).

IMHO, a general approach here would be to teach unroller (or maybe SCEV) to analyze N sequential iterations. E.g. for the xor case and for 2 sequential iterations starting at the i-th iteration, we'd get something like S_i_plus_2 = S_i_plus_1 ^ 1 = (S_i ^ 1) ^ 1 = S_i.

But you need more than this. You need to know how the intermediate iterations are related so that you can eliminate the PHI. This is especially true for the remainder case (because remainders are expensive compared to &, for example).

I suspect that SCEV would have a hard time doing this because the relations are not algebraic (although it might be able to do something interesting with the remainder case).

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?

Current approach do this, but starting from 0 iteration (taking in account that some instruction are cycle).

Suppose we skip this, what about:

  1. Uncountable loops where previous value is reused (save one+ instruction for each value)?
  2. Uncountable loops, that counts smth?

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?

"i/C" could also be optimized when "i" starts from "k*C". However we'll need to lower it (replace with new IV) in later optimizations (for example LSR).

What about cases 1. and 2. ?

In any case, there is definitely a more-general case we can handle here, but it is the following: there needs to either be a cycle PHI, or a use of an otherwise-unused instruction variable that repeats in a fixed pattern, such that unrolling the loop by the length of that pattern allows us to eliminate the PHI. i&C and i%C have that property. Is there anything else?

"i/C" could also be optimized when "i" starts from "k*C". However we'll need to lower it (replace with new IV) in later optimizations (for example LSR).

What about cases 1. and 2. ?

Case 1 sounds good, but maybe our existing heuristic should get it? I don't understand what you mean in case 2.

Case 1 sounds good, but maybe our existing heuristic should get it? I don't understand what you mean in case 2.

Actually case 1 is the most simple and profitable. Maybe we shall start with it only?
Regarding case 2:

It covers loops which counts how long some condition is true. For example (how many elements in the list):
while (curr) {

len++;
curr = curr->next;

}

After unroll by 2:

while(curr) {

curr1 = curr->next;
if (!curr1) goto exit1;
curr = curr1->next;
len+=2;

}
exit1:
len++;
loop exit:

but maybe our existing heuristic should get it?

No. We don't do unroll of uncountable loops at all.

zzheng added a subscriber: zzheng.Mar 10 2017, 10:27 AM
zzheng added inline comments.
lib/Transforms/Scalar/LoopUnrollPass.cpp
94

Is this description copied from above flag?

allow-force sounds like forced unrolling... how about name it UnrollAllowUncountable (unroll-allow-uncountable)?

test/Transforms/LoopUnroll/unroll-force.ll
1 ↗(On Diff #88080)

Maybe name this file 'unroll-uncountable.ll'?

What is the case 1?

For loops with unknown trip count (that's the case 2, right?) I don't see why we need to add any additional analyzers (analyzeLoopUnrollCost). Removing one instruction by unrolling by 2 is not better than the worst case for partial unroll of a loop with a known trip count:

for (i = 0; i < N; i++) {
  ...
}

But since we sometimes partially unroll such loops, I don't currently see a reason why we can't do this for loops with unknown trip-count as well. However, we don't need additional thresholds/knobs for it, thresholds for partial unroll should do it already.

What is the case 1?

Uncountable loops where previous value is reused (save one+ instruction for each value)
The example is in lit test (motivation comes from list loops).

For loops with unknown trip count (that's the case 2, right?)

No. All cases are about uncountable loops.

I don't see why we need to add any additional analyzers (analyzeLoopUnrollCost). Removing one instruction by unrolling by 2 is not better than the worst case for partial unroll of a loop with a known trip count:

We can skip this and do like in my initial patch (just run a function that return unroll count if it is profitable to unroll). That way cases 2 and 3 are skipped. I change the patch to call `analyzeLoopUnrollCost' to make more general.

But since we sometimes partially unroll such loops, I don't currently see a reason why we can't do this for loops with unknown trip-count as well. However, we don't need additional thresholds/knobs for it, thresholds for partial unroll should do it already.

The patch do not introduce a new threshold. If we enable unroll for all uncountable loops it will result in dramatic code size increase (for most cases without performance value).

evstupac added inline comments.Mar 10 2017, 12:02 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
94

Right. Thanks for catching this. Will fix. "unroll-allow-uncountable" looks good.

test/Transforms/LoopUnroll/unroll-force.ll
1 ↗(On Diff #88080)

Yes. Make sense.

Uncountable loops where previous value is reused (save one+ instruction for each value)

What do you mean by uncountable? How is it different from a loop with unknown trip count?
What is a previous value? Is it a value from a previous iteration? If that's the case, any induction variable satisfies that condition.

No. All cases are about uncountable loops.

What's the difference between these two cases? I've gotten totally confused now.

We can skip this and do like in my initial patch (just run a function that return unroll count if it is profitable to unroll). That way cases 2 and 3 are skipped. I change the patch to call `analyzeLoopUnrollCost' to make more general.

What is the case 3?

The patch do not introduce a new threshold. If we enable unroll for all uncountable loops it will result in dramatic code size increase (for most cases without performance value).

Exactly. And unroll of any loop that has at least one IV will save at least one instruction, because we'll have a phi for the IV and we can fold this phi in between unrolled iterations. So, my point was that "saving one instruction" should not be enough to unroll a loop.

What do you mean by uncountable? How is it different from a loop with unknown trip count?

L is uncountble <=> isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is true.
For unknown trip isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is false.

What's the difference between these two cases? I've gotten totally confused now.
What is the case 3?

The cases that covered here:

  1. Uncountable loops where previous value is reused (save one+ instruction for each value)
  2. uncountable loops, that counts smth:
while(smth) {
  body();
  n++;
}

to:

while(smth) {
  body();
  if (!smth) goto exit2;
  body();
  n+=2;
}
exit2:
n++;
loop exit:

Saves one+ "add" in the loop.

  1. Uncountable loops switching states (not that frequent):
while(smth) {
  body();
  s ^= n;
}
while(smth) {
  body();
  s = -s;
}

Potentially saves a lot (as constant values could simplify several instructions).

Currently, there's a profitability check hidden in llvm::UnrollLoop: if the "Force" parameter is false, it will refuse to unroll any loop unless it can make the loop latch in the unrolled iterations an unconditional branch. (See the debug message "Wont unroll; remainder loop could not be generated" in lib/Transforms/Utils/LoopUnroll.cpp.) This prevents the unroll pass from going crazy unrolling every loop in the program, but it's also not a very good way to measure profitability.

Ideally, we want to tie unrolling to a dynamic cost discount. If the unrolled loop is substantially cheaper than the original, we should unroll, whether or not the savings come from eliminating the conditional branch in the latch. Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.

This patch is trying to solve the problem for certain cases... but the approach is kind of awkward: rather than actually moving the profitability check, it adds a special case to set the "Force" bit for certain loops.

Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.

That is bounded by thresholds.

This patch is trying to solve the problem for certain cases... but the approach is kind of awkward: rather than actually moving the profitability check,
it adds a special case to set the "Force" bit for certain loops.

Isn't it the same for Runtime, Partial? Setting count means almost the same. Count of 1 or 0 means we'll act in the same way if Runtime is false. I can rewrite this and set Force to true by default (but set count to 0 or 1 for uncountable loops, that do not meet some conditions).
And yes you are right we need to move all profitability checks from Utils/LoopUnrollRuntime.cpp to Scalar/LoopUnrollPass.cpp (I plan to do this in another patch).,

> Conversely, unrolling a large loop just to eliminate one branch is probably a bad idea.
That is bounded by thresholds.

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)

Right. But full unroll is the only exception.
And yes this threshold is not that accurate, but it bounds loop unrolling. And it will bound uncountable loops unroll as well.

Not really... I mean, we have PartialThreshold as a maximum size for the whole unrolled loop, but that isn't sensitive to the dynamic cost savings. (Compare to getFullUnrollBoostingFactor for complete unrolling.)

Right. But full unroll is the only exception.
And yes this threshold is not that accurate, but it bounds loop unrolling. And it will bound uncountable loops unroll as well.

Okay, so is the general plan here to?

  1. Generally allow unrolling of uncountable loops
  2. Adjust the profitability heuristic to account for the cost of the branches for the uncountable case
  3. Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

Okay, so is the general plan here to?

  1. Generally allow unrolling of uncountable loops
  2. Adjust the profitability heuristic to account for the cost of the branches for the uncountable case
  3. Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

I agree that we should start from (1). But I don't understand, why (2) (and (3) ) is considered special for the uncountable case. AFAIU, the current heuristic aims at simulating these savings (removed branches) already - even though we do not try to accurately predict cost of rolled and unrolled version of the loop. Am I missing something?

Michael

Okay, so is the general plan here to?

  1. Generally allow unrolling of uncountable loops

That seems reasonable. I'll enable this at O3.

  1. Adjust the profitability heuristic to account for the cost of the branches for the uncountable case

In most cases we can not remove branches in uncountable (isa<SCEVCouldNotCompute>(SE->getBackedgeTakenCount(L)) is true) loops. And it is hard to detect cases when we can.
To be able to reduce IVs (if there are such) we'll need to count them and do some basic analysis. A lot of uncountable loops don't have IVs at all.
So the heuristic we have for Runtime unrolling (with unknown trip count) is not ok here.

  1. Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).

Right now this is the only way to see if unroll for uncountable loop is profitable or not.

I agree that we should start from (1). But I don't understand, why (2) (and (3) ) is considered special for the uncountable case. AFAIU, the current heuristic aims at simulating these savings (removed branches) already - even though we do not try to accurately predict cost of rolled and unrolled version of the loop. Am I missing something?

Current heuristic works for loops with unknown trip count (when we are able to calculate it, but runtime only). It is not applicable for uncountable loops.

evstupac updated this revision to Diff 98867.May 12 2017, 6:52 PM

Enabled uncountable (unable to calculate trip count) unroll by default at O3 and higher.
Renamed test (according to inline comments).

The use of the "Force" bit here is still really confusing... we need a better way of expressing the profitability of unrolling.

include/llvm/CodeGen/BasicTTIImpl.h
332

What effect does this have on performance?

The use of the "Force" bit here is still really confusing... we need a better way of expressing the profitability of unrolling.

The naming was introduced earlier. Changing it in this patch would be considered as "unrelated change".
"Force" mean we'll unroll loop anyway. For example if user set "pragma unroll(2)" we need to force unrolling (no matter: countable loop or not, it is expensive to calculate tripcount or not...).

include/llvm/CodeGen/BasicTTIImpl.h
332

spec performance is unchanged.
However I see performance gain on small loops working with list structures (that is the main reason I'm trying to apply this).
Generally I don't see why we are missing unroll of uncountable loops now. When we are able to prove profitability we should do this.

Changing it in this patch would be considered as "unrelated change".

It's still messy, and still needs to change before this lands. Changing it in a separate patch is fine, if that makes sense.


This change needs testcases, including a couple examples of loops where the unrolling is profitable.

lib/Transforms/Scalar/LoopUnrollPass.cpp
413

Indentation. And I'm not sure why you're adding an instruction to the worklist that might not even be in the loop.

466

Maybe this would be more clear if it returned a boolean? I'm not sure the rest of this code makes sense if PhiCycle isn't either 0 or 2.

608

What is this change doing? (Please put comments in the code.)

987

"Cost->UnrolledCost >= Cost->RolledDynamicCost" is the profitability check? Needs a comment to explain what that means.

Do we care how large the improvement is vs. the size of the loop?

990

Maybe rearrange this? UP.Force is only true on one codepath out of the previous if statement.