This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroller] Replace UnrollingPreferences::Force with ForceMaxCount + SystemZ getUnrollingPreferences().
ClosedPublic

Authored by jonpa on Sep 12 2016, 4:59 AM.

Details

Summary

This is an implementation of TargetTransformInfo::getUnrollingPreferences() for SystemZ.

In addition, there is a also a change in the UnrollingPreferences so that the Force member is instead ForceMaxCount. The motivation behind this is that it is important to get rid of tiny loops on SystemZ, and this is beneficial even with forced unrolling (cloned iterations). However, this should only be done with two or three iterations, so that the resulting loop is at least 6-8 instructions (the branch predictor can only handle a taken branch every other cycle). In order to achieve this, it is necessary to limit this with ForceMaxCount.

Somebody, please take a look at the common code change that was introduced here.

Diff Detail

Event Timeline

jonpa updated this revision to Diff 70991.Sep 12 2016, 4:59 AM
jonpa retitled this revision from to [LoopUnroller] Replace UnrollingPreferences::Force with ForceMaxCount + SystemZ getUnrollingPreferences()..
jonpa updated this object.
jonpa added reviewers: uweigand, chandlerc, hfinkel, evstupac.
jonpa added a subscriber: llvm-commits.
evstupac edited edge metadata.Sep 12 2016, 5:27 PM

Hi,

Overall, I don't like that Force unroll has its own MAX counter, but Partial and Runtime do not.
What is your case where you need to bound Force unroll counter, but Runtime is ok?
If you want to introduce a bound for Forced unroll, Threshold bound looks more general.

Do you have a test case to add?

Thanks,
Evgeny

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
262

"else" with brackets is better here

lib/Transforms/Utils/LoopUnroll.cpp
308

Is this possible to move "Count" change to /Scalar/ part?
This will also rewrite unroll count set by user (pragma unroll or -unroll-count option).

310

Same. "else" with brackets is better here.

jonpa added a comment.Sep 13 2016, 1:49 AM

Hi Evgeny,

thanks for review!

Overall, I don't like that Force unroll has its own MAX counter, but Partial and Runtime do not.

There is MaxCount for partial/runtime and also FullUnrollMaxCount. So to me ForceMaxCount seemed logical, since forced unrolling is certainly a bit separate from the rest, to me.

What is your case where you need to bound Force unroll counter, but Runtime is ok?

On z13, max 6 instructions can be decoded every cycle, but the SystemZ branch predictor can only handle a taken branch every other cycle. This means that tiny loops (<6 instructions or so) is really bad, and the main purpose of this patch is to get rid of them.

Before enabling forced unrolling, there were still a very significant amount of such small loops left in benchmarks (SPEC). On SystemZ, a conditional branch is statically predicted as not taken, which is why forced unrolling should work (by inserting conditional exit-branches after each cloned iteration). Experiments showed that performance was best if just force-unrolling two or three times, and not more. So the idea here is to force-unroll just enough to make the loop bigger than 6 instructions at least, but not more.

There is also the issue of store tags depletion on z13. Simply put, the resulting loop should not have too many stores, because that may cause very expensive flushes. This is implemented in the patch by counting all the stores in the loop, and never unrolling past 12 stores in the final loop. This limit must be used for all unrolling, including the forced unrolling (and runtime).

If you want to introduce a bound for Forced unroll, Threshold bound looks more general.

Do you mean I should Threshold as a way to limit the forced unrolling? Could this be done with exact control over resulting iterations?

Do you have a test case to add?

Sorry, but I can only say that this patch finally after a lot of experimentation passed extensive performance tests and is really needed without any functional change.

/Jonas

chandlerc edited edge metadata.Sep 13 2016, 4:55 AM

While I read your explanation of the particular architectural challenges faced, I don't really understand why partial / runtime unrolling plus a remainder loop isn't a strictly superior option compared to forced unrolling with a cap?

jonpa added a comment.Sep 13 2016, 5:16 AM

I agree that forced unrolling should be always tried last, since partial / runtime should always be preferred. I beleive this is also what happens if you enable -force, right?

I am basically assuming that the LoopUnroller does everything it can to do partial / runtime unrolling, and then as a final possibility also offers forced unrolling. In other words, I first tried enabling partial / runtime unrolling, only to find many loops not unrolled this way. Therefore I thought it was necessary to used forced unrolling as the only means left, thinking that sometimes UnrollRuntimeLoopRemainder() can't for example compute trip count, while this is no problem with forced unrolling.

Are you saying it somehow would be possible to get all loops runtime unrolled instead?

So the idea here is to force-unroll just enough to make the loop bigger than 6 instructions at least, but not more.

Then Threshold is better for your case. You should bound unrolling by 6. And first try to bound Runtime unroll (and therefore Force unroll as well). Only if this does not help create new Threshold for forced unroll.

With your current changes loop that has 150 iterations (UP.Threshold) after unroll is ok... that could unroll by 2 of 50 instructions loop.

Do you mean I should Threshold as a way to limit the forced unrolling? Could this be done with exact control over resulting iterations?

Yes. For IR. Loop unroller calls Loop size estimator. Threshold is bound for instructions after unroll.

jonpa added a comment.Sep 14 2016, 9:22 AM

Chandler,

On "SPEC", considering only final loops that ended up as 8 machine instructions or less,
with *my patch without forced unrolling*, I get that
`

   48  loops were unrolled, but ended up <= 8 MIs.
10353  loops were not unrolled, because:
  865  pragma disabled (from loop vectorizer -- scalar loop)
  342  UnrolLLoop() : Any of the first five "return false" (!Preheader or !LatchBlock or !L->isSafeToClone() or (!BI || BI->isUnconditional()))
 3174  Loops containing calls - disabled purposefully for SystemZ
 5969  Loops where UnrollLoop() return false if UnrollRuntimeLoopRemainder() returns false, whereof:
       2300  Any of the first three "return false" (!L->getExitingBlock() or !L->isLoopSimplifyForm() or !Exit)
       3669  4th "return false",   // Only unroll loops with a computable trip count, and the trip count needs to be an int value (allowing a pointer type is a TODO item)
`

Same, but the complete patch including forced unrolling:

 606  loops were unrolled, but ended up <= 8 MIs.  (I happen to know that they are practically all >6, though ;-)
4434  loops were not unrolled, because:
 865  pragma disabled (from loop vectorizer -- scalar loop)
 342  UnrolLLoop() : Any of the first five "return false" (!Preheader or !LatchBlock or !L->isSafeToClone() or (!BI || BI->isUnconditional()))
3174  Loops containing calls - disabled purposefully for SystemZ
  54  Loops where UnrollLoop() return false if UnrollRuntimeLoopRemainder() returns false, whereof:
       24  Any of the first three "return false" (!L->getExitingBlock() or !L->isLoopSimplifyForm() or !Exit)
       30  4th "return false",   // Only unroll loops with a computable trip count, and the trip count needs to be an int value (allowing a pointer type is a TODO item)

It is quite clear that for runtime unroller cannot currently handle all loops. The reasons seem to realate to the LoopInfo and SCEV classes.
It is also clear that forced unrolling basically handles nearly everything left.

Evegeny,

I see your point that there is already a loop size estimate available. This however is not an estimate only of instruction count, but also weighs in
execution factors.

My background here is that first of all I am interested in the resulting machine instruction count only, so e.g. a div should only count as one. Therefore I am doing

if (getUserCost(I) != TargetTransformInfo::TargetCostConstants::TCC_Free)
​ MICountEstimate++;

This patch as it is now gets rid of basically all loops (except those with calls and pragma unroll disable) smaller than 6 instructions, by using an MICountEstimate
of 12. I found that an estimate of 12 will include 99% of loops <= 8 MIs, and that the resulting ForceMaxCount of ceil(12 / MICountEstimate), limited by 3, got
rid of practically 99,9 of loops <= 6 MIs. This is exactly what was the goal with the patch, and this is a very good result that would be a shame to change.

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

This patch did not work very well before, so it was the forced unrolling in exactly this setting that made it all work (both performance wise, and also achieving the objective of getting rid of all small loops). Therefore I would be very reluctant to change any functionality.

So if you really object to this, I migh need to still do my own MICountEstimate, compute the MaxForceCount, and then somehow access the LoopSize estimate of the LoopUnrollPass, and in this indirect way give a new variable ForceThreshold a value that give the same results.

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

Is this true for other unroll candidates (Partial, Runtime)? I'm trying to understand what is so specific about Force unroll.
If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

Is this true for other unroll candidates (Partial, Runtime)? I'm trying to understand what is so specific about Force unroll.

All loops are bounded by number of stores. The difference with forced unrolling is that I want to use a general bound (max 3 iterations.). So even if there are no stores, there is still a limit. The produced loops are different with -force, and benchmarks have shown this is better (to not get e.g. 8 iterations with exit branches, but just 2-3).

If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

This is what the patch is doing, except it doesn't limit on branches for partial / runtime.

jonpa updated this revision to Diff 71915.Sep 20 2016, 4:00 AM
jonpa edited edge metadata.

Added brackets around else statements per suggestions.

jonpa marked 2 inline comments as done.Sep 20 2016, 4:10 AM

Evegeny, fixed the braces, but not sure about your other comment.

I would like to mention that small loops is bad on SystemZ for the reason that they make compiler development generally difficult, as they are sensitive to code changes. In other words, if working on any arbitrary optimization, a small code change could potentially cause a regression. This is the reason I have worked on loop unrolling, and why my scheduler patch for SystemZ is yet to be commited (D17260).

So I would like to commit this patch with the reasoning that

  1. Elimination of small loops is important on SystemZ, and only forced unrolling can (at least currently) handle them all (for numbers, see earlier reply to Chandler).
  2. Forced unrolling produces different results but is yet needed for SystemZ. It should however only be used sparingly as a last resort. It has been shown with benchmarks that it is best to do max 2-3 iterations, to get the loop above the "tiny" threshold. This means it does in fact deserve its own max count variable.
  3. The patch has been proven beneficial on benchmarks.

Will first await Evegenys reply, of course.

lib/Transforms/Utils/LoopUnroll.cpp
308

I am not sure what you mean. The patch does not change any previous behaviour for other targets than SystemZ. Please illustrate.

If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

This is what the patch is doing, except it doesn't limit on branches for partial / runtime.

There are several cases when runtime unroll fails. More than one exit branch is only one reason. That means you are limiting force unroll count even if there are less than 3 exit branches.

What I suggested is to count exit branches (without a dependency on unroll type). Say there are N in the loop. Then if (N > 1), set MaxCount to 3/(N - 1) without introducing ForceMaxCount.
That way if you have 2 or more exit branches - Runtime Unroll will fail, but Count will be not more than 3.
If you have 1 exit branch you will not affect MaxCount and proceed with RuntimeUnroll.
So basically the behavior would be the same but you'll not limit force unroll count by 3 when you have less than 3 exit branches.

lib/Transforms/Utils/LoopUnroll.cpp
310

Even for SystemZ this is not good. And someone else in future could also use ForceMaxCount for his own architecture.
It would be good not to change Count in many places.

jonpa updated this revision to Diff 72459.Sep 26 2016, 4:38 AM

Per Uli's suggestions, I have now tried an alternate approach of introducing a different change to the common code. Instead of the ForceMaxCount, the DefaultUnrollRuntimeCount value is moved into UnrollingPreferences. This way, SystemZ can set this to 4 instead of 8, which has been shown as good as the earlier version on the benchmarks.

jonpa added a comment.Sep 26 2016, 4:40 AM

What I suggested is to count exit branches (without a dependency on unroll type)...

I agree it might be interesting to count the number of branches and experiment further with that. It would be nice however to first get the basic patch approved as it is now.

The solution in general part is much better. I like it.
I've made only few general comments in SystemZ specific part. If someone working on SystemZ can review it that would be very good.
Unroll part LGTM.

include/llvm/Analysis/TargetTransformInfo.h
268

if -unroll-count is not set

We should mention #pragma as well or nothing.

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
247

according to the latest commits the following (or smth similar in your case) looks better:

for (auto &BB : L->blocks())
  for (auto &I : *BB)
261

should be one 1 line
} else {

jonpa updated this revision to Diff 72644.Sep 27 2016, 6:32 AM

Updated as requested.
NFC

jonpa marked 3 inline comments as done.Sep 27 2016, 6:33 AM
uweigand accepted this revision.Sep 27 2016, 6:44 AM
uweigand edited edge metadata.

The SystemZ part looks good to me.

Evgeny, thanks for your comments and review of the unroll part!

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
243

Comment now seems to be out of date, the code no longer computes a machine instruction count.

This revision is now accepted and ready to land.Sep 27 2016, 6:44 AM
jonpa closed this revision.Sep 28 2016, 2:52 AM

Thanks Evgeny for review and suggestions!

Commited as r282570