This is an archive of the discontinued LLVM Phabricator instance.

LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling
ClosedPublic

Authored by escha on Mar 31 2016, 1:02 PM.

Details

Summary

I wanted to be able to do some partial unrolling on my target, as well as limit the unroll counts for full unrolling. This meant the behavior I wanted (which seemed quite reasonable to ask for!) looks like this:

  1. If doing full unrolling, use threshold X, and don't go over A iterations unrolled.
  2. If we don't meet the requirements for full unrolling, use threshold Y and don't go over A iterations unrolled. Also, make sure to make an unroll count that divides evenly into the loop count.
  3. Don't do runtime unrolling.

Unfortunately, I ran into three problems, which this patch fixes (comments welcome if there's any better ways to fix them).

  1. There's no way to limit the number of iterations for full unrolling -- only for partial/runtime unrolling. So I added that.
  2. A bug in partial unrolling causes it to not reduce the count to be modulo-tripcount if the PartialThreshold is already met. So I fixed that. I'm not sure if this bug can trigger without change 1), though.
  3. A bug in partial unrolling causes it to ignore MaxCount, even though MaxCount says it applies to everything but full unrolling. So I fixed that.

(Use-case: our target, a GPU, can [in TTI] roughly estimate the number of high-latency operations, like loads and texture reads, and make reasonable judgements as to how much unrolling is reasonable given that number. But to do that, we need to be able to put a cap on full unrolling separate from the overall cost threshold.)

Diff Detail

Repository
rL LLVM

Event Timeline

escha updated this revision to Diff 52268.Mar 31 2016, 1:02 PM
escha retitled this revision from to LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.
escha updated this object.
escha added reviewers: mzolotukhin, resistor.
escha set the repository for this revision to rL LLVM.
escha added a subscriber: llvm-commits.
zzheng added a subscriber: zzheng.Mar 31 2016, 4:57 PM

Updates in partial unrolling logic looks reasonable to me... just don't understand about limiting full unroll, and seems it's not limited in this patch.

lib/Transforms/Scalar/LoopUnrollPass.cpp
110

Where's FullUnrollMaxCount used?

If I understand your description correctly, full unroll and partial unroll are limited by the same max unroll count ('A iterations') but different thresholds ('threshold X' and 'threshold Y')?

I don't quiet understand the term 'put a cap on full unrolling'... If we can't unroll by compile-time-known tripcount, it becomes partial unrolling, right?

getUnrollingPreferences() in TTI can update UP.Count, UP.MaxCount and UP.PartialThreshold to bound partial unrolling.

escha added a comment.Mar 31 2016, 5:01 PM

FullUnrollMaxCount isn't used anywhere because it's a TTI option used by our out of tree target.

Count is a force-unroll option; it doesn't have anything to do with this, as far as I know.

MaxCount is what I would want, but it *specifically does not apply to full unrolling*, as specified in the documentation. I want to limit the number of iterations for a full unroll.

PartialThreshold, again, has nothing to do with full unrolls.

I don't quiet understand the term 'put a cap on full unrolling'... If we can't unroll by compile-time-known tripcount, it becomes partial unrolling, right?

Yes, exactly. If the compile-time known tripcount is greater than FullUnrollMaxCount, it will fall back to partial unrolling instead, because it cannot do a full unroll.

FullUnrollMaxCount isn't used anywhere because it's a TTI option used by our out of tree target.

If it's not used by target-independent unroll pass, perhaps you can put it in your TTI?

Count is a force-unroll option; it doesn't have anything to do with this, as far as I know.
...make reasonable judgements as to how much unrolling is reasonable given that number

That's what UP.Count is intended to do: let the target suggest an unroll count it deems reasonable.

MaxCount is what I would want, but it *specifically does not apply to full unrolling*, as specified in the documentation. I want to limit the number of iterations for a full unroll.

PartialThreshold, again, has nothing to do with full unrolls.

unsigned Count = UP.Count;
bool CountSetExplicitly = Count != 0;                                                                   
// Use a heuristic count if we didn't set anything explicitly.                                          
if (!CountSetExplicitly)
  Count = TripCount == 0 ? DefaultUnrollRuntimeCount : TripCount;
if (TripCount && Count > TripCount)
  Count = TripCount;

If Count, derived from UP.Count, (again, suggested by the target), is smaller than TripCount, it becomes partial unrolling and PartialThreshold/MaxCount applies automatically.

I'd add TripCount to UP so TTI::getUnrollingPreferences() can knowingly setup partial unrolling.

I don't quiet understand the term 'put a cap on full unrolling'... If we can't unroll by compile-time-known tripcount, it becomes partial unrolling, right?

Yes, exactly. If the compile-time known tripcount is greater than FullUnrollMaxCount, it will fall back to partial unrolling instead, because it cannot do a full unroll.

It seems partial unrolling is preferred over full unrolling for your target. I would set it up in getUnrollingPerferences()... unless it's too early for the TTI to estimate how much unrolling is suitable during LoopUnrollPass.

escha updated this revision to Diff 52317.Mar 31 2016, 5:45 PM

Ugh, my mistake, I missed one hunk when uploading the diff; the hunk where the variable was used! My mistake.

escha added a comment.Mar 31 2016, 5:47 PM

That's what UP.Count is intended to do: let the target suggest an unroll count it deems reasonable.

I thought UP.Count overrides Threshold, causing it to unroll loops even if their LoopSize is greater than the Threshold? If that's not true, then I guess that part of this patch is unnecessary, but it didn't seem to work that way when I tried.

It seems partial unrolling is preferred over full unrolling for your target. I would set it up in getUnrollingPerferences()... unless it's too early for the TTI to estimate how much unrolling is suitable during LoopUnrollPass.

No, full unrolling is strongly preferred. However, in some cases, full unrolling is too costly, either due to register usage (which our TTI calculates) or due to Threshold (which LoopUnrollPass calculates), and we'd rather partial unroll than do nothing at all.

escha added a comment.Mar 31 2016, 5:49 PM

Going by the code, Count is emphatically not what I want:

if (!AllowPartial && !CountSetExplicitly) {
  DEBUG(dbgs() << "  will not try to unroll partially because "
               << "-unroll-allow-partial not given\n");
  return false;
}
if (!AllowRuntime && !CountSetExplicitly) {
  DEBUG(dbgs() << "  will not try to unroll loop with runtime trip count "
               << "-unroll-runtime not given\n");
  return false;
}

It forces an unroll of Count, even if you haven't enabled Runtime unrolling and don't want it.

mzolotukhin edited edge metadata.Mar 31 2016, 6:05 PM

Hi,

In general the idea looks good to me, one remark is inline.

Michael

PS: Please upload the patch with full-context.

lib/Transforms/Scalar/LoopUnrollPass.cpp
639–644

I think we shouldn't do this if CountSetExplicitly is true.

That's what UP.Count is intended to do: let the target suggest an unroll count it deems reasonable.

I thought UP.Count overrides Threshold, causing it to unroll loops even if their LoopSize is greater than the Threshold? If that's not true, then I guess that part of this patch is unnecessary, but it didn't seem to work that way when I tried.

It seems partial unrolling is preferred over full unrolling for your target. I would set it up in getUnrollingPerferences()... unless it's too early for the TTI to estimate how much unrolling is suitable during LoopUnrollPass.

No, full unrolling is strongly preferred. However, in some cases, full unrolling is too costly, either due to register usage (which our TTI calculates) or due to Threshold (which LoopUnrollPass calculates), and we'd rather partial unroll than do nothing at all.

Count is still bound by UP.Threshold in canUnrollCompletely(), which decides full or partial unrolling. Setting UP.Threshold and UP.PartialThreshold but not UP.Count in TTI can limit full unroll as well.

Going by the code, Count is emphatically not what I want:

if (!AllowPartial && !CountSetExplicitly) {
  DEBUG(dbgs() << "  will not try to unroll partially because "
               << "-unroll-allow-partial not given\n");
  return false;
}
if (!AllowRuntime && !CountSetExplicitly) {
  DEBUG(dbgs() << "  will not try to unroll loop with runtime trip count "
               << "-unroll-runtime not given\n");
  return false;
}

It forces an unroll of Count, even if you haven't enabled Runtime unrolling and don't want it.

You are right with runtime unrolling part, :-)

Can you also add a test case?

lib/Transforms/Scalar/LoopUnrollPass.cpp
570

Now I understand what you're trying to do...

escha added a comment.Apr 1 2016, 10:04 AM

What sort of test case should I do for this (which behavior are you looking to test)? Should I make a commandline option so that this can easily be tested via 'opt'?

Nvm... I just realized FullUnrollMaxCount is set by your out of tree TTI and only that target needs to cap full unroll for now.

The test might be that given a loop with a TripCount=9, we unroll it with a factor of 3 (instead of 2 or 4)? Or something along these lines. Would it be possible, would it make sense?

escha added a comment.Apr 1 2016, 11:50 AM

I don't *think* it's possible for that problem to happen under normal circumstances without FullUnrollMaxCount being set. Normally, the only way we can fail to do full unrolling is if the full unroll cost is greater than Threshold. In this case, we'll go try partial unrolling, and we'll also be above the partial threshold (since a partial threshold higher than Threshold doesn't make sense), and it'll take the path that makes the count modulo-correct.

But if we set FullUnrollMaxCount, it's possible to reach this code with UnrolledSize <= PartialThreshold and a Count that isn't modulo TripCount, which leads to the bug.

In this case, we'll go try partial unrolling, and we'll also be above the partial threshold (since a partial threshold higher than Threshold doesn't make sense), and it'll take the path that makes the count modulo-correct.

Say we have loop with TripCount 9 and size 6, than fully unrolled size is ~9 * (6 - 2) + 2 = 38 (it could be less but not dramatically).
If we unroll the loop by 3, than unrolled size is 3 * (6 - 2) = 12. With Threshold = PartialThreshold = 20 we are able to do partial unrolling and will fail with full.

escha added a comment.Apr 1 2016, 1:00 PM

Yes, I understand that, but what I mean is, the bug won't trigger because we *will* get to the path that corrects Count because UnrolledSize > UP.PartialThreshold with Count = 9, and then it'll lower Count to 3. So we won't end up with a Count that isn't evenly dividing into 9.

Thanks, now I get it.
For your target there is a case when full unroll fails and then partial fails as well. And when patch applied partial unroll is successful. Right? So you can create target test.
It is really more convenient to look at code with maximum context:

svn diff -diff-cmd=diff -x -U999999
escha updated this revision to Diff 52428.Apr 1 2016, 2:33 PM
escha edited edge metadata.

Adding full context.

escha added a comment.Apr 1 2016, 2:36 PM

For your target there is a case when full unroll fails and then partial fails as well. And when patch applied partial unroll is successful. Right? So you can create target test.

Partial unroll didn't fail -- rather, it did the wrong thing. For example, suppose the loop count was 9, and the FullUnrollMaxCount was 6. The process would go like this:

  1. tripcount = 9, count = 9
  2. count = 6, because of FullUnrollMaxCount.
  3. count != tripcount, so full unroll can't be done; do partial instead.
  4. we're within the partial threshold, so skip the logic for that. do a partial unroll with count 6. (this is where we totally skip the logic to fix up the count)

Now we get a 6-wide partial unroll with a fixup loop at the end, which is totally not what we wanted.

Also, it's an out of tree target, so we can't possibly add an in-tree test. But I figure any target that wanted to limit the unroll count would have the same problem.

FullUnrollMaxCount was 6.

But now in LLVM trunk there is no way to make FullUnrollMaxCount 6. Right?
To make FullUnrollMaxCount introduction reasonable there should be a path in LLVM trunk where FullUnrollMaxCount can become something else than UINT_MAX. That will resolve the issue with test as well.

resistor edited edge metadata.Apr 1 2016, 3:14 PM

But now in LLVM trunk there is no way to make FullUnrollMaxCount 6. Right?
To make FullUnrollMaxCount introduction reasonable there should be a path in LLVM trunk where FullUnrollMaxCount can become something else than UINT_MAX. That will resolve the issue with test as well.

Do you have a concrete solution to suggest here? Because locking the infrastructure to only the feature set needed by upstreamed targets seems shortsighted.

Do you have a concrete solution to suggest here?

Maybe just introduce a command-line option for UP.FullUnrollMaxCount then?

escha added a comment.Apr 1 2016, 3:28 PM

Ironically it looks like MaxCount itself doesn't even have a commandline option either... should I introduce one for both? :/ it feels wrong to bloat the commandline options like this, I guess

It feels wrong to bloat the commandline options like this, I guess

I wouldn't worry too much about that. This only affects the command line of opt, not of clang, so it's not impacting an end-user tool.

zzheng added a comment.Apr 1 2016, 3:42 PM

Ironically it looks like MaxCount itself doesn't even have a commandline option either... should I introduce one for both? :/ it feels wrong to bloat the commandline options like this, I guess

static cl::opt<unsigned>                                                                                  
UnrollCount("unroll-count", cl::Hidden,                                                                   
  cl::desc("Use this unroll count for all loops including those with "                                    
           "unroll_count pragma values, for testing purposes"));

Is it reasonable to have one cmd option, similar to UnrollCount above, to curb both MaxCount and FullUnrollMaxCount?

The only target modifying MaxCount is AMDGPU, but it modifies it to UINT_MAX.
So all checks for exceeding MaxCount are not supposed to hit.

Personally I like Threshold bounding more than Count. So I'd better separate full and partial unroll thresholds in the patch. However there could be something target specific that I'm missing.

So is it possible to reuse MaxCount to bound full unrolling? That way we can introduce an option only for MaxCount?

escha updated this revision to Diff 52438.Apr 1 2016, 3:52 PM
escha edited edge metadata.

Added commandline options and a test.

escha added a comment.Apr 1 2016, 3:58 PM

I am slightly afraid to silently modify the behavior of MaxCount to affect full unrolling (when it didn't before) because out of tree users may be using it.

escha added a comment.Apr 4 2016, 12:44 PM

Poking this thread.

mzolotukhin accepted this revision.Apr 4 2016, 12:44 PM
mzolotukhin edited edge metadata.

LGTM!

Michael

test/Transforms/LoopUnroll/partial-unroll-maxcount.ll
19–22 ↗(On Diff #52438)

I think in most of the tests we have CHECK-lines before the test. Could you please move it there and add CHECK-LABEL in the beginning (in case we add more tests in future)?

This revision is now accepted and ready to land.Apr 4 2016, 12:44 PM
escha closed this revision.Apr 6 2016, 10:03 AM