Page MenuHomePhabricator

[LoopUnroll] Unroll and Jam
ClosedPublic

Authored by dmgreen on Jan 11 2018, 10:18 AM.

Details

Summary

This is a implementation of unroll and jam, which is something that comes up as useful for our smaller embedded processors (and hopefully for other systems in general)

The basic idea is that we take an outer loop of the for:

for i..
  ForeBlocks(i)
  for j..
    SubLoopBlocks(i, j)
  AftBlocks(i)

Instead of doing normal inner or outer unrolling, we unroll as follows:

for i... i+=2
  ForeBlocks(i)
  ForeBlocks(i+1)
  for j..
    SubLoopBlocks(i, j)
    SubLoopBlocks(i+1, j)
  AftBlocks(i)
  AftBlocks(i+1)
Remainder

To do this we need to ensure that the ForeBlocks(i+1) can be moved before the SubLoopBlocks(i) and AftBlocks(i), which means potentially moving the phi node operands from AftBlocks into Fore. There are also memory dependency checks and other safety checks that are needed. The transform is then a fairly simple job of using the excellent existing unroll code for cloning blocks and gluing them all back together correctly.

  • The dependency analysis is built upon DependencyAnalysis that loop interchange uses. This might have been a mistake. I have made some changes to DA to ensure that the AA gave correct results (and enable TBAA). I have split these changes out into D42381. There may be more to do here to get values outside the loop to be ignored.
  • I have made this into a separate pass that runs just before the existing loop unroll.
  • The performance heuristics might not be sorted correctly yet. Several parts might be over-conservative when disabling for safety. The remainder loop may not be the best, I'm not sure how it will play with vectorisers, etc. It is hopefully a good enough start.
  • I have tested this from the C level (i.e. csmith) but not a lot from the IR level with some form of fuzzer.
  • Pragmas are mostly copied from unroll.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hintonda removed a subscriber: hintonda.Apr 4 2018, 4:36 PM
dmgreen updated this revision to Diff 142645.Apr 16 2018, 9:59 AM
dmgreen edited the summary of this revision. (Show Details)

This makes things into a new pass, currently disabled by default. I have attempted to copy what the unrolled does, without duplicating too much code. Pragmas and options should work, although I will need to add a few more tests for the different behaviours. I have tried to make it so that any unroll pragma will prevent unroll-and-jam, and added equivalent options/pragmas for unroll and jam. It still uses the unroll code as much as possible.

I have the clang side of the pragmas too, which I'll put up (although they may only even be useful for testing).

Hi Dave, thanks for making this into a pass, I think this looks a lot better now. I skimmed through the whole diff for first time, and just wrote down some things I noticed, mostly nitpicks, see the comments inlined. I will now study the different bits and pieces in more detail.

include/llvm/Analysis/TargetTransformInfo.h
427 ↗(On Diff #142645)

Can you be a bit more explicit here what the threshold exactly is?

427 ↗(On Diff #142645)

typo: trashold :)

lib/Analysis/DependenceAnalysis.cpp
110 ↗(On Diff #142645)

Will other people get unhappy when we flip this switch?

lib/Target/ARM/ARMTargetTransformInfo.cpp
626 ↗(On Diff #142645)

Some comments perhaps why it is set to 60?

lib/Transforms/IPO/PassManagerBuilder.cpp
659 ↗(On Diff #142645)

Elaborate a bit what you mean here (the 2nd sentence)?

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
148 ↗(On Diff #142645)

nit: extra space

203 ↗(On Diff #142645)

nit: know -> known

305 ↗(On Diff #142645)

nit: don't need the "!= 0"

lib/Transforms/Utils/LoopUnrollAndJam.cpp
121 ↗(On Diff #142645)

Some more nitpicking here, about names.

Perhaps "Fore" can be confusing: is it another loop, a block? So something like this:

for (i)
  block[i, 0]
  for (j)
    block[j, 0]
  end
  block[i, 1]
end

becomes:

for (i, i+=2)
  block[i, 0]
  block[i+1, 0]
  for (j, j+=4)
    block[j+0, 0]
    block[j+1, 0]
    block[j+2, 0]
    block[j+3, 0]
  end
  block[i, 1]
  block[i+1, 1]
end

Mentioning that the outer loop has been unrolled with a factor of 2, and the inner loop with an unroll factor of 4.

Not sure if the classic loopunroll and jam literature uses standard terminology here, but perhaps inner/outer loop rather than "Fore, Subloop and Aft".

Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in more deeply nested loop structure (and loopunroll and jam still trigger)?

139 ↗(On Diff #142645)

Blocks = outer loops?

So perhaps something along the lines of:

"So the outer loops are unrolled and the inner loops fused ("jammed")."

180 ↗(On Diff #142645)

Nit: wont

196 ↗(On Diff #142645)

nit: don't need to capitalize this message?

210 ↗(On Diff #142645)

same here?

542 ↗(On Diff #142645)

typo: shouldnt

dmgreen marked 5 inline comments as done.Apr 20 2018, 4:11 AM

Thanks for the comments! I'll try to split a few things out and update this accordingly

include/llvm/Analysis/TargetTransformInfo.h
427 ↗(On Diff #142645)

trashold?

lib/Analysis/DependenceAnalysis.cpp
110 ↗(On Diff #142645)

Oh yep, this should probably be a separate change.

I don't think people will be unhappy, but it shouldn't be lumped in here.

lib/Target/ARM/ARMTargetTransformInfo.cpp
626 ↗(On Diff #142645)

The limit here is set because the inner loop requires a smaller threshold than the outer, as unroll and jam can increase register pressure in the inner loop significantly. This was chosen semi-randomly to be a something around 20% of the normal threshold, to something that seemed to work in a few cases I tried. More performance testing may be needed to get something better.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
305 ↗(On Diff #142645)

This is copied from loop unroll. I think it's clearer this way what it's testing. But I can change it if you are highly opinionated :)

lib/Transforms/Utils/LoopUnrollAndJam.cpp
121 ↗(On Diff #142645)

I'm not sure about your loop here. It looks like it has inner loop unrolling, which UnJ doesn't do (that is left to the unroller after the outer loop has been unroll and jammed.

By "Fore" I was referring to blocks in the outer loop, but before the inner loop. "Aft" means blocks in the outer loop but after the inner loop. I made up the names, so am open to suggestions on better ones ;)

Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in more deeply nested loop structure (and loopunroll and jam still trigger)?

That should work I believe. If it doesn't it's a bug :) But I doubt it will ever be a profitable thing to do. It used to be disabled in the profitability check, but we may have lost that when I was moving things over the the new pass. I'll have a look.

There are restrictions mentioned in the safety checks, in isSafeToUnrollAndJam.

139 ↗(On Diff #142645)

I have tried to rewrite this a little.

196 ↗(On Diff #142645)

This is copied from loop unroll

210 ↗(On Diff #142645)

Same

dmgreen updated this revision to Diff 143328.Apr 20 2018, 9:55 AM

Address feedback and added some pragma tests, with some related fixes. Split out some parts into other commits.

I'm still trying to work on making the analysis more accurate so this works on more loops. It would be good if LAA could be made to work here (to get versioning too) but it looks very vectoriser shaped and doesn't handle outer loop quite yet. DA as used here might be the only way, but may need to learn a few more tricks (or may need some fixes).

rja added a subscriber: rja.May 8 2018, 11:39 PM

Address feedback and added some pragma tests, with some related fixes. Split out some parts into other commits.

I'm still trying to work on making the analysis more accurate so this works on more loops. It would be good if LAA could be made to work here (to get versioning too) but it looks very vectoriser shaped and doesn't handle outer loop quite yet. DA as used here might be the only way, but may need to learn a few more tricks (or may need some fixes).

Ping, how work is going on?

SjoerdMeijer added inline comments.May 9 2018, 2:18 AM
test/Transforms/LoopUnrollAndJam/unprofitable.ll
226 ↗(On Diff #143328)

Nit: do you need all this?

test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗(On Diff #143328)

Do we need all these checks? And with all the variable names still there, it looks like a fragile test to me. Same for the other functions below.

Ping, how work is going on?

Hello. I've been away for a few days. Is this something you are interested in?

I'm still wading through improvements / bugs in dependency analysis, in an attempt to make the dependency checking here more precise.

dmgreen updated this revision to Diff 146623.May 14 2018, 9:08 AM
dmgreen added subscribers: efriedma, sebpop, anna and 4 others.

I've updated the dependency checks to be a bit better. This has shown up some bugs/inaccuracies in DA, which will be needed to make this correct.

I'm also adding some people here who may have opinions about any of this.

dmgreen added inline comments.May 14 2018, 9:09 AM
test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗(On Diff #143328)

I think as we are only running a single pass (not -O3 or an entire backend), it should be relatively stable.

I've updated this one to be auto generated and show the entire code output. The others are usually testing some small portion of the resulting function.

SjoerdMeijer accepted this revision.May 16 2018, 8:56 AM

Hi Dave,

I think we have looked long enough at it now, and it's time to get some experience with it. :)
It's off by default, so I don't think there's any harm done here. Once it is in, it's easier to throw
more code at it, get more experience, and see if we further need to tweak or tune it; perhaps
we get feedback from others as well who are interested in using it.

Let's wait a few more days with committing to give people one more chance to object :-)

Inlined just a few more comments, mainly nits.

Cheers.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
84 ↗(On Diff #146623)

This is *almost* identical to GetUnrollMetadata() in LoopUnroll.cpp. Is this something that could be shared somehow? Or would that be too inconvenient?

229 ↗(On Diff #146623)

nit: numInvariant -> NumInvariant

lib/Transforms/Utils/LoopUnrollAndJam.cpp
72 ↗(On Diff #146623)

nit: TODOD

398 ↗(On Diff #146623)

nit: don't need the brackets here

622 ↗(On Diff #146623)

Nit: TODOD

test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗(On Diff #143328)

Ok, fair enough, agreed.

This revision is now accepted and ready to land.May 16 2018, 8:56 AM
dmgreen marked 4 inline comments as done.May 21 2018, 10:57 AM
dmgreen added inline comments.
lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
84 ↗(On Diff #146623)

It's similar, but this version is only testing prefixes, not the whole string.

dmgreen updated this revision to Diff 147820.May 21 2018, 10:57 AM

OK thanks. I will leave this for at least a couple more days whilst I run extra tests. Any other comments from anyone are welcome/appreciated.

xbolva00 added inline comments.
lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
136 ↗(On Diff #147820)

static_cast?

lib/Transforms/Utils/LoopUnrollAndJam.cpp
174 ↗(On Diff #147820)

bool CompletelyUnroll = (Count == TripCount);
maybe better?

435 ↗(On Diff #147820)

Above you use a different style:
for (unsigned It = 1; It != Count; ++It) {

Can you make it same for all occurrences?

474 ↗(On Diff #147820)

This line could be placed above the condition.

dmgreen updated this revision to Diff 147983.May 22 2018, 4:23 AM
dmgreen marked 4 inline comments as done.

Review Updates.

xbolva00 accepted this revision.May 22 2018, 4:34 AM
This revision was automatically updated to reflect the committed changes.
echristo added a subscriber: echristo.

It sounded like Hal wanted to review this and I don't know that any of the other people on the review line have any experience with loop passes and so probably shouldn't have been approving this. I might be wrong here, and if so I apologize, but it seems like this went in a bit early.

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Thanks.

Reverted in 333359 as it's failing on some of the builders. I believe the tests are dependant on the arm backend being built.

I'll fix that and try to clean up the tests as you suggest at the same time. Some of them are longer than I'd like because we are dealing with nested loops.

I was taking Sjoerd review as OK to move things forward and improve things over time once it is in-tree. And I had attempted to leave this open for a while with plenty of people as subscribers, so that anyone could object. I had not added them as reviewers though, sorry if I misstepped.

It sounded like Hal wanted to review this

Thanks Eric. I don't need to review this if others who are qualified look at it. I've skimmed the code and it seems reasonable. There are a number of obvious enhancements that might be worthwhile, but I think it's best to work on those after this lands.

and I don't know that any of the other people on the review line have any experience with loop passes and so probably shouldn't have been approving this. I might be wrong here, and if so I apologize, but it seems like this went in a bit early.

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Thanks.

llvm/trunk/lib/Transforms/Utils/LoopUnrollAndJam.cpp
727

Can this overflow if the exit count is INT_MAX (or similar)? It's generally a good idea to comment on how that's handled.

This looked very reasonable to me as an initial commit. And I thought so too, like Hal and Eric, that different enhancements can still be made. But as I wrote, we can iterate further on this once we've got something in. So I don't think this was a misstep, and once the buildbot failures are resolved, it looks like this can be recommitted. However, as I also mentioned, another reason for an initial commit is more exposure, potentially more people looking at it. So this discussion is great, the more feedback, the better; especially if there's something that needs to be addressed before a recommit.

hsaito added a subscriber: hsaito.May 28 2018, 11:09 PM

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

dmgreen reopened this revision.May 29 2018, 11:47 AM

There are a number of obvious enhancements that might be worthwhile, but I think it's best to work on those after this lands.

Any suggestions on what would be at the top of the list? :)

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

Yeah, sounds like a good idea. Vectorisation is not something I know a huge amount about, it not being useful for the particular cores I'm looking at at the moment. But the work on VPlan sounds interesting and I do wish we had something similar for loop transforms in llvm. I was hoping to get loop versioning working, maybe through LoopAccessAnalysis, but as far as I understand it doesn't handle outer loops yet (and seemed very vectoriser shaped when I looked). From what I remember hearing, this isn't part of the outer loop vectorisation work (instead relying on user directives?). I imagine it's not an easy task to extend this out.

The current legality checks here are based on dependence analysis, and there are a couple of patches needed to make it work correctly. The cost modelling is very adhoc. :) All these things I hope we can improve as we go along.

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Do you mean names of the test files, or names of variables in the tests? I had look but didn't feel I was making things much better (apart from the names of some lcssa nodes). Any other suggestions?

llvm/trunk/lib/Transforms/Utils/LoopUnrollAndJam.cpp
727

Yep, you are correct. This code comes from unroll-runtime (minus the missing overflow check), and dates back to before unroll-and-jam was a separate pass. I was trying to prevent this function from returning true if the runtime unrolling was going to fail. It can now be safely removed and leave that check to the call to UnrollRuntimeLoopRemainder.

This revision is now accepted and ready to land.May 29 2018, 11:47 AM
dmgreen planned changes to this revision.May 29 2018, 11:47 AM

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

Yeah, sounds like a good idea. Vectorisation is not something I know a huge amount about, it not being useful for the particular cores I'm looking at at the moment. But the work on VPlan sounds interesting and I do wish we had something similar for loop transforms in llvm. I was hoping to get loop versioning working, maybe through LoopAccessAnalysis, but as far as I understand it doesn't handle outer loops yet (and seemed very vectoriser shaped when I looked). From what I remember hearing, this isn't part of the outer loop vectorisation work (instead relying on user directives?). I imagine it's not an easy task to extend this out.

The current legality checks here are based on dependence analysis, and there are a couple of patches needed to make it work correctly. The cost modelling is very adhoc. :) All these things I hope we can improve as we go along.

We are currently focusing on directive based implementation, but the same framework can be used for automatic vectorization at the outer loop level. It's not like vectorizer side has a lot of immediately reusable code in place, but I think co-designing the solution makes a lot of sense, to avoid duplicate work (or to consolidate into shared reusable code). For example, uniform control flow check @dcaballe has written is probably shareable. Ditto for Divergence Analysis RFC by Simon Moll. Dependence Analysis for outer loop vectorization isn't something we are currently working on, and thus being able to reuse is a very nice thing.

For example, uniform control flow check @dcaballe has written is probably shareable. Ditto for Divergence Analysis RFC by Simon Moll.

You could have a look at isExplicitVecOuterLoop and isUniformLoop in D42447. Currently, they are implementing specific checks for supported outer loops in VPlan but some of this code could be generalized and refactored.
Simon's RFC is here, in case you missed it: http://lists.llvm.org/pipermail/llvm-dev/2018-May/123606.html

Dependence Analysis for outer loop vectorization isn't something we are currently working on, and thus being able to reuse is a very nice thing.

Agreed. Outer loop auto-vectorization and, in particular, extensions in legality analysis to support outer loops, including LoopAccessAnalysis, is something that we are not currently working on. I briefly mentioned that and other open TODOs in my EuroLLVM talk, in case there is any other overlap/reusability opportunity with unroll-and-jam: https://www.youtube.com/watch?v=z6NeHLRNVok&t=27m50s

Thanks,
Diego

Any suggestions on what would be at the top of the list? :)

For me the first thing would probably be a rather straightforward thing: more OptimizationRemarks. I think it currently reports only something after a successful transformation. But perhaps more interesting to know is why it isn't triggering (if that's what you expect): so optimisation remarks about the legality, profitability, or any other reason, would be crucial to understand what's going on. Then, based on this optimisation report, and if the transformation really needs to be triggered, the user can resort to pragmas (for which you have another patch) and/or tweaking options for treshhold values.

This is an implementation detail, but I am still not sure about some of the terminology that is used, e.g. "Fore", "Subloops", etc., but I've not come up yet with convincing alternatives...

Also, we probably want to revise some of the threshold values, and see if we need to tweak them further.

But assuming that the dependence analysis is now conservative/correct, these are the sort of things I was thinking of that we can improve after this lands and we get some more experience with the pass.

dmgreen updated this revision to Diff 149147.May 30 2018, 9:57 AM
dmgreen added a reviewer: hfinkel.

Attempted to clean up tests and make them not rely on arm. Removed the outer loop IV check, some other small code edits and some comment improvements/formatting.

This revision is now accepted and ready to land.May 30 2018, 9:57 AM
SjoerdMeijer accepted this revision.Jun 1 2018, 3:29 AM

Cleanup/simplification of the tests looks good to me. Just a nit inlined.

lib/Transforms/Utils/LoopUnrollAndJam.cpp
543 ↗(On Diff #149147)

If you move this to after the asserts, the other NDEBUG block , can you then move this statement:

Loop *OuterL = L->getParentLoop();

to inside the other #ifndef NDEBUG block? Thus we avoid having just one statement hash defined here.

dmgreen updated this revision to Diff 151330.Jun 14 2018, 5:57 AM

OK. One DA patch is in, the other is still waiting for review.

Everyone happy with this to go into tree (its off by default)? Hal are you happy enough for us to continue work in tree? Eric you happy enough with the tests? No-one else have any objections?

I think you can commit this. And should there be more comments, then they can be addressed post commit.

OK. The important DA patches are in. Unless anyone objects, I will commit this soon (hopefully over the weekend).

This revision was automatically updated to reflect the committed changes.