This is an archive of the discontinued LLVM Phabricator instance.

Differential D41953

[LoopUnroll] Unroll and Jam
ClosedPublic

Authored by dmgreen on Jan 11 2018, 10:18 AM.

Details

Reviewers

deadalnix
javed.absar
SjoerdMeijer
xbolva00
chandlerc
sanjoy
hfinkel

Commits

rG963401d2be2d: [UnrollAndJam] New Unroll and Jam pass
rG3034281b437d: [UnrollAndJam] Add a new Unroll and Jam pass
rL336062: [UnrollAndJam] New Unroll and Jam pass
rL333358: [UnrollAndJam] Add a new Unroll and Jam pass

Summary

This is an implementation of unroll and jam, which is something that comes up as useful for our smaller embedded processors (and hopefully for other systems in general).

The basic idea is that we take an outer loop of the form:

for i..
  ForeBlocks(i)
  for j..
    SubLoopBlocks(i, j)
  AftBlocks(i)

Instead of doing normal inner or outer unrolling, we unroll as follows:

for i... i+=2
  ForeBlocks(i)
  ForeBlocks(i+1)
  for j..
    SubLoopBlocks(i, j)
    SubLoopBlocks(i+1, j)
  AftBlocks(i)
  AftBlocks(i+1)
Remainder

To do this we need to ensure that the ForeBlocks(i+1) can be moved before the SubLoopBlocks(i) and AftBlocks(i), which means potentially moving the phi node operands from AftBlocks into Fore. There are also memory dependency checks and other safety checks that are needed. The transform is then a fairly simple job of using the excellent existing unroll code for cloning blocks and gluing them all back together correctly.
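
As an illustration of the transformation described above, here is a minimal C++ sketch. It assumes a row-major matrix-vector product as the loop nest; the kernel and the function names are hypothetical and are not taken from the patch. The accumulator initialisation plays the role of ForeBlocks, the multiply-accumulate is SubLoopBlocks, and the final store is AftBlocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original nest: Fore(i), inner loop over j, Aft(i).
// A is an N x M matrix stored row-major; X has M elements.
std::vector<int> matVecOriginal(const std::vector<int> &A,
                                const std::vector<int> &X,
                                std::size_t N, std::size_t M) {
  assert(A.size() == N * M && X.size() == M);
  std::vector<int> Out(N);
  for (std::size_t i = 0; i < N; ++i) {
    int Sum = 0;                        // ForeBlocks(i)
    for (std::size_t j = 0; j < M; ++j)
      Sum += A[i * M + j] * X[j];       // SubLoopBlocks(i, j)
    Out[i] = Sum;                       // AftBlocks(i)
  }
  return Out;
}

// Hand-written equivalent of unroll and jam by a factor of 2.
std::vector<int> matVecUnrollAndJam(const std::vector<int> &A,
                                    const std::vector<int> &X,
                                    std::size_t N, std::size_t M) {
  assert(A.size() == N * M && X.size() == M);
  std::vector<int> Out(N);
  std::size_t i = 0;
  for (; i + 1 < N; i += 2) {
    int Sum0 = 0;                       // ForeBlocks(i)
    int Sum1 = 0;                       // ForeBlocks(i+1)
    for (std::size_t j = 0; j < M; ++j) {
      int Xj = X[j];                    // one load shared by both copies
      Sum0 += A[i * M + j] * Xj;        // SubLoopBlocks(i, j)
      Sum1 += A[(i + 1) * M + j] * Xj;  // SubLoopBlocks(i+1, j)
    }
    Out[i] = Sum0;                      // AftBlocks(i)
    Out[i + 1] = Sum1;                  // AftBlocks(i+1)
  }
  for (; i < N; ++i) {                  // Remainder: leftover outer iteration
    int Sum = 0;
    for (std::size_t j = 0; j < M; ++j)
      Sum += A[i * M + j] * X[j];
    Out[i] = Sum;
  }
  return Out;
}
```

In this sketch the jammed inner loop reads X[j] once for two outer iterations, which is the kind of load reuse the transform is aiming for. The reordering is only valid here because the two outer iterations do not depend on each other, which is what the safety checks in the patch have to establish in general.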

The dependency analysis is built upon DependencyAnalysis that loop interchange uses. This might have been a mistake. I have made some changes to DA to ensure that the AA gave correct results (and enable TBAA). I have split these changes out into D42381. There may be more to do here to get values outside the loop to be ignored.

I have made this into a separate pass that runs just before the existing loop unroll.

The performance heuristics might not be sorted correctly yet. Several parts might be over-conservative when disabling for safety. The remainder loop may not be the best, I'm not sure how it will play with vectorisers, etc. It is hopefully a good enough start.

I have tested this from the C level (i.e. csmith) but not a lot from the IR level with some form of fuzzer.

Pragmas are mostly copied from unroll.

Diff Detail

Event Timeline

dmgreen created this revision. Jan 11 2018, 10:18 AM

Herald added subscribers: kosarev, eraman, javed.absar and 2 others. Jan 11 2018, 10:18 AM

samparker added a subscriber: samparker. Jan 12 2018, 12:23 AM

I have split (most of) the Dependency Analysis parts into https://reviews.llvm.org/D42381 and https://reviews.llvm.org/D42382.

Herald added a subscriber: hintonda. Jan 23 2018, 2:37 AM

Ping. Anyone interested in unroll and jam? (I wasn't sure who, if anyone, to add as reviewers)

lebedev.ri retitled this revision from Unroll and Jam to [LoopUnroll] Unroll and Jam. Feb 5 2018, 11:51 AM

lebedev.ri edited the summary of this revision.

lebedev.ri set the repository for this revision to rL LLVM.

In D41953#998082, @dmgreen wrote:

Ping. Anyone interested in unroll and jam? (I wasn't sure who, if anyone, to add as reviewers)

I'm interested in this, but I might not get to it for a couple of weeks.

Thanks Hal. No rush.

lib/Transforms/Scalar/LICM.cpp
506	I'm splitting the move of this code out into a separate commit.

venkataramanan.kumar.llvm added a subscriber: venkataramanan.kumar.llvm. Feb 13 2018, 2:46 AM

Hi Dave,

I haven't looked at all the nitty gritty details yet, but had some general questions after a first glance:

this looks like a rather complete implementation (as opposed to some WIP prototype). In your description you mentioned something about testing. I was wondering if you can elaborate a little bit on this, if you had some more experiences with this pass since posting it. E.g., you've obviously added regression tests, but does it trigger also on the test-suite, etc.?
probably the other thing people first want to know is whether this is (always) beneficial for their back-end. Do you have some data/experience with this?

SjoerdMeijer added inline comments. Mar 21 2018, 7:40 AM

lib/Transforms/Utils/LoopUnroll.cpp
267	Looks like this is used by both LoopUnroll and UnrollAndJam. To simplify things, would it be worth separating this out into a different patch, if it is already useful for LoopUnroll?

this looks like a rather complete implementation (as opposed to some WIP prototype)

Hopefully. The things I'm not super happy with are where this fits into the unroll pass, and the tests in test/Other. Whether this is a new pass or a part of unroll is a little awkward because of the way we need to do the outer unroll-and-jam before we do the inner unroll.

The DA Analysis pass either needs to be added to the set of preserved loop passes, or (because it's trivial to create and holds no state) we could just create the DependenceInfo as needed. This would just need an AA in loop unroll.

In your description you mentioned something about testing. I was wondering if you can elaborate a little bit on this, if you had some more experiences with this pass since posting it.

So I've run the normal set of regression tests / supertest / perennial / test-suite / eembc / geekbench / spec's / etc. All without issue. It has also run probably a few hundred thousand csmith seeds (which is pretty good at creating the kind of nested loops with odd memory accesses that unroll and jam looks for). Those are all with -always-unroll-and-jam to bypass the profitability check. (There can, of course, always be bugs)

probably the other thing people first want to know is whether this is (always) beneficial for their back-end. Do you have some data/experience with this?

With the way DA and the profitability checks are right now, it may be a little conservative in when it is enabled. I have to look into this with the latest DA changes. I would expect it to be a win in most cases though. The performance is something I hope to work on more, so long as the basics are OK here. When it does fire, in the ideal case, it can help quite a bit with removing extra loads.

lib/Transforms/Utils/LoopUnroll.cpp
267	Sure, sounds good. I have split out the DA and other unrelated parts. This one is less useful on its own, if it's only used in one place, but I can split it out as a cleanup.

Hi Dave,

With respect to the awkward ordering, I don't understand why this transformation (as a standalone pass) couldn't perform the inner unrolling itself? I don't see why there needs to be a dependency. Also, will have loop rotate as a utility (https://reviews.llvm.org/D44595) help simplify this transform?

cheers,
sam

I think I agree with Sam's reply, also because I think that would avoid a regression that would be possible with this patch: neither loopunroll, nor loopunroll and jam runs. Because when it is determined that loopunroll is profitable, it will never run loopunroll, but loopunroll and jam may actually not happen; it is returning "LoopUnrollResult::Unmodified" in a few (exceptional) cases.

Sorry Sam, I missed your comment somehow in a number of other emails. Sorry about that.

By a bit awkward - I wasn't referring to unroll and jam doing the inner unroll. That sounds like a fine idea, like we have here it just tries to call tryToUnrollLoop again.

I just meant that two passes in the same LoopPassManager will process inner loops first for all passes, then move on to the outer loops. So we would either need to get the inner loop unroll to not unroll (preventing the outer unroll and jam) or we'd need to ensure the lpm is split in two (which may be fine, and may help with the other issues around analysis dependencies and not needing the explicit inner unroll). Providing this won't change the pass ordering for other code, which I don't think it will if I'm reading this pass pipeline correctly, this sounds like the best way to go to me.

I'll give it a try like that and see how it looks.

zzheng added a subscriber: zzheng. Apr 3 2018, 12:52 PM

Hi Dave,

Sorry, I don't follow what you mean by splitting the pass manager. My idea of the problem and phase ordering would be: unroll and jam, as a standalone pass, runs before normal unrolling and, if successful, uses the metadata to prevent further unrolling later on. Can this not be the case? What am I missing here?

cheers,
sam

I was being a bit vague, what I meant was that if we have a pass structure like this, as we have now but with added unroll and jam:

Loop Pass Manager
  Induction Variable Simplification
  Recognize loop idioms
  Delete dead loops
  Unroll and Jam
  Unroll loops

With a nest with inner and outer loops, the passes will be done in this order:

IndVar simplify inner loop
Recognise inner loop
Delete Dead inner loop
UnJ inner loop
Unroll inner loop
IndVar simplify outer loop
Recognise outer loop
Delete Dead outer loop
UnJ outer loop
Unroll outer loop

So when UnJ gets to the outer loop, the inner loop may have been unrolled. So the unroll for the inner loop would have to know whether the loop nest was profitable to unroll and jam and -not- unroll in that case. To get the UnJ of the outer loop to happen before the unroll of the inner loop, two loop pass managers are needed (in this case one just for the Unroll Loops). This can affect code generation for existing cases, which was something I was hoping to avoid. I saw some cases of infinite loops being generated differently when I had it like that for testing.
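
The ordering problem above can be sketched with a small simulation. This is a hypothetical model, not LLVM's pass manager: it only assumes, as described in the comment, that a loop pass manager visits loops innermost first and runs all of its passes on one loop before moving to the next. The function names are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a loop pass manager. Every pass runs on a loop
// before the next loop is visited; loops are given innermost first.
// Entries are appended to Log, so two managers can be chained.
std::vector<std::string>
runLoopPassManager(const std::vector<std::string> &Passes,
                   const std::vector<std::string> &LoopsInnerFirst,
                   std::vector<std::string> Log = {}) {
  for (const std::string &L : LoopsInnerFirst)
    for (const std::string &P : Passes)
      Log.push_back(P + " " + L);
  return Log;
}

// Position of an entry in the log; the entry is required to be present.
std::size_t indexOf(const std::vector<std::string> &Log,
                    const std::string &Entry) {
  auto It = std::find(Log.begin(), Log.end(), Entry);
  assert(It != Log.end() && "entry not found in log");
  return static_cast<std::size_t>(It - Log.begin());
}
```

With a single manager running {"UnJ", "Unroll"} over {"inner", "outer"}, the log contains "Unroll inner" before "UnJ outer", which is the problematic order. Chaining two managers, one running only "UnJ" and a second running only "Unroll", puts "UnJ outer" ahead of "Unroll inner".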

What I am hoping is that we won't have to add UnJ here in the pass pipeline, we can add it later at the other place that we do unrolling. This should allow us to more easily have two loop pass managers, simplifying things a lot :)

hintonda removed a subscriber: hintonda. Apr 4 2018, 4:36 PM

This makes things into a new pass, currently disabled by default. I have attempted to copy what the unroller does, without duplicating too much code. Pragmas and options should work, although I will need to add a few more tests for the different behaviours. I have tried to make it so that any unroll pragma will prevent unroll-and-jam, and added equivalent options/pragmas for unroll and jam. It still uses the unroll code as much as possible.

I have the clang side of the pragmas too, which I'll put up (although they may only even be useful for testing).

Herald added a reviewer: deadalnix. Apr 16 2018, 10:00 AM

Hi Dave, thanks for making this into a pass, I think this looks a lot better now. I skimmed through the whole diff for first time, and just wrote down some things I noticed, mostly nitpicks, see the comments inlined. I will now study the different bits and pieces in more detail.

include/llvm/Analysis/TargetTransformInfo.h
427	Can you be a bit more explicit here what the threshold exactly is?
427	typo: trashold :)
lib/Analysis/DependenceAnalysis.cpp
111	Will other people get unhappy when we flip this switch?
lib/Target/ARM/ARMTargetTransformInfo.cpp
622	Some comments perhaps why it is set to 60?
lib/Transforms/IPO/PassManagerBuilder.cpp
659 ↗	(On Diff #142645)	Elaborate a bit what you mean here (the 2nd sentence)?
lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
148 ↗	(On Diff #142645)	nit: extra space
203 ↗	(On Diff #142645)	nit: know -> known
305 ↗	(On Diff #142645)	nit: don't need the "!= 0"
lib/Transforms/Utils/LoopUnrollAndJam.cpp
122	Some more nitpicking here, about names. Perhaps "Fore" can be confusing: is it another loop, a block? So something like this:

for (i)
  block[i, 0]
  for (j)
    block[j, 0]
  end
  block[i, 1]
end

becomes:

for (i, i+=2)
  block[i, 0]
  block[i+1, 0]
  for (j, j+=4)
    block[j+0, 0]
    block[j+1, 0]
    block[j+2, 0]
    block[j+3, 0]
  end
  block[i, 1]
  block[i+1, 1]
end

Mentioning that the outer loop has been unrolled with a factor of 2, and the inner loop with an unroll factor of 4. Not sure if the classic loopunroll and jam literature uses standard terminology here, but perhaps inner/outer loop rather than "Fore, Subloop and Aft". Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in more deeply nested loop structure (and loopunroll and jam still trigger)?
140	Blocks = outer loops? So perhaps something along the lines of: "So the outer loops are unrolled and the inner loops fused ("jammed")."
181	Nit: wont
197	nit: don't need to capitalize this message?
211	same here?
543	typo: shouldnt

Thanks for the comments! I'll try to split a few things out and update this accordingly

include/llvm/Analysis/TargetTransformInfo.h
427	trashold?
lib/Analysis/DependenceAnalysis.cpp
111	Oh yep, this should probably be a separate change. I don't think people will be unhappy, but it shouldn't be lumped in here.
lib/Target/ARM/ARMTargetTransformInfo.cpp
622	The limit here is set because the inner loop requires a smaller threshold than the outer, as unroll and jam can increase register pressure in the inner loop significantly. This was chosen semi-randomly to be something around 20% of the normal threshold, a value that seemed to work in a few cases I tried. More performance testing may be needed to get something better.
lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
305 ↗	(On Diff #142645)	This is copied from loop unroll. I think it's clearer this way what it's testing. But I can change it if you are highly opinionated :)
lib/Transforms/Utils/LoopUnrollAndJam.cpp
122	I'm not sure about your loop here. It looks like it has inner loop unrolling, which UnJ doesn't do (that is left to the unroller after the outer loop has been unrolled and jammed). By "Fore" I was referring to blocks in the outer loop, but before the inner loop. "Aft" means blocks in the outer loop but after the inner loop. I made up the names, so am open to suggestions on better ones ;)

Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in more deeply nested loop structure (and loopunroll and jam still trigger)?

That should work I believe. If it doesn't it's a bug :) But I doubt it will ever be a profitable thing to do. It used to be disabled in the profitability check, but we may have lost that when I was moving things over to the new pass. I'll have a look. There are restrictions mentioned in the safety checks, in isSafeToUnrollAndJam.
140	I have tried to rewrite this a little.
197	This is copied from loop unroll
211	Same

dmgreen mentioned this in D45874: [LoopUnroll] Split out simplify code after Unroll into a new function. NFC. Apr 20 2018, 4:49 AM

Address feedback and added some pragma tests, with some related fixes. Split out some parts into other commits.

I'm still trying to work on making the analysis more accurate so this works on more loops. It would be good if LAA could be made to work here (to get versioning too) but it looks very vectoriser shaped and doesn't handle outer loop quite yet. DA as used here might be the only way, but may need to learn a few more tricks (or may need some fixes).

Herald added a reviewer: javed.absar. Apr 20 2018, 9:55 AM

In D41953#1073753, @dmgreen wrote:

Address feedback and added some pragma tests, with some related fixes. Split out some parts into other commits.

I'm still trying to work on making the analysis more accurate so this works on more loops. It would be good if LAA could be made to work here (to get versioning too) but it looks very vectoriser shaped and doesn't handle outer loop quite yet. DA as used here might be the only way, but may need to learn a few more tricks (or may need some fixes).

Ping, how is the work going?

SjoerdMeijer added inline comments. May 9 2018, 2:18 AM

test/Transforms/LoopUnrollAndJam/unprofitable.ll
226 ↗	(On Diff #143328)	Nit: do you need all this?
test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗	(On Diff #143328)	Do we need all these checks? And with all the variable names still there, it looks like a fragile test to me. Same for the other functions below.

Ping, how is the work going?

Hello. I've been away for a few days. Is this something you are interested in?

I'm still wading through improvements / bugs in dependency analysis, in an attempt to make the dependency checking here more precise.

I've updated the dependency checks to be a bit better. This has shown up some bugs/inaccuracies in DA, which will be needed to make this correct.

I'm also adding some people here who may have opinions about any of this.

dmgreen added inline comments. May 14 2018, 9:09 AM

test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗	(On Diff #143328)	I think as we are only running a single pass (not -O3 or an entire backend), it should be relatively stable. I've updated this one to be auto generated and show the entire code output. The others are usually testing some small portion of the resulting function.

Hi Dave,

I think we have looked long enough at it now, and it's time to get some experience with it. :)
It's off by default, so I don't think there's any harm done here. Once it is in, it's easier to throw
more code at it, get more experience, and see if we further need to tweak or tune it; perhaps
we get feedback from others as well who are interested in using it.

Let's wait a few more days with committing to give people one more chance to object :-)

Inlined just a few more comments, mainly nits.

Cheers.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
84 ↗	(On Diff #146623)	This is almost identical to GetUnrollMetadata() in LoopUnroll.cpp. Is this something that could be shared somehow? Or would that be too inconvenient?
229 ↗	(On Diff #146623)	nit: numInvariant -> NumInvariant
lib/Transforms/Utils/LoopUnrollAndJam.cpp
73	nit: TODOD
399	nit: don't need the brackets here
623	Nit: TODOD
test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
9 ↗	(On Diff #143328)	Ok, fair enough, agreed.

This revision is now accepted and ready to land. May 16 2018, 8:56 AM

dmgreen marked 4 inline comments as done. May 21 2018, 10:57 AM

dmgreen added inline comments.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
84 ↗	(On Diff #146623)	It's similar, but this version is only testing prefixes, not the whole string.
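
To make the distinction in this reply concrete, here is a minimal sketch of whole-string matching versus prefix matching over a list of metadata names. It works on plain strings rather than LLVM's MDNode API, and the helper names are hypothetical; the actual functions in LoopUnroll.cpp and LoopUnrollAndJamPass.cpp are not reproduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if some name equals Name exactly.
bool hasExactName(const std::vector<std::string> &Names,
                  const std::string &Name) {
  for (const std::string &Entry : Names)
    if (Entry == Name)
      return true;
  return false;
}

// Hypothetical helper: true if some name starts with Prefix.
bool hasNameWithPrefix(const std::vector<std::string> &Names,
                       const std::string &Prefix) {
  assert(!Prefix.empty() && "an empty prefix would match everything");
  for (const std::string &Entry : Names)
    // compare() clamps the length, so entries shorter than Prefix never match.
    if (Entry.compare(0, Prefix.size(), Prefix) == 0)
      return true;
  return false;
}
```

A prefix query such as "llvm.loop.unroll." matches any unroll hint at once (for example "llvm.loop.unroll.count"), which is one way to let any unroll pragma suppress unroll and jam without listing each name.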

OK thanks. I will leave this for at least a couple more days whilst I run extra tests. Any other comments from anyone are welcome/appreciated.

xbolva00 added a subscriber: xbolva00. May 21 2018, 11:31 AM

xbolva00 added inline comments.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
136 ↗	(On Diff #147820)	static_cast?
lib/Transforms/Utils/LoopUnrollAndJam.cpp
175	bool CompletelyUnroll = (Count == TripCount); maybe better?
436	Above you use a different style: for (unsigned It = 1; It != Count; ++It) { Can you make it same for all occurrences?
475	This line could be placed above the condition.

Review Updates.

xbolva00 accepted this revision. May 22 2018, 4:34 AM

dmgreen mentioned this in D47267: [UnrollAndJam] Add unroll_and_jam pragma handling. May 23 2018, 9:00 AM

Closed by commit rL333358: [UnrollAndJam] Add a new Unroll and Jam pass (authored by dmgreen). May 27 2018, 5:15 AM

This revision was automatically updated to reflect the committed changes.

It sounded like Hal wanted to review this and I don't know that any of the other people on the review line have any experience with loop passes and so probably shouldn't have been approving this. I might be wrong here, and if so I apologize, but it seems like this went in a bit early.

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Thanks.

Reverted in 333359 as it's failing on some of the builders. I believe the tests are dependent on the arm backend being built.

I'll fix that and try to clean up the tests as you suggest at the same time. Some of them are longer than I'd like because we are dealing with nested loops.

I was taking Sjoerd's review as OK to move things forward and improve things over time once it is in-tree. And I had attempted to leave this open for a while with plenty of people as subscribers, so that anyone could object. I had not added them as reviewers though, sorry if I misstepped.

In D41953#1113514, @echristo wrote:

It sounded like Hal wanted to review this

Thanks Eric. I don't need to review this if others who are qualified look at it. I've skimmed the code and it seems reasonable. There are a number of obvious enhancements that might be worthwhile, but I think it's best to work on those after this lands.

and I don't know that any of the other people on the review line have any experience with loop passes and so probably shouldn't have been approving this. I might be wrong here, and if so I apologize, but it seems like this went in a bit early.

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Thanks.

llvm/trunk/lib/Transforms/Utils/LoopUnrollAndJam.cpp
726 ↗	(On Diff #148747)	Can this overflow if the exit count is INT_MAX (or similar)? It's generally a good idea to comment on how that's handled.

This looked very reasonable to me as an initial commit. And I thought so too, like Hal and Eric, that different enhancements can still be made. But as I wrote, we can iterate further on this once we've got something in. So I don't think this was a misstep, and once the buildbot failures are resolved, it looks like this can be recommitted. However, as I also mentioned, another reason for an initial commit is more exposure, potentially more people looking at it. So this discussion is great, the more feedback, the better; especially if there's something that needs to be addressed before a recommit.

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

In D41953#1113730, @hfinkel wrote:

There are a number of obvious enhancements that might be worthwhile, but I think it's best to work on those after this lands.

Any suggestions on what would be at the top of the list? :)

In D41953#1114289, @hsaito wrote:

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

Yeah, sounds like a good idea. Vectorisation is not something I know a huge amount about, it not being useful for the particular cores I'm looking at at the moment. But the work on VPlan sounds interesting and I do wish we had something similar for loop transforms in llvm. I was hoping to get loop versioning working, maybe through LoopAccessAnalysis, but as far as I understand it doesn't handle outer loops yet (and seemed very vectoriser shaped when I looked). From what I remember hearing, this isn't part of the outer loop vectorisation work (instead relying on user directives?). I imagine it's not an easy task to extend this out.

The current legality checks here are based on dependence analysis, and there are a couple of patches needed to make it work correctly. The cost modelling is very adhoc. :) All these things I hope we can improve as we go along.

In D41953#1113514, @echristo wrote:

From a quick glance it also looks like the testcases could be cleaned up a bit. In particular the naming if nothing else.

Do you mean names of the test files, or names of variables in the tests? I had a look but didn't feel I was making things much better (apart from the names of some lcssa nodes). Any other suggestions?

llvm/trunk/lib/Transforms/Utils/LoopUnrollAndJam.cpp
726 ↗	(On Diff #148747)	Yep, you are correct. This code comes from unroll-runtime (minus the missing overflow check), and dates back to before unroll-and-jam was a separate pass. I was trying to prevent this function from returning true if the runtime unrolling was going to fail. It can now be safely removed and leave that check to the call to UnrollRuntimeLoopRemainder.
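
The code at line 726 is not shown in this archive, so the following is only a sketch of the general concern. It assumes a trip count is derived by adding one to an exit (backedge-taken) count held in a fixed-width unsigned integer; the helper name and the 32-bit width are hypothetical.

```cpp
#include <cassert>  // used by callers that check the result
#include <cstdint>
#include <limits>

// Hypothetical helper: computes TripCount = ExitCount + 1.
// Returns false, leaving TripCount untouched, if the addition would wrap.
bool tripCountFromExitCount(std::uint32_t ExitCount, std::uint32_t &TripCount) {
  if (ExitCount == std::numeric_limits<std::uint32_t>::max())
    return false; // ExitCount + 1 would wrap around to 0
  TripCount = ExitCount + 1u;
  return true;
}
```

Without the guard, an exit count at the maximum value would yield a trip count of 0, which a later divisibility or remainder computation could silently misinterpret.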

This revision is now accepted and ready to land. May 29 2018, 11:47 AM

dmgreen planned changes to this revision. May 29 2018, 11:47 AM

In D41953#1115101, @dmgreen wrote:

In D41953#1114289, @hsaito wrote:

David, outer loop vectorization we've been working on and unroll and jam have a lot of similarities and thus there should be code sharing opportunities around legality and cost modeling. Let's collaborate to improve both.

Yeah, sounds like a good idea. Vectorisation is not something I know a huge amount about, it not being useful for the particular cores I'm looking at at the moment. But the work on VPlan sounds interesting and I do wish we had something similar for loop transforms in llvm. I was hoping to get loop versioning working, maybe through LoopAccessAnalysis, but as far as I understand it doesn't handle outer loops yet (and seemed very vectoriser shaped when I looked). From what I remember hearing, this isn't part of the outer loop vectorisation work (instead relying on user directives?). I imagine it's not an easy task to extend this out.

The current legality checks here are based on dependence analysis, and there are a couple of patches needed to make it work correctly. The cost modelling is very adhoc. :) All these things I hope we can improve as we go along.

We are currently focusing on directive based implementation, but the same framework can be used for automatic vectorization at the outer loop level. It's not like vectorizer side has a lot of immediately reusable code in place, but I think co-designing the solution makes a lot of sense, to avoid duplicate work (or to consolidate into shared reusable code). For example, uniform control flow check @dcaballe has written is probably shareable. Ditto for Divergence Analysis RFC by Simon Moll. Dependence Analysis for outer loop vectorization isn't something we are currently working on, and thus being able to reuse is a very nice thing.

For example, uniform control flow check @dcaballe has written is probably shareable. Ditto for Divergence Analysis RFC by Simon Moll.

You could have a look at isExplicitVecOuterLoop and isUniformLoop in D42447. Currently, they are implementing specific checks for supported outer loops in VPlan but some of this code could be generalized and refactored.
Simon's RFC is here, in case you missed it: http://lists.llvm.org/pipermail/llvm-dev/2018-May/123606.html

Dependence Analysis for outer loop vectorization isn't something we are currently working on, and thus being able to reuse is a very nice thing.

Agreed. Outer loop auto-vectorization and, in particular, extensions in legality analysis to support outer loops, including LoopAccessAnalysis, is something that we are not currently working on. I briefly mentioned that and other open TODOs in my EuroLLVM talk, in case there is any other overlap/reusability opportunity with unroll-and-jam: https://www.youtube.com/watch?v=z6NeHLRNVok&t=27m50s

Thanks,
Diego

Any suggestions on what would be at the top of the list? :)

For me the first thing would probably be a rather straightforward thing: more OptimizationRemarks. I think it currently reports only something after a successful transformation. But perhaps more interesting to know is why it isn't triggering (if that's what you expect): so optimisation remarks about the legality, profitability, or any other reason, would be crucial to understand what's going on. Then, based on this optimisation report, and if the transformation really needs to be triggered, the user can resort to pragmas (for which you have another patch) and/or tweaking options for threshold values.

This is an implementation detail, but I am still not sure about some of the terminology that is used, e.g. "Fore", "Subloops", etc., but I've not come up yet with convincing alternatives...

Also, we probably want to revise some of the threshold values, and see if we need to tweak them further.

But assuming that the dependence analysis is now conservative/correct, these are the sort of things I was thinking of that we can improve after this lands and we get some more experience with the pass.

Attempted to clean up tests and make them not rely on arm. Removed the outer loop IV check, some other small code edits and some comment improvements/formatting.

This revision is now accepted and ready to land. May 30 2018, 9:57 AM

Cleanup/simplification of the tests looks good to me. Just a nit inlined.

lib/Transforms/Utils/LoopUnrollAndJam.cpp
544	If you move this to after the asserts, the other NDEBUG block , can you then move this statement: Loop *OuterL = L->getParentLoop(); to inside the other #ifndef NDEBUG block? Thus we avoid having just one statement hash defined here.

OK. One DA patch is in, the other is still waiting for review.

Everyone happy with this to go into tree (it's off by default)? Hal are you happy enough for us to continue work in tree? Eric you happy enough with the tests? No-one else have any objections?

I think you can commit this. And should there be more comments, then they can be addressed post commit.

OK. The important DA patches are in. Unless anyone objects, I will commit this soon (hopefully over the weekend).

Closed by commit rL336062: [UnrollAndJam] New Unroll and Jam pass (authored by dmgreen). Jul 1 2018, 5:52 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                         Size
include/llvm/Analysis/TargetTransformInfo.h                  4 lines
include/llvm/Transforms/Utils/UnrollLoop.h                   27 lines
lib/Analysis/DependenceAnalysis.cpp                          31 lines
lib/Target/ARM/ARMTargetTransformInfo.cpp                    2 lines
lib/Transforms/Scalar/LICM.cpp                               37 lines
lib/Transforms/Scalar/LoopUnrollPass.cpp                     55 lines
lib/Transforms/Utils/CMakeLists.txt                          1 line
lib/Transforms/Utils/LoopUnroll.cpp                          86 lines
lib/Transforms/Utils/LoopUnrollAndJam.cpp                    876 lines
lib/Transforms/Utils/LoopUtils.cpp                           38 lines
test/Other/new-pm-defaults.ll                                1 line
test/Other/new-pm-thinlto-defaults.ll                        1 line
test/Other/pass-pipelines.ll                                 2 lines
test/Transforms/LoopUnroll/unroll-and-jam-disabled.ll        728 lines
test/Transforms/LoopUnroll/unroll-and-jam-unprofitable.ll    223 lines
test/Transforms/LoopUnroll/unroll-and-jam.ll                 699 lines

Diff 129294

include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 416 Lines • ▼ Show 20 Lines	struct UnrollingPreferences {
/// (mainly to loops that fail runtime unrolling).		/// (mainly to loops that fail runtime unrolling).
bool Force;		bool Force;
/// Allow using trip count upper bound to unroll loops.		/// Allow using trip count upper bound to unroll loops.
bool UpperBound;		bool UpperBound;
/// Allow peeling off loop iterations for loops with low dynamic tripcount.		/// Allow peeling off loop iterations for loops with low dynamic tripcount.
bool AllowPeeling;		bool AllowPeeling;
/// Allow unrolling of all the iterations of the runtime loop remainder.		/// Allow unrolling of all the iterations of the runtime loop remainder.
bool UnrollRemainder;		bool UnrollRemainder;
		/// Allow unroll and jam
		bool UnrollAndJam;
		/// Threshold for unroll and jam, for inner loop size
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Can you be a bit more explicit here what the threshold exactly is? SjoerdMeijer: Can you be a bit more explicit here what the threshold exactly is?
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions typo: trashold :) SjoerdMeijer: typo: trashold :)
		dmgreenAuthorUnsubmitted Not Done Reply Inline Actions trashold? dmgreen: trashold?
		unsigned UnrollAndJamThreshold;
};		};

/// \brief Get target-customized preferences for the generic loop unrolling		/// \brief Get target-customized preferences for the generic loop unrolling
/// transformation. The caller will initialize UP with the current		/// transformation. The caller will initialize UP with the current
/// target-independent defaults.		/// target-independent defaults.
void getUnrollingPreferences(Loop *L, ScalarEvolution &,		void getUnrollingPreferences(Loop *L, ScalarEvolution &,
UnrollingPreferences &UP) const;		UnrollingPreferences &UP) const;

include/llvm/Transforms/Utils/UnrollLoop.h

Show All 13 Lines
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H		#ifndef LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H
#define LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H		#define LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H

#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
		#include "llvm/Transforms/Utils/ValueMapper.h"

namespace llvm {		namespace llvm {

class AssumptionCache;		class AssumptionCache;
class BasicBlock;		class BasicBlock;
		class DependenceInfo;
class DominatorTree;		class DominatorTree;
class Loop;		class Loop;
class LoopInfo;		class LoopInfo;
class MDNode;		class MDNode;
class OptimizationRemarkEmitter;		class OptimizationRemarkEmitter;
class ScalarEvolution;		class ScalarEvolution;

using NewLoopsMap = SmallDenseMap<const Loop *, Loop *, 4>;		using NewLoopsMap = SmallDenseMap<const Loop *, Loop *, 4>;
Show All 21 Lines	LoopUnrollResult UnrollLoop(Loop *L, unsigned Count, unsigned TripCount,
bool Force, bool AllowRuntime,		bool Force, bool AllowRuntime,
bool AllowExpensiveTripCount, bool PreserveCondBr,		bool AllowExpensiveTripCount, bool PreserveCondBr,
bool PreserveOnlyFirst, unsigned TripMultiple,		bool PreserveOnlyFirst, unsigned TripMultiple,
unsigned PeelCount, bool UnrollRemainder,		unsigned PeelCount, bool UnrollRemainder,
LoopInfo *LI, ScalarEvolution *SE,		LoopInfo *LI, ScalarEvolution *SE,
DominatorTree *DT, AssumptionCache *AC,		DominatorTree *DT, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, bool PreserveLCSSA);		OptimizationRemarkEmitter *ORE, bool PreserveLCSSA);

		LoopUnrollResult UnrollAndJamLoop(Loop *L, unsigned Count, unsigned TripCount,
		unsigned TripMultiple, bool UnrollRemainder,
		LoopInfo *LI, ScalarEvolution *SE,
		DominatorTree *DT, AssumptionCache *AC,
		OptimizationRemarkEmitter *ORE);

bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,		bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,
bool AllowExpensiveTripCount,		bool AllowExpensiveTripCount,
bool UseEpilogRemainder, bool UnrollRemainder,		bool UseEpilogRemainder, bool UnrollRemainder,
LoopInfo *LI,		LoopInfo *LI,
ScalarEvolution *SE, DominatorTree *DT,		ScalarEvolution *SE, DominatorTree *DT,
AssumptionCache *AC,		AssumptionCache *AC,
bool PreserveLCSSA);		bool PreserveLCSSA);

void computePeelCount(Loop *L, unsigned LoopSize,		void computePeelCount(Loop *L, unsigned LoopSize,
TargetTransformInfo::UnrollingPreferences &UP,		TargetTransformInfo::UnrollingPreferences &UP,
unsigned &TripCount);		unsigned &TripCount);

bool peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI, ScalarEvolution *SE,		bool peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI, ScalarEvolution *SE,
DominatorTree *DT, AssumptionCache *AC, bool PreserveLCSSA);		DominatorTree *DT, AssumptionCache *AC, bool PreserveLCSSA);

		bool isSafeToUnrollAndJam(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
		DependenceInfo &DI);

		bool isProfitableToUnrollAndJam(Loop *L, const TargetTransformInfo &TTI,
		ScalarEvolution &SE, DominatorTree &DT,
		DependenceInfo &DI,
		TargetTransformInfo::UnrollingPreferences &UP);

		BasicBlock *foldBlockIntoPredecessor(BasicBlock *BB, LoopInfo *LI,
		ScalarEvolution *SE,
		SmallPtrSetImpl<Loop *> &ForgottenLoops,
		DominatorTree *DT);

		void remapInstruction(Instruction *I, ValueToValueMapTy &VMap);

MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);		MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);

		void simplifyLoopAfterUnroll(Loop *L, bool SimplifyIVs, LoopInfo *LI,
		ScalarEvolution *SE, DominatorTree *DT,
		AssumptionCache *AC);

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H		#endif // LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H
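
For orientation, here is a minimal source-level sketch of the loop shape that the `UnrollAndJamLoop` declaration above is meant to produce. It is not taken from the patch: the function names, the row-sum workload and the fixed unroll count of 2 are all assumptions chosen for illustration, and the real transform works on LLVM IR rather than on C++ source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Reference nest: one fore block, one inner loop and one aft block per i.
std::vector<int> rowSums(const Matrix &A) {
  std::vector<int> Out(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) {
    int Sum = 0;                                   // ForeBlocks(i)
    for (std::size_t j = 0; j < A[i].size(); ++j)
      Sum += A[i][j];                              // SubLoopBlocks(i, j)
    Out[i] = Sum;                                  // AftBlocks(i)
  }
  return Out;
}

// Hand-written equivalent with the outer loop unrolled by 2 and the two
// copies of the inner loop jammed into one. Assumes a rectangular matrix.
std::vector<int> rowSumsUnrollAndJam(const Matrix &A) {
  std::vector<int> Out(A.size());
  const std::size_t Cols = A.empty() ? 0 : A[0].size();
  std::size_t i = 0;
  for (; i + 1 < A.size(); i += 2) {
    assert(A[i].size() == Cols && A[i + 1].size() == Cols &&
           "sketch assumes a rectangular matrix");
    int Sum0 = 0;                                  // ForeBlocks(i)
    int Sum1 = 0;                                  // ForeBlocks(i+1)
    for (std::size_t j = 0; j < Cols; ++j) {
      Sum0 += A[i][j];                             // SubLoopBlocks(i, j)
      Sum1 += A[i + 1][j];                         // SubLoopBlocks(i+1, j)
    }
    Out[i] = Sum0;                                 // AftBlocks(i)
    Out[i + 1] = Sum1;                             // AftBlocks(i+1)
  }
  // Remainder: leftover outer iterations run the original nest.
  for (; i < A.size(); ++i) {
    int Sum = 0;
    for (std::size_t j = 0; j < A[i].size(); ++j)
      Sum += A[i][j];
    Out[i] = Sum;
  }
  return Out;
}
```

The jammed inner loop keeps two accumulators live at once, which is one way to picture the register-pressure concern raised in the review comments on the threshold.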

lib/Analysis/DependenceAnalysis.cpp

Show First 20 Lines • Show All 102 Lines • ▼ Show 20 Lines
STATISTIC(GCDapplications, "GCD applications");		STATISTIC(GCDapplications, "GCD applications");
STATISTIC(GCDsuccesses, "GCD successes");		STATISTIC(GCDsuccesses, "GCD successes");
STATISTIC(GCDindependence, "GCD independence");		STATISTIC(GCDindependence, "GCD independence");
STATISTIC(BanerjeeApplications, "Banerjee applications");		STATISTIC(BanerjeeApplications, "Banerjee applications");
STATISTIC(BanerjeeIndependence, "Banerjee independence");		STATISTIC(BanerjeeIndependence, "Banerjee independence");
STATISTIC(BanerjeeSuccesses, "Banerjee successes");		STATISTIC(BanerjeeSuccesses, "Banerjee successes");

static cl::opt<bool>		static cl::opt<bool>
Delinearize("da-delinearize", cl::init(false), cl::Hidden, cl::ZeroOrMore,		Delinearize("da-delinearize", cl::init(true), cl::Hidden, cl::ZeroOrMore,
		SjoerdMeijer: Will other people get unhappy when we flip this switch?
		dmgreen: Oh yep, this should probably be a separate change. I don't think people will be unhappy, but it shouldn't be lumped in here.
cl::desc("Try to delinearize array references."));		cl::desc("Try to delinearize array references."));

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// basics		// basics

DependenceAnalysis::Result		DependenceAnalysis::Result
DependenceAnalysis::run(Function &F, FunctionAnalysisManager &FAM) {		DependenceAnalysis::run(Function &F, FunctionAnalysisManager &FAM) {
auto &AA = FAM.getResult<AAManager>(F);		auto &AA = FAM.getResult<AAManager>(F);
▲ Show 20 Lines • Show All 498 Lines • ▼ Show 20 Lines	else {
OS << "]";		OS << "]";
if (Splitable)		if (Splitable)
OS << " splitable";		OS << " splitable";
}		}
OS << "!\n";		OS << "!\n";
}		}

static AliasResult underlyingObjectsAlias(AliasAnalysis *AA,		static AliasResult underlyingObjectsAlias(AliasAnalysis *AA,
const DataLayout &DL, const Value *A,		const DataLayout &DL,
		const Instruction *AI, const Value *A,
		const Instruction *BI,
const Value *B) {		const Value *B) {
const Value *AObj = GetUnderlyingObject(A, DL);		const Value *AObj = GetUnderlyingObject(A, DL);
const Value *BObj = GetUnderlyingObject(B, DL);		const Value *BObj = GetUnderlyingObject(B, DL);
return AA->alias(AObj, DL.getTypeStoreSize(AObj->getType()),
BObj, DL.getTypeStoreSize(BObj->getType()));		MemoryLocation MAObj(AObj);
		MemoryLocation MBObj(BObj);

		if (const auto *LI = dyn_cast<const LoadInst>(AI))
		LI->getAAMetadata(MAObj.AATags);
		else if (const auto *SI = dyn_cast<const StoreInst>(AI))
		SI->getAAMetadata(MAObj.AATags);

		if (const auto *LI = dyn_cast<const LoadInst>(BI))
		LI->getAAMetadata(MBObj.AATags);
		else if (const auto *SI = dyn_cast<const StoreInst>(BI))
		SI->getAAMetadata(MBObj.AATags);

		return AA->alias(MAObj, MBObj);
}		}


// Returns true if the load or store can be analyzed. Atomic and volatile		// Returns true if the load or store can be analyzed. Atomic and volatile
// operations have properties which this analysis does not understand.		// operations have properties which this analysis does not understand.
static		static
bool isLoadOrStore(const Instruction *I) {		bool isLoadOrStore(const Instruction *I) {
if (const LoadInst *LI = dyn_cast<LoadInst>(I))		if (const LoadInst *LI = dyn_cast<LoadInst>(I))
▲ Show 20 Lines • Show All 2,661 Lines • ▼ Show 20 Lines	if (!isLoadOrStore(Src) || !isLoadOrStore(Dst)) {
// can only analyze simple loads and stores, i.e., no calls, invokes, etc.		// can only analyze simple loads and stores, i.e., no calls, invokes, etc.
DEBUG(dbgs() << "can only handle simple loads and stores\n");		DEBUG(dbgs() << "can only handle simple loads and stores\n");
return make_unique<Dependence>(Src, Dst);		return make_unique<Dependence>(Src, Dst);
}		}

Value *SrcPtr = getPointerOperand(Src);		Value *SrcPtr = getPointerOperand(Src);
Value *DstPtr = getPointerOperand(Dst);		Value *DstPtr = getPointerOperand(Dst);

switch (underlyingObjectsAlias(AA, F->getParent()->getDataLayout(), DstPtr,		switch (underlyingObjectsAlias(AA, F->getParent()->getDataLayout(), Dst,
SrcPtr)) {		DstPtr, Src, SrcPtr)) {
case MayAlias:		case MayAlias:
case PartialAlias:		case PartialAlias:
// cannot analyse objects if we don't understand their aliasing.		// cannot analyse objects if we don't understand their aliasing.
DEBUG(dbgs() << "can't analyze may or partial alias\n");		DEBUG(dbgs() << "can't analyze may or partial alias\n");
return make_unique<Dependence>(Src, Dst);		return make_unique<Dependence>(Src, Dst);
case NoAlias:		case NoAlias:
// If the objects noalias, they are distinct, accesses are independent.		// If the objects noalias, they are distinct, accesses are independent.
DEBUG(dbgs() << "no alias\n");		DEBUG(dbgs() << "no alias\n");
▲ Show 20 Lines • Show All 431 Lines • ▼ Show 20 Lines	const SCEV *DependenceInfo::getSplitIteration(const Dependence &Dep,
Instruction *Src = Dep.getSrc();		Instruction *Src = Dep.getSrc();
Instruction *Dst = Dep.getDst();		Instruction *Dst = Dep.getDst();
assert(Src->mayReadFromMemory() || Src->mayWriteToMemory());		assert(Src->mayReadFromMemory() || Src->mayWriteToMemory());
assert(Dst->mayReadFromMemory() || Dst->mayWriteToMemory());		assert(Dst->mayReadFromMemory() || Dst->mayWriteToMemory());
assert(isLoadOrStore(Src));		assert(isLoadOrStore(Src));
assert(isLoadOrStore(Dst));		assert(isLoadOrStore(Dst));
Value *SrcPtr = getPointerOperand(Src);		Value *SrcPtr = getPointerOperand(Src);
Value *DstPtr = getPointerOperand(Dst);		Value *DstPtr = getPointerOperand(Dst);
assert(underlyingObjectsAlias(AA, F->getParent()->getDataLayout(), DstPtr,		assert(underlyingObjectsAlias(AA, F->getParent()->getDataLayout(), Dst,
SrcPtr) == MustAlias);		DstPtr, Src, SrcPtr) == MustAlias);

// establish loop nesting levels		// establish loop nesting levels
establishNestingLevels(Src, Dst);		establishNestingLevels(Src, Dst);

FullDependence Result(Src, Dst, false, CommonLevels);		FullDependence Result(Src, Dst, false, CommonLevels);

// See if there are GEPs we can use.		// See if there are GEPs we can use.
bool UsefulGEP = false;		bool UsefulGEP = false;
lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 612 Lines • ▼ Show 20 Lines	void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
}		}

DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");		DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");

UP.Partial = true;		UP.Partial = true;
UP.Runtime = true;		UP.Runtime = true;
UP.UnrollRemainder = true;		UP.UnrollRemainder = true;
UP.DefaultUnrollRuntimeCount = 4;		UP.DefaultUnrollRuntimeCount = 4;
		UP.UnrollAndJam = true;
		UP.UnrollAndJamThreshold = 50;
		SjoerdMeijer: Some comments perhaps why it is set to 60?
		dmgreen: The limit here is set because the inner loop requires a smaller threshold than the outer, as unroll and jam can increase register pressure in the inner loop significantly. This was chosen semi-randomly to be something around 20% of the normal threshold, a value that seemed to work in the few cases I tried. More performance testing may be needed to get something better.

// Force unrolling small loops can be very useful because of the branch		// Force unrolling small loops can be very useful because of the branch
// taken cost of the backedge.		// taken cost of the backedge.
if (Cost < 12)		if (Cost < 12)
UP.Force = true;		UP.Force = true;
}		}

lib/Transforms/Scalar/LICM.cpp

Show First 20 Lines • Show All 497 Lines • ▼ Show 20 Lines	if (!inSubLoop(BB, CurLoop, LI))
CurLoop->getLoopPreheader()->getTerminator()))		CurLoop->getLoopPreheader()->getTerminator()))
Changed |= hoist(I, DT, CurLoop, SafetyInfo, ORE);		Changed |= hoist(I, DT, CurLoop, SafetyInfo, ORE);
}		}
}		}

return Changed;		return Changed;
}		}

/// Computes loop safety information, checks loop body & header
dmgreen: I'm moving the move of this code into a separate commit.
/// for the possibility of may throw exception.
///
void llvm::computeLoopSafetyInfo(LoopSafetyInfo *SafetyInfo, Loop *CurLoop) {
assert(CurLoop != nullptr && "CurLoop cant be null");
BasicBlock *Header = CurLoop->getHeader();
// Setting default safety values.
SafetyInfo->MayThrow = false;
SafetyInfo->HeaderMayThrow = false;
// Iterate over header and compute safety info.
for (BasicBlock::iterator I = Header->begin(), E = Header->end();
(I != E) && !SafetyInfo->HeaderMayThrow; ++I)
SafetyInfo->HeaderMayThrow |=
!isGuaranteedToTransferExecutionToSuccessor(&*I);

SafetyInfo->MayThrow = SafetyInfo->HeaderMayThrow;
// Iterate over loop instructions and compute safety info.
// Skip header as it has been computed and stored in HeaderMayThrow.
// The first block in loopinfo.Blocks is guaranteed to be the header.
assert(Header == *CurLoop->getBlocks().begin() &&
"First block must be header");
for (Loop::block_iterator BB = std::next(CurLoop->block_begin()),
BBE = CurLoop->block_end();
(BB != BBE) && !SafetyInfo->MayThrow; ++BB)
for (BasicBlock::iterator I = (*BB)->begin(), E = (*BB)->end();
(I != E) && !SafetyInfo->MayThrow; ++I)
SafetyInfo->MayThrow |= !isGuaranteedToTransferExecutionToSuccessor(&*I);

// Compute funclet colors if we might sink/hoist in a function with a funclet
// personality routine.
Function *Fn = CurLoop->getHeader()->getParent();
if (Fn->hasPersonalityFn())
if (Constant *PersonalityFn = Fn->getPersonalityFn())
if (isFuncletEHPersonality(classifyEHPersonality(PersonalityFn)))
SafetyInfo->BlockColors = colorEHFunclets(*Fn);
}

// Return true if LI is invariant within scope of the loop. LI is invariant if		// Return true if LI is invariant within scope of the loop. LI is invariant if
// CurLoop is dominated by an invariant.start representing the same memory		// CurLoop is dominated by an invariant.start representing the same memory
// location and size as the memory location LI loads from, and also the		// location and size as the memory location LI loads from, and also the
// invariant.start has no uses.		// invariant.start has no uses.
static bool isLoadInvariantInLoop(LoadInst *LI, DominatorTree *DT,		static bool isLoadInvariantInLoop(LoadInst *LI, DominatorTree *DT,
Loop *CurLoop) {		Loop *CurLoop) {
Value *Addr = LI->getOperand(0);		Value *Addr = LI->getOperand(0);
const DataLayout &DL = LI->getModule()->getDataLayout();		const DataLayout &DL = LI->getModule()->getDataLayout();
lib/Transforms/Scalar/LoopUnrollPass.cpp

Show All 19 Lines
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
		#include "llvm/Analysis/DependenceAnalysis.h"
#include "llvm/Analysis/LoopAnalysisManager.h"		#include "llvm/Analysis/LoopAnalysisManager.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/LoopPass.h"		#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/LoopUnrollAnalyzer.h"		#include "llvm/Analysis/LoopUnrollAnalyzer.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"		#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
▲ Show 20 Lines • Show All 150 Lines • ▼ Show 20 Lines	static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
UP.Partial = false;		UP.Partial = false;
UP.Runtime = false;		UP.Runtime = false;
UP.AllowRemainder = true;		UP.AllowRemainder = true;
UP.UnrollRemainder = false;		UP.UnrollRemainder = false;
UP.AllowExpensiveTripCount = false;		UP.AllowExpensiveTripCount = false;
UP.Force = false;		UP.Force = false;
UP.UpperBound = false;		UP.UpperBound = false;
UP.AllowPeeling = true;		UP.AllowPeeling = true;
		UP.UnrollAndJam = false;
		UP.UnrollAndJamThreshold = 50;

// Override with any target specific settings		// Override with any target specific settings
TTI.getUnrollingPreferences(L, SE, UP);		TTI.getUnrollingPreferences(L, SE, UP);

// Apply size attributes		// Apply size attributes
if (L->getHeader()->getParent()->optForSize()) {		if (L->getHeader()->getParent()->optForSize()) {
UP.Threshold = UP.OptSizeThreshold;		UP.Threshold = UP.OptSizeThreshold;
UP.PartialThreshold = UP.PartialOptSizeThreshold;		UP.PartialThreshold = UP.PartialOptSizeThreshold;
▲ Show 20 Lines • Show All 743 Lines • ▼ Show 20 Lines	#endif
DEBUG(dbgs() << " partially unrolling with count: " << UP.Count << "\n");		DEBUG(dbgs() << " partially unrolling with count: " << UP.Count << "\n");
if (UP.Count < 2)		if (UP.Count < 2)
UP.Count = 0;		UP.Count = 0;
return ExplicitUnroll;		return ExplicitUnroll;
}		}

static LoopUnrollResult tryToUnrollLoop(		static LoopUnrollResult tryToUnrollLoop(
Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,		Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
const TargetTransformInfo &TTI, AssumptionCache &AC,		const TargetTransformInfo &TTI, AssumptionCache &AC, DependenceInfo *DI,
OptimizationRemarkEmitter &ORE, bool PreserveLCSSA, int OptLevel,		OptimizationRemarkEmitter &ORE, bool PreserveLCSSA, int OptLevel,
Optional<unsigned> ProvidedCount, Optional<unsigned> ProvidedThreshold,		Optional<unsigned> ProvidedCount, Optional<unsigned> ProvidedThreshold,
Optional<bool> ProvidedAllowPartial, Optional<bool> ProvidedRuntime,		Optional<bool> ProvidedAllowPartial, Optional<bool> ProvidedRuntime,
Optional<bool> ProvidedUpperBound, Optional<bool> ProvidedAllowPeeling) {		Optional<bool> ProvidedUpperBound, Optional<bool> ProvidedAllowPeeling) {
DEBUG(dbgs() << "Loop Unroll: F[" << L->getHeader()->getParent()->getName()		DEBUG(dbgs() << "Loop Unroll: F[" << L->getHeader()->getParent()->getName()
<< "] Loop %" << L->getHeader()->getName() << "\n");		<< "] Loop %" << L->getHeader()->getName() << "\n");
if (HasUnrollDisablePragma(L))		if (HasUnrollDisablePragma(L))
return LoopUnrollResult::Unmodified;		return LoopUnrollResult::Unmodified;
▲ Show 20 Lines • Show All 82 Lines • ▼ Show 20 Lines	bool IsCountSetExplicitly =
computeUnrollCount(L, TTI, DT, LI, SE, &ORE, TripCount, MaxTripCount,		computeUnrollCount(L, TTI, DT, LI, SE, &ORE, TripCount, MaxTripCount,
TripMultiple, LoopSize, UP, UseUpperBound);		TripMultiple, LoopSize, UP, UseUpperBound);
if (!UP.Count)		if (!UP.Count)
return LoopUnrollResult::Unmodified;		return LoopUnrollResult::Unmodified;
// Unroll factor (Count) must be less or equal to TripCount.		// Unroll factor (Count) must be less or equal to TripCount.
if (TripCount && UP.Count > TripCount)		if (TripCount && UP.Count > TripCount)
UP.Count = TripCount;		UP.Count = TripCount;

		// TODO: Move somewhere better
		bool UnrollAndJam = false;
		if (UP.UnrollAndJam && UP.Count > 1 && DI) {
		if (isProfitableToUnrollAndJam(L, TTI, SE, DT, *DI, UP))
		UnrollAndJam = true;
		else if (L->getParentLoop() &&
		isProfitableToUnrollAndJam(L->getParentLoop(), TTI, SE, DT, *DI,
		UP))
		return LoopUnrollResult::Unmodified;
		}

// Unroll the loop.		// Unroll the loop.
LoopUnrollResult UnrollResult = UnrollLoop(		LoopUnrollResult UnrollResult;
L, UP.Count, TripCount, UP.Force, UP.Runtime, UP.AllowExpensiveTripCount,		if (!UnrollAndJam)
UseUpperBound, MaxOrZero, TripMultiple, UP.PeelCount, UP.UnrollRemainder,		UnrollResult =
LI, &SE, &DT, &AC, &ORE, PreserveLCSSA);		UnrollLoop(L, UP.Count, TripCount, UP.Force, UP.Runtime,
		UP.AllowExpensiveTripCount, UseUpperBound, MaxOrZero,
		TripMultiple, UP.PeelCount, UP.UnrollRemainder, LI, &SE, &DT,
		&AC, &ORE, PreserveLCSSA);
		else {
		Loop *SubLoop = L->getSubLoops()[0];

		UnrollResult =
		UnrollAndJamLoop(L, UP.Count, TripCount, TripMultiple,
		UP.UnrollRemainder, LI, &SE, &DT, &AC, &ORE);

		// Outer loop has been unrolled and jammed, now unroll the inner loop up to
		// the threshold
		tryToUnrollLoop(SubLoop, DT, LI, SE, TTI, AC, DI, ORE, PreserveLCSSA,
		OptLevel, ProvidedCount, ProvidedThreshold,
		ProvidedAllowPartial, ProvidedRuntime, ProvidedUpperBound,
		ProvidedAllowPeeling);
		}

if (UnrollResult == LoopUnrollResult::Unmodified)		if (UnrollResult == LoopUnrollResult::Unmodified)
return LoopUnrollResult::Unmodified;		return LoopUnrollResult::Unmodified;

// If loop has an unroll count pragma or unrolled by explicitly set count		// If loop has an unroll count pragma or unrolled by explicitly set count
// mark loop as unrolled to prevent unrolling beyond that requested.		// mark loop as unrolled to prevent unrolling beyond that requested.
// If the loop was peeled, we already "used up" the profile information		// If the loop was peeled, we already "used up" the profile information
// we had, so we don't want to unroll or peel again.		// we had, so we don't want to unroll or peel again.
if (UnrollResult != LoopUnrollResult::FullyUnrolled &&		if (UnrollResult != LoopUnrollResult::FullyUnrolled &&
Show All 36 Lines	bool runOnLoop(Loop *L, LPPassManager &LPM) override {
Function &F = *L->getHeader()->getParent();		Function &F = *L->getHeader()->getParent();

auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
ScalarEvolution &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();		ScalarEvolution &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
const TargetTransformInfo &TTI =		const TargetTransformInfo &TTI =
getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);		auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
		auto &DI = getAnalysis<DependenceAnalysisWrapperPass>().getDI();
// For the old PM, we can't use OptimizationRemarkEmitter as an analysis		// For the old PM, we can't use OptimizationRemarkEmitter as an analysis
// pass. Function analyses need to be preserved across loop transformations		// pass. Function analyses need to be preserved across loop transformations
// but ORE cannot be preserved (see comment before the pass definition).		// but ORE cannot be preserved (see comment before the pass definition).
OptimizationRemarkEmitter ORE(&F);		OptimizationRemarkEmitter ORE(&F);
bool PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);		bool PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);

LoopUnrollResult Result = tryToUnrollLoop(		LoopUnrollResult Result = tryToUnrollLoop(
L, DT, LI, SE, TTI, AC, ORE, PreserveLCSSA, OptLevel, ProvidedCount,		L, DT, LI, SE, TTI, AC, &DI, ORE, PreserveLCSSA, OptLevel,
ProvidedThreshold, ProvidedAllowPartial, ProvidedRuntime,		ProvidedCount, ProvidedThreshold, ProvidedAllowPartial, ProvidedRuntime,
ProvidedUpperBound, ProvidedAllowPeeling);		ProvidedUpperBound, ProvidedAllowPeeling);

if (Result == LoopUnrollResult::FullyUnrolled)		if (Result == LoopUnrollResult::FullyUnrolled)
LPM.markLoopAsDeleted(*L);		LPM.markLoopAsDeleted(*L);

return Result != LoopUnrollResult::Unmodified;		return Result != LoopUnrollResult::Unmodified;
}		}

/// This transformation requires natural loop information & requires that		/// This transformation requires natural loop information & requires that
/// loop preheaders be inserted into the CFG...		/// loop preheaders be inserted into the CFG...
void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AssumptionCacheTracker>();		AU.addRequired<AssumptionCacheTracker>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
		AU.addRequired<DependenceAnalysisWrapperPass>();
// FIXME: Loop passes are required to preserve domtree, and for now we just		// FIXME: Loop passes are required to preserve domtree, and for now we just
// recreate dom info if anything gets unrolled.		// recreate dom info if anything gets unrolled.
getLoopAnalysisUsage(AU);		getLoopAnalysisUsage(AU);
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

char LoopUnroll::ID = 0;		char LoopUnroll::ID = 0;

INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)		INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
INITIALIZE_PASS_DEPENDENCY(LoopPass)		INITIALIZE_PASS_DEPENDENCY(LoopPass)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(DependenceAnalysisWrapperPass)
INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)

Pass *llvm::createLoopUnrollPass(int OptLevel, int Threshold, int Count,		Pass *llvm::createLoopUnrollPass(int OptLevel, int Threshold, int Count,
int AllowPartial, int Runtime, int UpperBound,		int AllowPartial, int Runtime, int UpperBound,
int AllowPeeling) {		int AllowPeeling) {
// TODO: It would make more sense for this function to take the optionals		// TODO: It would make more sense for this function to take the optionals
// directly, but that's dangerous since it would silently break out of tree		// directly, but that's dangerous since it would silently break out of tree
// callers.		// callers.
Show All 12 Lines

PreservedAnalyses LoopFullUnrollPass::run(Loop &L, LoopAnalysisManager &AM,		PreservedAnalyses LoopFullUnrollPass::run(Loop &L, LoopAnalysisManager &AM,
LoopStandardAnalysisResults &AR,		LoopStandardAnalysisResults &AR,
LPMUpdater &Updater) {		LPMUpdater &Updater) {
const auto &FAM =		const auto &FAM =
AM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();		AM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
Function *F = L.getHeader()->getParent();		Function *F = L.getHeader()->getParent();

		auto DI = FAM.getCachedResult<DependenceAnalysis>(F);
auto ORE = FAM.getCachedResult<OptimizationRemarkEmitterAnalysis>(F);		auto ORE = FAM.getCachedResult<OptimizationRemarkEmitterAnalysis>(F);
// FIXME: This should probably be optional rather than required.		// FIXME: This should probably be optional rather than required.
if (!ORE)		if (!ORE)
report_fatal_error(		report_fatal_error(
"LoopFullUnrollPass: OptimizationRemarkEmitterAnalysis not "		"LoopFullUnrollPass: OptimizationRemarkEmitterAnalysis not "
"cached at a higher level");		"cached at a higher level");

// Keep track of the previous loop structure so we can identify new loops		// Keep track of the previous loop structure so we can identify new loops
// created by unrolling.		// created by unrolling.
Loop *ParentL = L.getParentLoop();		Loop *ParentL = L.getParentLoop();
SmallPtrSet<Loop *, 4> OldLoops;		SmallPtrSet<Loop *, 4> OldLoops;
if (ParentL)		if (ParentL)
OldLoops.insert(ParentL->begin(), ParentL->end());		OldLoops.insert(ParentL->begin(), ParentL->end());
else		else
OldLoops.insert(AR.LI.begin(), AR.LI.end());		OldLoops.insert(AR.LI.begin(), AR.LI.end());

std::string LoopName = L.getName();		std::string LoopName = L.getName();

bool Changed =		bool Changed =
tryToUnrollLoop(&L, AR.DT, &AR.LI, AR.SE, AR.TTI, AR.AC, *ORE,		tryToUnrollLoop(&L, AR.DT, &AR.LI, AR.SE, AR.TTI, AR.AC, DI, *ORE,
/*PreserveLCSSA*/ true, OptLevel, /*Count*/ None,		/*PreserveLCSSA*/ true, OptLevel, /*Count*/ None,
/*Threshold*/ None, /*AllowPartial*/ false,		/*Threshold*/ None, /*AllowPartial*/ false,
/*Runtime*/ false, /*UpperBound*/ false,		/*Runtime*/ false, /*UpperBound*/ false,
/*AllowPeeling*/ false) != LoopUnrollResult::Unmodified;		/*AllowPeeling*/ false) != LoopUnrollResult::Unmodified;
if (!Changed)		if (!Changed)
return PreservedAnalyses::all();		return PreservedAnalyses::all();

// The parent must not be damaged by unrolling!		// The parent must not be damaged by unrolling!
▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines

PreservedAnalyses LoopUnrollPass::run(Function &F,		PreservedAnalyses LoopUnrollPass::run(Function &F,
FunctionAnalysisManager &AM) {		FunctionAnalysisManager &AM) {
auto &SE = AM.getResult<ScalarEvolutionAnalysis>(F);		auto &SE = AM.getResult<ScalarEvolutionAnalysis>(F);
auto &LI = AM.getResult<LoopAnalysis>(F);		auto &LI = AM.getResult<LoopAnalysis>(F);
auto &TTI = AM.getResult<TargetIRAnalysis>(F);		auto &TTI = AM.getResult<TargetIRAnalysis>(F);
auto &DT = AM.getResult<DominatorTreeAnalysis>(F);		auto &DT = AM.getResult<DominatorTreeAnalysis>(F);
auto &AC = AM.getResult<AssumptionAnalysis>(F);		auto &AC = AM.getResult<AssumptionAnalysis>(F);
		auto &DI = AM.getResult<DependenceAnalysis>(F);
auto &ORE = AM.getResult<OptimizationRemarkEmitterAnalysis>(F);		auto &ORE = AM.getResult<OptimizationRemarkEmitterAnalysis>(F);

LoopAnalysisManager *LAM = nullptr;		LoopAnalysisManager *LAM = nullptr;
if (auto *LAMProxy = AM.getCachedResult<LoopAnalysisManagerFunctionProxy>(F))		if (auto *LAMProxy = AM.getCachedResult<LoopAnalysisManagerFunctionProxy>(F))
LAM = &LAMProxy->getManager();		LAM = &LAMProxy->getManager();

const ModuleAnalysisManager &MAM =		const ModuleAnalysisManager &MAM =
AM.getResult<ModuleAnalysisManagerFunctionProxy>(F).getManager();		AM.getResult<ModuleAnalysisManagerFunctionProxy>(F).getManager();
Show All 32 Lines	Optional<bool> AllowPartialParam, RuntimeParam, UpperBoundParam,
AllowPeeling;		AllowPeeling;
// Check if the profile summary indicates that the profiled application		// Check if the profile summary indicates that the profiled application
// has a huge working set size, in which case we disable peeling to avoid		// has a huge working set size, in which case we disable peeling to avoid
// bloating it further.		// bloating it further.
if (PSI && PSI->hasHugeWorkingSetSize())		if (PSI && PSI->hasHugeWorkingSetSize())
AllowPeeling = false;		AllowPeeling = false;
std::string LoopName = L.getName();		std::string LoopName = L.getName();
LoopUnrollResult Result =		LoopUnrollResult Result =
tryToUnrollLoop(&L, DT, &LI, SE, TTI, AC, ORE,		tryToUnrollLoop(&L, DT, &LI, SE, TTI, AC, &DI, ORE,
/*PreserveLCSSA*/ true, OptLevel, /*Count*/ None,		/*PreserveLCSSA*/ true, OptLevel, /*Count*/ None,
/*Threshold*/ None, AllowPartialParam, RuntimeParam,		/*Threshold*/ None, AllowPartialParam, RuntimeParam,
UpperBoundParam, AllowPeeling);		UpperBoundParam, AllowPeeling);
Changed |= Result != LoopUnrollResult::Unmodified;		Changed |= Result != LoopUnrollResult::Unmodified;

// The parent must not be damaged by unrolling!		// The parent must not be damaged by unrolling!
#ifndef NDEBUG		#ifndef NDEBUG
if (Result != LoopUnrollResult::Unmodified && ParentL)		if (Result != LoopUnrollResult::Unmodified && ParentL)
[13 lines not shown]

lib/Transforms/Utils/CMakeLists.txt

[21 lines not shown]	add_llvm_library(LLVMTransformUtils
ImportedFunctionsInliningStatistics.cpp		ImportedFunctionsInliningStatistics.cpp
InstructionNamer.cpp		InstructionNamer.cpp
IntegerDivision.cpp		IntegerDivision.cpp
LCSSA.cpp		LCSSA.cpp
LibCallsShrinkWrap.cpp		LibCallsShrinkWrap.cpp
Local.cpp		Local.cpp
LoopSimplify.cpp		LoopSimplify.cpp
LoopUnroll.cpp		LoopUnroll.cpp
		LoopUnrollAndJam.cpp
LoopUnrollPeel.cpp		LoopUnrollPeel.cpp
LoopUnrollRuntime.cpp		LoopUnrollRuntime.cpp
LoopUtils.cpp		LoopUtils.cpp
LoopVersioning.cpp		LoopVersioning.cpp
LowerInvoke.cpp		LowerInvoke.cpp
LowerMemIntrinsics.cpp		LowerMemIntrinsics.cpp
LowerSwitch.cpp		LowerSwitch.cpp
Mem2Reg.cpp		Mem2Reg.cpp
[28 lines not shown]

lib/Transforms/Utils/LoopUnroll.cpp

[45 lines not shown]
STATISTIC(NumCompletelyUnrolled, "Number of loops completely unrolled");		STATISTIC(NumCompletelyUnrolled, "Number of loops completely unrolled");
STATISTIC(NumUnrolled, "Number of loops unrolled (completely or otherwise)");		STATISTIC(NumUnrolled, "Number of loops unrolled (completely or otherwise)");

static cl::opt<bool>		static cl::opt<bool>
UnrollRuntimeEpilog("unroll-runtime-epilog", cl::init(false), cl::Hidden,		UnrollRuntimeEpilog("unroll-runtime-epilog", cl::init(false), cl::Hidden,
cl::desc("Allow runtime unrolled loops to be unrolled "		cl::desc("Allow runtime unrolled loops to be unrolled "
"with epilog instead of prolog."));		"with epilog instead of prolog."));

static cl::opt<bool>		cl::opt<bool>
UnrollVerifyDomtree("unroll-verify-domtree", cl::Hidden,		UnrollVerifyDomtree("unroll-verify-domtree", cl::Hidden,
cl::desc("Verify domtree after unrolling"),		cl::desc("Verify domtree after unrolling"),
#ifdef NDEBUG		#ifdef NDEBUG
cl::init(false)		cl::init(false)
#else		#else
cl::init(true)		cl::init(true)
#endif		#endif
);		);

/// Convert the instruction operands from referencing the current values into		/// Convert the instruction operands from referencing the current values into
/// those specified by VMap.		/// those specified by VMap.
static inline void remapInstruction(Instruction *I,		void llvm::remapInstruction(Instruction *I, ValueToValueMapTy &VMap) {
ValueToValueMapTy &VMap) {
for (unsigned op = 0, E = I->getNumOperands(); op != E; ++op) {		for (unsigned op = 0, E = I->getNumOperands(); op != E; ++op) {
Value *Op = I->getOperand(op);		Value *Op = I->getOperand(op);

// Unwrap arguments of dbg.value intrinsics.		// Unwrap arguments of dbg.value intrinsics.
bool Wrapped = false;		bool Wrapped = false;
if (auto *V = dyn_cast<MetadataAsValue>(Op))		if (auto *V = dyn_cast<MetadataAsValue>(Op))
if (auto *Unwrapped = dyn_cast<ValueAsMetadata>(V->getMetadata())) {		if (auto *Unwrapped = dyn_cast<ValueAsMetadata>(V->getMetadata())) {
Op = Unwrapped->getValue();		Op = Unwrapped->getValue();
[22 lines not shown]
/// Folds a basic block into its predecessor if it only has one predecessor, and		/// Folds a basic block into its predecessor if it only has one predecessor, and
/// that predecessor only has one successor.		/// that predecessor only has one successor.
/// The LoopInfo Analysis that is passed will be kept consistent. If folding is		/// The LoopInfo Analysis that is passed will be kept consistent. If folding is
/// successful references to the containing loop must be removed from		/// successful references to the containing loop must be removed from
/// ScalarEvolution by calling ScalarEvolution::forgetLoop because SE may have		/// ScalarEvolution by calling ScalarEvolution::forgetLoop because SE may have
/// references to the eliminated BB. The argument ForgottenLoops contains a set		/// references to the eliminated BB. The argument ForgottenLoops contains a set
/// of loops that have already been forgotten to prevent redundant, expensive		/// of loops that have already been forgotten to prevent redundant, expensive
/// calls to ScalarEvolution::forgetLoop. Returns the new combined block.		/// calls to ScalarEvolution::forgetLoop. Returns the new combined block.
static BasicBlock *		BasicBlock *llvm::foldBlockIntoPredecessor(
foldBlockIntoPredecessor(BasicBlock *BB, LoopInfo *LI, ScalarEvolution *SE,		BasicBlock *BB, LoopInfo *LI, ScalarEvolution *SE,
SmallPtrSetImpl<Loop *> &ForgottenLoops,		SmallPtrSetImpl<Loop *> &ForgottenLoops, DominatorTree *DT) {
DominatorTree *DT) {
// Merge basic blocks into their predecessor if there is only one distinct		// Merge basic blocks into their predecessor if there is only one distinct
// pred, and if there is only one distinct successor of the predecessor, and		// pred, and if there is only one distinct successor of the predecessor, and
// if there are no PHI nodes.		// if there are no PHI nodes.
BasicBlock *OnlyPred = BB->getSinglePredecessor();		BasicBlock *OnlyPred = BB->getSinglePredecessor();
if (!OnlyPred) return nullptr;		if (!OnlyPred) return nullptr;

if (OnlyPred->getTerminator()->getNumSuccessors() != 1)		if (OnlyPred->getTerminator()->getNumSuccessors() != 1)
return nullptr;		return nullptr;

DEBUG(dbgs() << "Merging: " << *BB << "into: " << *OnlyPred);		DEBUG(dbgs() << "Merging: " << BB->getName() << " into "
		<< OnlyPred->getName() << "\n");

// Resolve any PHI nodes at the start of the block. They are all		// Resolve any PHI nodes at the start of the block. They are all
// guaranteed to have exactly one entry if they exist, unless there are		// guaranteed to have exactly one entry if they exist, unless there are
// multiple duplicate (but guaranteed to be equal) entries for the		// multiple duplicate (but guaranteed to be equal) entries for the
// incoming edges. This occurs when there are multiple edges from		// incoming edges. This occurs when there are multiple edges from
// OnlyPred to OnlySucc.		// OnlyPred to OnlySucc.
FoldSingleEntryPHINodes(BB);		FoldSingleEntryPHINodes(BB);

[132 lines not shown]	static bool isEpilogProfitable(Loop *L) {
assert(PreHeader && Header);		assert(PreHeader && Header);
for (const PHINode &PN : Header->phis()) {		for (const PHINode &PN : Header->phis()) {
if (isa<ConstantInt>(PN.getIncomingValueForBlock(PreHeader)))		if (isa<ConstantInt>(PN.getIncomingValueForBlock(PreHeader)))
return true;		return true;
}		}
return false;		return false;
}		}

		void llvm::simplifyLoopAfterUnroll(Loop *L, bool SimplifyIVs, LoopInfo *LI,
		[Inline comment, Not Done] SjoerdMeijer: Looks like this is used by both LoopUnroll and UnrollAndJam. To simplify things, is it worth separating this out into a different patch, if it is already used by or beneficial for LoopUnroll?
		[Inline comment, Not Done] dmgreen (author): Sure, sounds good. I have split out the DA and other unrelated parts. This one is less useful on its own, if it's only used in one place, but I can split it out as a cleanup.
		ScalarEvolution *SE, DominatorTree *DT,
		AssumptionCache *AC) {
		// Simplify any new induction variables in the partially unrolled loop.
		if (SE && SimplifyIVs) {
		SmallVector<WeakTrackingVH, 16> DeadInsts;
		simplifyLoopIVs(L, SE, DT, LI, DeadInsts);

		// Aggressively clean up dead instructions that simplifyLoopIVs already
		// identified. Any remaining should be cleaned up below.
		while (!DeadInsts.empty())
		if (Instruction *Inst =
		dyn_cast_or_null<Instruction>(&*DeadInsts.pop_back_val()))
		RecursivelyDeleteTriviallyDeadInstructions(Inst);
		}

		// At this point, the code is well formed. We now do a quick sweep over the
		// inserted code, doing constant propagation and dead code elimination as we
		// go.
		const DataLayout &DL = L->getHeader()->getModule()->getDataLayout();
		const std::vector<BasicBlock *> &NewLoopBlocks = L->getBlocks();
		for (BasicBlock *BB : NewLoopBlocks) {
		for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;) {
		Instruction *Inst = &*I++;

		if (Value *V = SimplifyInstruction(Inst, {DL, nullptr, DT, AC}))
		if (LI->replacementPreservesLCSSAForm(Inst, V))
		Inst->replaceAllUsesWith(V);
		if (isInstructionTriviallyDead(Inst))
		BB->getInstList().erase(Inst);
		}
		}

		// TODO: after peeling or unrolling, previously loop variant conditions are
		// likely to fold to constants, eagerly propagating those here will require
		// fewer cleanup passes to be run. Alternatively, a LoopEarlyCSE might be
		// appropriate.
		}
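The sweep in simplifyLoopAfterUnroll erases instructions while it is still walking the block, which is why the iterator is advanced (`&*I++`) before the current instruction can be removed. A minimal standalone sketch of that idiom follows, using std::list in place of a basic block's instruction list. ToyInstr and eraseDead are hypothetical names for illustration only, not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical stand-in for an instruction: an id plus a "trivially dead" flag.
struct ToyInstr {
  int Id;
  bool Dead;
};

// Erase every dead element while walking the list. The iterator is stepped
// past the current element before that element is erased, so the loop never
// holds an iterator to removed storage.
inline int eraseDead(std::list<ToyInstr> &Block) {
  const std::size_t OriginalSize = Block.size();
  int Erased = 0;
  for (auto I = Block.begin(), E = Block.end(); I != E;) {
    auto Cur = I++; // Advance first; erasing Cur leaves I valid.
    if (Cur->Dead) {
      Block.erase(Cur);
      ++Erased;
    }
  }
  assert(Block.size() + Erased == OriginalSize);
  (void)OriginalSize;
  return Erased;
}
```

The same shape applies to any node-based container where erasing one element leaves other iterators valid.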

/// Unroll the given loop by Count. The loop must be in LCSSA form. Unrolling		/// Unroll the given loop by Count. The loop must be in LCSSA form. Unrolling
/// can only fail when the loop's latch block is not terminated by a conditional		/// can only fail when the loop's latch block is not terminated by a conditional
/// branch instruction. However, if the trip count (and multiple) are not known,		/// branch instruction. However, if the trip count (and multiple) are not known,
/// loop unrolling will mostly produce more code that is no faster.		/// loop unrolling will mostly produce more code that is no faster.
///		///
/// TripCount is the upper bound of the iteration on which control exits		/// TripCount is the upper bound of the iteration on which control exits
/// LatchBlock. Control may exit the loop prior to TripCount iterations either		/// LatchBlock. Control may exit the loop prior to TripCount iterations either
/// via an early branch in other loop block or via LatchBlock terminator. This		/// via an early branch in other loop block or via LatchBlock terminator. This
[508 lines not shown]	if (Term->isUnconditional()) {
std::replace(Latches.begin(), Latches.end(), Dest, Fold);		std::replace(Latches.begin(), Latches.end(), Dest, Fold);
UnrolledLoopBlocks.erase(std::remove(UnrolledLoopBlocks.begin(),		UnrolledLoopBlocks.erase(std::remove(UnrolledLoopBlocks.begin(),
UnrolledLoopBlocks.end(), Dest),		UnrolledLoopBlocks.end(), Dest),
UnrolledLoopBlocks.end());		UnrolledLoopBlocks.end());
}		}
}		}
}		}

// Simplify any new induction variables in the partially unrolled loop.
if (SE && !CompletelyUnroll && Count > 1) {
SmallVector<WeakTrackingVH, 16> DeadInsts;
simplifyLoopIVs(L, SE, DT, LI, DeadInsts);

// Aggressively clean up dead instructions that simplifyLoopIVs already
// identified. Any remaining should be cleaned up below.
while (!DeadInsts.empty())
if (Instruction *Inst =
dyn_cast_or_null<Instruction>(&*DeadInsts.pop_back_val()))
RecursivelyDeleteTriviallyDeadInstructions(Inst);
}

// At this point, the code is well formed. We now do a quick sweep over the		// At this point, the code is well formed. We now do a quick sweep over the
// inserted code, doing constant propagation and dead code elimination as we		// inserted code, doing constant propagation and dead code elimination as we
// go.		// go.
const DataLayout &DL = Header->getModule()->getDataLayout();		simplifyLoopAfterUnroll(L, !CompletelyUnroll && Count > 1, LI, SE, DT, AC);
const std::vector<BasicBlock*> &NewLoopBlocks = L->getBlocks();
for (BasicBlock *BB : NewLoopBlocks) {
for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ) {
Instruction *Inst = &*I++;

if (Value *V = SimplifyInstruction(Inst, {DL, nullptr, DT, AC}))
if (LI->replacementPreservesLCSSAForm(Inst, V))
Inst->replaceAllUsesWith(V);
if (isInstructionTriviallyDead(Inst))
BB->getInstList().erase(Inst);
}
}

// TODO: after peeling or unrolling, previously loop variant conditions are
// likely to fold to constants, eagerly propagating those here will require
// fewer cleanup passes to be run. Alternatively, a LoopEarlyCSE might be
// appropriate.

NumCompletelyUnrolled += CompletelyUnroll;		NumCompletelyUnrolled += CompletelyUnroll;
++NumUnrolled;		++NumUnrolled;

Loop *OuterL = L->getParentLoop();		Loop *OuterL = L->getParentLoop();
// Update LoopInfo if the loop is completely removed.		// Update LoopInfo if the loop is completely removed.
if (CompletelyUnroll)		if (CompletelyUnroll)
LI->erase(L);		LI->erase(L);
[73 lines not shown]

lib/Transforms/Utils/LoopUnrollAndJam.cpp

This file was added.

				//===-- LoopUnrollAndJam.cpp - Loop unrolling utilities -------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements loop unroll and jam as a routine, much like
				// LoopUnroll.cpp implements loop unroll.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/ADT/SmallPtrSet.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/AssumptionCache.h"
				#include "llvm/Analysis/DependenceAnalysis.h"
				#include "llvm/Analysis/InstructionSimplify.h"
				#include "llvm/Analysis/LoopAnalysisManager.h"
				#include "llvm/Analysis/LoopIterator.h"
				#include "llvm/Analysis/LoopPass.h"
				#include "llvm/Analysis/OptimizationRemarkEmitter.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpander.h"
				#include "llvm/IR/BasicBlock.h"
				#include "llvm/IR/DataLayout.h"
				#include "llvm/IR/DebugInfoMetadata.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/LLVMContext.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/Cloning.h"
				#include "llvm/Transforms/Utils/Local.h"
				#include "llvm/Transforms/Utils/LoopSimplify.h"
				#include "llvm/Transforms/Utils/LoopUtils.h"
				#include "llvm/Transforms/Utils/SimplifyIndVar.h"
				#include "llvm/Transforms/Utils/UnrollLoop.h"
				using namespace llvm;

				#define DEBUG_TYPE "loop-unroll"

				STATISTIC(NumUnrolledAndJammed, "Number of loops unroll and jammed");
				STATISTIC(NumCompletelyUnrolledAndJammed, "Number of loops unroll and jammed");

				static cl::opt<bool> AllowUnrollAndJam("unroll-and-jam", cl::init(true),
				cl::Hidden,
				cl::desc("Allow Unroll And Jam"));

				static cl::opt<bool>
				AlwaysUnrollAndJam("always-unroll-and-jam", cl::init(false), cl::Hidden,
				cl::desc("Always Unroll And Jam whenever Safe"));

				extern cl::opt<bool> UnrollVerifyDomtree;

				bool containsBB(std::vector<BasicBlock *> &V, BasicBlock *BB) {
				return std::find(V.begin(), V.end(), BB) != V.end();
				}

				// Partition blocks in an outer/inner loop pair into blocks before and after
				// the loop
				bool partitionOuterLoopBlocks(Loop *L, Loop *SubLoop,
				std::vector<BasicBlock *> &ForeBlocks,
				std::vector<BasicBlock *> &SubLoopBlocks,
				std::vector<BasicBlock *> &AftBlocks,
				DominatorTree *DT) {
				BasicBlock *SubLoopLatch = SubLoop->getLoopLatch();
				SubLoopBlocks = SubLoop->getBlocks();

				for (BasicBlock *BB : L->blocks()) {
				if (!SubLoop->contains(BB)) {
				[Inline comment, Done] SjoerdMeijer: nit: TODOD
				if (DT->dominates(SubLoopLatch, BB))
				AftBlocks.push_back(BB);
				else
				ForeBlocks.push_back(BB);
				}
				}

				// Check that all blocks in ForeBlocks together dominate the subloop
				// TODOD: This might ideally be done better with a dominator/postdominators.
				BasicBlock *SubLoopPreHeader = SubLoop->getLoopPreheader();
				for (BasicBlock *BB : ForeBlocks) {
				if (BB == SubLoopPreHeader)
				continue;
				TerminatorInst *TI = BB->getTerminator();
				for (unsigned i = 0, e = TI->getNumSuccessors(); i != e; ++i)
				if (!containsBB(ForeBlocks, TI->getSuccessor(i)))
				return false;
				}

				return true;
				}
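The partitioning above can be illustrated without LLVM. The sketch below is a toy model under stated assumptions: blocks are plain indices, every block listed belongs to the outer loop, and the dominance query `DT->dominates(SubLoopLatch, BB)` is replaced by a precomputed flag. ToyLoopNest, contains and partitionToy are hypothetical names, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy CFG: blocks are indices, Succs[b] lists b's successors
// inside the outer loop.
struct ToyLoopNest {
  std::vector<std::vector<int>> Succs;
  std::vector<bool> InSubLoop;        // Block belongs to the inner loop.
  std::vector<bool> DominatedByLatch; // Inner-loop latch dominates the block.
  int SubLoopPreheader;
};

inline bool contains(const std::vector<int> &V, int B) {
  return std::find(V.begin(), V.end(), B) != V.end();
}

// Split the outer-loop blocks into Fore and Aft, then check that control can
// only leave the Fore region through the inner loop's preheader.
inline bool partitionToy(const ToyLoopNest &N, std::vector<int> &Fore,
                         std::vector<int> &Aft) {
  for (int B = 0, E = static_cast<int>(N.Succs.size()); B != E; ++B) {
    if (N.InSubLoop[B])
      continue;
    (N.DominatedByLatch[B] ? Aft : Fore).push_back(B);
  }
  for (int B : Fore) {
    if (B == N.SubLoopPreheader)
      continue;
    for (int S : N.Succs[B])
      if (!contains(Fore, S))
        return false; // A Fore block branches around the inner loop.
  }
  return true;
}
```

With blocks 0 (header), 1 (inner preheader), 2 (inner loop) and 3 (outer latch), the partition succeeds; adding an edge from block 0 straight to block 3 makes it fail, because a Fore block then skips the inner loop.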

				// Move the phi operands of Header from Latch out of AftBlocks to InsertLoc.
				void moveHeaderPhiOperandsToForeBlocks(BasicBlock *Header, BasicBlock *Latch,
				Instruction *InsertLoc,
				std::vector<BasicBlock *> &AftBlocks) {
				// We need to ensure we move the instructions in the correct order,
				// starting with the earliest required instruction and going moving
				// forward.
				std::vector<Instruction *> Worklist;
				std::vector<Instruction *> Visited;
				for (auto &Phi : Header->phis()) {
				Value *V = Phi.getIncomingValueForBlock(Latch);
				if (Instruction *I = dyn_cast<Instruction>(V))
				Worklist.push_back(I);
				}

				while (!Worklist.empty()) {
				Instruction *I = Worklist.back();
				Worklist.pop_back();
				if (!containsBB(AftBlocks, I->getParent()))
				continue;

				Visited.push_back(I);
				for (auto &U : I->operands())
				if (Instruction *II = dyn_cast<Instruction>(U))
				Worklist.push_back(II);
				}

				[Inline comment, Not Done] SjoerdMeijer: Some more nitpicking here, about names. Perhaps "Fore" can be confusing: is it another loop, a block? So something like this:
				  for (i)
				    block[i, 0]
				    for (j)
				      block[j, 0]
				    end
				    block[i, 1]
				  end
				becomes:
				  for (i, i+=2)
				    block[i, 0]
				    block[i+1, 0]
				    for (j, j+=4)
				      block[j+0, 0]
				      block[j+1, 0]
				      block[j+2, 0]
				      block[j+3, 0]
				    end
				    block[i, 1]
				    block[i+1, 1]
				  end
				mentioning that the outer loop has been unrolled with a factor of 2, and the inner loop with an unroll factor of 4. Not sure if the classic loop unroll and jam literature uses standard terminology here, but perhaps inner/outer loop rather than "Fore, Subloop and Aft". Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in a more deeply nested loop structure (and loop unroll and jam still trigger)?
				[Inline comment, Not Done] dmgreen (author): I'm not sure about your loop here. It looks like it has inner loop unrolling, which UnJ doesn't do (that is left to the unroller after the outer loop has been unroll and jammed). By "Fore" I was referring to blocks in the outer loop, but before the inner loop. "Aft" means blocks in the outer loop but after the inner loop. I made up the names, so am open to suggestions on better ones ;)
				Quoting SjoerdMeijer: "Also, perhaps you want to mention some restrictions somewhere, if there are any. E.g., can this doubly nested loop occur in more deeply nested loop structure (and loopunroll and jam still trigger)?"
				That should work I believe. If it doesn't it's a bug :) But I doubt it will ever be a profitable thing to do. It used to be disabled in the profitability check, but we may have lost that when I was moving things over to the new pass. I'll have a look. There are restrictions mentioned in the safety checks, in isSafeToUnrollAndJam.
				// Move all instructions in program order to before the InsertLoc
				BasicBlock *InsertLocBB = InsertLoc->getParent();
				for (Instruction *I : reverse(Visited)) {
				if (I->getParent() != InsertLocBB)
				I->moveBefore(InsertLoc);
				}
				}
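The worklist walk in moveHeaderPhiOperandsToForeBlocks can be sketched on a toy model. The assumptions here are that instructions are indices into a vector, that each records its operands by index, and that a flag says whether it lives in the Aft blocks. ToyValue and collectToHoist are hypothetical names. As in the patch, the sketch does not deduplicate, so an operand shared by several users may be listed more than once.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy instruction: operands are indices of other instructions.
struct ToyValue {
  std::vector<int> Operands;
  bool InAft; // Defined in a block after the inner loop.
};

// Starting from the values that feed the header phis, follow operands
// transitively, but only through instructions that live in Aft. Reversing
// the visit order lists the deepest operands first, so for a simple
// dependency chain each definition precedes its user.
inline std::vector<int> collectToHoist(const std::vector<ToyValue> &Insts,
                                       std::vector<int> Worklist) {
  std::vector<int> Visited;
  while (!Worklist.empty()) {
    int I = Worklist.back();
    Worklist.pop_back();
    if (!Insts[I].InAft)
      continue; // Already available before the inner loop.
    Visited.push_back(I);
    for (int Op : Insts[I].Operands)
      Worklist.push_back(Op);
  }
  std::reverse(Visited.begin(), Visited.end());
  return Visited;
}
```

For a chain where value 2 uses value 1 and value 1 uses value 0, with only 1 and 2 in Aft, the result lists 1 before 2 and leaves 0 where it is.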

				/*
				This method performs Unroll and Jam. For a simple loop like:
				for (i = ..)
				  Fore(i)
				  for (j = ..)
				    SubLoop(i, j)
				  Aft(i)

				Instead of doing normal inner or outer unrolling, we do:
				for (i = .., i+=2)
				[Inline comment, Not Done] SjoerdMeijer: Blocks = outer loops? So perhaps something along the lines of: "So the outer loops are unrolled and the inner loops fused ("jammed")."
				[Inline comment, Not Done] dmgreen (author): I have tried to rewrite this a little.
				  Fore(i)
				  Fore(i+1)
				  for (j = ..)
				    SubLoop(i, j)
				    SubLoop(i+1, j)
				  Aft(i)
				  Aft(i+1)

				So the blocks are essentially outer unrolled and then fused ("jammed")
				together into a single inner loop. This can increase speed when there are
				loads in SubLoop that are invariant to i, as they become shared between the
				loops.

				We do this by spliting the blocks in the loop into Fore, Subloop and Aft.
				Normal Unroll code is used to copy each of these sets of blocks and the
				results are combined together into the final form above.

				isSafeTounrollAndFuse should be used prior to calling this to make sure the
				unrolling will be valid. isProfitable is also advisable.
				*/
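To make the comment above concrete, here is a hand-applied version of the transform at source level, as a sketch rather than what the pass emits. It assumes an unroll count of 2 and a simple dot-product nest; Matrix, rowDots and rowDotsJammed are hypothetical names. The load of `B[j]` does not depend on `i`, so after jamming it is shared by both copies of the inner-loop body, which is the benefit the comment describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Reference nest: Fore(i), an inner loop over j, then Aft(i).
inline std::vector<int> rowDots(const Matrix &A, const std::vector<int> &B) {
  std::vector<int> Out(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) {
    assert(A[i].size() >= B.size() && "each row must cover B");
    int Sum = 0;                               // Fore(i)
    for (std::size_t j = 0; j < B.size(); ++j)
      Sum += A[i][j] * B[j];                   // SubLoop(i, j)
    Out[i] = Sum;                              // Aft(i)
  }
  return Out;
}

// The same nest after unroll and jam by 2, written out by hand.
inline std::vector<int> rowDotsJammed(const Matrix &A,
                                      const std::vector<int> &B) {
  std::vector<int> Out(A.size());
  std::size_t i = 0;
  for (; i + 1 < A.size(); i += 2) {
    int Sum0 = 0;                              // Fore(i)
    int Sum1 = 0;                              // Fore(i+1)
    for (std::size_t j = 0; j < B.size(); ++j) {
      int Bj = B[j];                           // Invariant to i: loaded once.
      Sum0 += A[i][j] * Bj;                    // SubLoop(i, j)
      Sum1 += A[i + 1][j] * Bj;                // SubLoop(i+1, j)
    }
    Out[i] = Sum0;                             // Aft(i)
    Out[i + 1] = Sum1;                         // Aft(i+1)
  }
  for (; i < A.size(); ++i) {                  // Remainder iteration, if any.
    int Sum = 0;
    for (std::size_t j = 0; j < B.size(); ++j)
      Sum += A[i][j] * B[j];
    Out[i] = Sum;
  }
  return Out;
}
```

Both functions compute the same result; the jammed form is only legal here because the iterations of the outer loop are independent of each other, which is what the safety checks are meant to establish.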
				LoopUnrollResult
				llvm::UnrollAndJamLoop(Loop *L, unsigned Count, unsigned TripCount,
				unsigned TripMultiple, bool UnrollRemainder,
				LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
				AssumptionCache *AC, OptimizationRemarkEmitter *ORE) {

				// When we enter here we should already have checked isSafeToUnrollAndJam
				BasicBlock *Header = L->getHeader();
				assert(L->getSubLoops().size() == 1);
				Loop *SubLoop = *L->begin();

				// Don't enter the unroll code if there is nothing to do.
				if (TripCount == 0 && Count < 2) {
				DEBUG(dbgs() << "Won't unroll; almost nothing to do\n");
				return LoopUnrollResult::Unmodified;
				[Inline comment, Done] xbolva00: bool CompletelyUnroll = (Count == TripCount); maybe better?
				}

				assert(Count > 0);
				assert(TripMultiple > 0);
				assert(TripCount == 0 || TripCount % TripMultiple == 0);

				[Inline comment, Done] SjoerdMeijer: Nit: wont
				// Are we eliminating the loop control altogether?
				bool CompletelyUnroll = Count == TripCount;

				// We use the runtime remainder in cases where we don't know trip multiple
				if (TripMultiple == 1 || TripMultiple % Count != 0) {
				if (!UnrollRuntimeLoopRemainder(L, Count, false /*AllowExpensiveTripCount*/,
				/*UseEpilogRemainder*/ true,
				UnrollRemainder, LI, SE, DT, AC, true)) {
				DEBUG(dbgs() << "Wont unroll-and-jam; remainder loop could not be "
				"generated when assuming runtime trip count\n");
				return LoopUnrollResult::Unmodified;
				}
				}

				// Notify ScalarEvolution that the loop will be substantially changed,
				// if not outright eliminated.
				[Inline comment, Not Done] SjoerdMeijer: nit: don't need to capitalize this message?
				[Inline comment, Not Done] dmgreen (author): This is copied from loop unroll
				if (SE) {
				SE->forgetLoop(L);
				SE->forgetLoop(SubLoop);
				}

				using namespace ore;
				// Report the unrolling decision.
				if (CompletelyUnroll) {
				DEBUG(dbgs() << "COMPLETELY UNROLL AND JAMMING loop %" << Header->getName()
				<< " with trip count " << TripCount << "!\n");
				ORE->emit(OptimizationRemark(DEBUG_TYPE, "FullyUnrolled", L->getStartLoc(),
				L->getHeader())
				<< "completely unroll and jammed loop with "
				<< NV("UnrollCount", TripCount) << " iterations");
				[Inline comment, Not Done] SjoerdMeijer: same here?
				[Inline comment, Not Done] dmgreen (author): Same
				} else {
				auto DiagBuilder = [&]() {
				OptimizationRemark Diag(DEBUG_TYPE, "PartialUnrolled", L->getStartLoc(),
				L->getHeader());
				return Diag << "unroll and jammed loop by a factor of "
				<< NV("UnrollCount", Count);
				};

				DEBUG(dbgs() << "UNROLL AND JAMMING loop %" << Header->getName() << " by "
				<< Count);
				if (TripMultiple != 1) {
				DEBUG(dbgs() << " with " << TripMultiple << " trips per branch");
				ORE->emit([&]() {
				return DiagBuilder() << " with " << NV("TripMultiple", TripMultiple)
				<< " trips per branch";
				});
				} else {
				DEBUG(dbgs() << " with run-time trip count");
				ORE->emit([&]() { return DiagBuilder() << " with run-time trip count"; });
				}
				DEBUG(dbgs() << "!\n");
				}

				BasicBlock *Preheader = L->getLoopPreheader();
				BasicBlock *LatchBlock = L->getLoopLatch();
				BranchInst *BI = dyn_cast<BranchInst>(LatchBlock->getTerminator());
				assert(Preheader && LatchBlock && Header);
				assert(BI && !BI->isUnconditional());
				bool ContinueOnTrue = L->contains(BI->getSuccessor(0));
				BasicBlock *LoopExit = BI->getSuccessor(ContinueOnTrue);
				bool SubLoopContinueOnTrue = SubLoop->contains(
				SubLoop->getLoopLatch()->getTerminator()->getSuccessor(0));

				// Partition blocks in an outer/inner loop pair into blocks before and after
				// the loop
				std::vector<BasicBlock *> SubLoopBlocks;
				std::vector<BasicBlock *> ForeBlocks;
				std::vector<BasicBlock *> AftBlocks;
				partitionOuterLoopBlocks(L, SubLoop, ForeBlocks, SubLoopBlocks, AftBlocks,
				DT);

				// We keep track of the entering/first and exiting/last block of each
				// of Fore/SubLoop/Aft in each iteration. This helps make the stapling up of
				// blocks easier.
				std::vector<BasicBlock *> ForeBlocksFirst;
				std::vector<BasicBlock *> ForeBlocksLast;
				std::vector<BasicBlock *> SubLoopBlocksFirst;
				std::vector<BasicBlock *> SubLoopBlocksLast;
				std::vector<BasicBlock *> AftBlocksFirst;
				std::vector<BasicBlock *> AftBlocksLast;
				ForeBlocksFirst.push_back(Header);
				ForeBlocksLast.push_back(SubLoop->getLoopPreheader());
				SubLoopBlocksFirst.push_back(SubLoop->getHeader());
				SubLoopBlocksLast.push_back(SubLoop->getExitingBlock());
				AftBlocksFirst.push_back(SubLoop->getExitBlock());
				AftBlocksLast.push_back(L->getExitingBlock());
				// Maps Blocks[0] -> Blocks[It]
				ValueToValueMapTy LastValueMap;

				// Move any instructions from fore phi operands from AftBlocks into Fore.
				moveHeaderPhiOperandsToForeBlocks(
				Header, LatchBlock, SubLoop->getLoopPreheader()->getTerminator(),
				AftBlocks);

				// The current on-the-fly SSA update requires blocks to be processed in
				// reverse postorder so that LastValueMap contains the correct value at each
				// exit.
				LoopBlocksDFS DFS(L);
				DFS.perform(LI);
				// Stash the DFS iterators before adding blocks to the loop.
				LoopBlocksDFS::RPOIterator BlockBegin = DFS.beginRPO();
				LoopBlocksDFS::RPOIterator BlockEnd = DFS.endRPO();

				if (Header->getParent()->isDebugInfoForProfiling())
				for (BasicBlock *BB : L->getBlocks())
				for (Instruction &I : *BB)
				if (!isa<DbgInfoIntrinsic>(&I))
				if (const DILocation *DIL = I.getDebugLoc())
				I.setDebugLoc(DIL->cloneWithDuplicationFactor(Count));

				// Copy all blocks
				for (unsigned It = 1; It != Count; ++It) {
				std::vector<BasicBlock *> NewBlocks;
				// Maps Blocks[It] -> Blocks[It-1]
				DenseMap<Value *, Value *> PrevItValueMap;

				for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) {
				ValueToValueMapTy VMap;
				BasicBlock *New = CloneBasicBlock(*BB, VMap, "." + Twine(It));
				Header->getParent()->getBasicBlockList().push_back(New);

				if (containsBB(ForeBlocks, *BB)) {
				L->addBasicBlockToLoop(New, *LI);

				if (*BB == ForeBlocksFirst[0])
				ForeBlocksFirst.push_back(New);
				if (*BB == ForeBlocksLast[0])
				ForeBlocksLast.push_back(New);
				} else if (containsBB(SubLoopBlocks, *BB)) {
				SubLoop->addBasicBlockToLoop(New, *LI);

				if (*BB == SubLoopBlocksFirst[0])
				SubLoopBlocksFirst.push_back(New);
				if (*BB == SubLoopBlocksLast[0])
				SubLoopBlocksLast.push_back(New);
				} else if (containsBB(AftBlocks, *BB)) {
				L->addBasicBlockToLoop(New, *LI);

				if (*BB == AftBlocksFirst[0])
				AftBlocksFirst.push_back(New);
				if (*BB == AftBlocksLast[0])
				AftBlocksLast.push_back(New);
				} else {
				llvm_unreachable("BB being cloned should be in Fore/Sub/Aft");
				}

				// Update our running maps of newest clones
				PrevItValueMap[New] = (It == 1 ? *BB : LastValueMap[*BB]);
				LastValueMap[*BB] = New;
				for (ValueToValueMapTy::iterator VI = VMap.begin(), VE = VMap.end();
				VI != VE; ++VI) {
				PrevItValueMap[VI->second] =
				const_cast<Value *>(It == 1 ? VI->first : LastValueMap[VI->first]);
				LastValueMap[VI->first] = VI->second;
				}

				NewBlocks.push_back(New);

				// Update DomTree:
				if (*BB == ForeBlocksFirst[0])
				DT->addNewBlock(New, ForeBlocksLast[It - 1]);
				else if (*BB == SubLoopBlocksFirst[0])
				DT->addNewBlock(New, SubLoopBlocksLast[It - 1]);
				else if (*BB == AftBlocksFirst[0])
				DT->addNewBlock(New, AftBlocksLast[It - 1]);
				else {
				// Each set of blocks (Fore/Sub/Aft) will have the same
				// internal domtree structure.
				auto BBDomNode = DT->getNode(*BB);
				auto BBIDom = BBDomNode->getIDom();
				BasicBlock *OriginalBBIDom = BBIDom->getBlock();
				assert(OriginalBBIDom);
				assert(LastValueMap[cast<Value>(OriginalBBIDom)]);
				DT->addNewBlock(
				New, cast<BasicBlock>(LastValueMap[cast<Value>(OriginalBBIDom)]));
				}
				}

				// Remap all instructions in the most recent iteration
				for (BasicBlock *NewBlock : NewBlocks) {
				for (Instruction &I : *NewBlock) {
				::remapInstruction(&I, LastValueMap);
				if (auto *II = dyn_cast<IntrinsicInst>(&I))
				if (II->getIntrinsicID() == Intrinsic::assume)
				AC->registerAssumption(II);
				}
				}

				// Alter the ForeBlocks phi's, pointing them at the latest version of the
				// value from the previous iteration's phis
				for (PHINode &Phi : ForeBlocksFirst[It]->phis()) {
				Value *OldValue = Phi.getIncomingValueForBlock(AftBlocksLast[It]);
				assert(OldValue && "should have incoming edge from Aft[It]");
				Value *NewValue = OldValue;
				if (Value *PrevValue = PrevItValueMap[OldValue])
				NewValue = PrevValue;

				assert(Phi.getNumOperands() == 2);
				Phi.setIncomingBlock(0, ForeBlocksLast[It - 1]);
				Phi.setIncomingValue(0, NewValue);
				Phi.removeIncomingValue(1);
				}
				}

				// Now that all the basic blocks for the unrolled iterations are in place,
				// finish up connecting the blocks and phi nodes. At this point LastValueMap
				// is the last unrolled iterations values.

				// Update Phis in BB from OldBB to point to NewBB
				auto updatePHIBlocks = [](BasicBlock *BB, BasicBlock *OldBB,
				BasicBlock *NewBB) {
				for (PHINode &Phi : BB->phis()) {
				int I = Phi.getBasicBlockIndex(OldBB);
				Phi.setIncomingBlock(I, NewBB);
				}
				};
				// Update Phis in BB from OldBB to point to NewBB and use the latest value
				// from LastValueMap
				[Inline comment, Done] SjoerdMeijer: nit: don't need the brackets here
				auto updatePHIBlocksAndValues = [](BasicBlock *BB, BasicBlock *OldBB,
				BasicBlock *NewBB,
				ValueToValueMapTy &LastValueMap) {
				for (PHINode &Phi : BB->phis()) {
				for (unsigned b = 0; b < Phi.getNumIncomingValues(); ++b) {
				if (Phi.getIncomingBlock(b) == OldBB) {
				Value *OldValue = Phi.getIncomingValue(b);
				if (Value *LastValue = LastValueMap[OldValue]) {
				Phi.setIncomingValue(b, LastValue);
				}
				Phi.setIncomingBlock(b, NewBB);
				break;
				}
				}
				}
				};
				// Move all the phis from Src into Dest
				auto movePHIs = [](BasicBlock *Src, BasicBlock *Dest) {
				Instruction *insertPoint = Dest->getFirstNonPHI();
				while (PHINode *Phi = dyn_cast<PHINode>(Src->begin()))
				Phi->moveBefore(insertPoint);
				};

				// Update the PHI values outside the loop to point to the last block
				updatePHIBlocksAndValues(LoopExit, AftBlocksLast[0], AftBlocksLast.back(),
				LastValueMap);

				// Update ForeBlocks successors and phi nodes
				BranchInst *ForeTerm =
				cast<BranchInst>(ForeBlocksLast.back()->getTerminator());
				BasicBlock *Dest = SubLoopBlocksFirst[0];
				ForeTerm->setSuccessor(0, Dest);

				if (CompletelyUnroll) {
				while (PHINode *Phi = dyn_cast<PHINode>(ForeBlocksFirst[0]->begin())) {
				Phi->replaceAllUsesWith(Phi->getIncomingValueForBlock(Preheader));
				Phi->getParent()->getInstList().erase(Phi);
				[Inline comment, xbolva00, Done] Above you use a different style: for (unsigned It = 1; It != Count; ++It) { Can you make it the same for all occurrences?
				}
				} else {
				// Update the PHI values to point to the last aft block
				updatePHIBlocksAndValues(ForeBlocksFirst[0], AftBlocksLast[0],
				AftBlocksLast.back(), LastValueMap);
				}

				for (unsigned It = 1; It < Count; It++) {
				// Remap ForeBlock successors from previous iteration to this
				BranchInst *ForeTerm =
				cast<BranchInst>(ForeBlocksLast[It - 1]->getTerminator());
				BasicBlock *Dest = ForeBlocksFirst[It];
				ForeTerm->setSuccessor(0, Dest);
				}

				// Subloop successors and phis
				BranchInst *SubTerm =
				cast<BranchInst>(SubLoopBlocksLast.back()->getTerminator());
				SubTerm->setSuccessor(!SubLoopContinueOnTrue, SubLoopBlocksFirst[0]);
				SubTerm->setSuccessor(SubLoopContinueOnTrue, AftBlocksFirst[0]);
				updatePHIBlocks(SubLoopBlocksFirst[0], ForeBlocksLast[0],
				ForeBlocksLast.back());
				updatePHIBlocks(SubLoopBlocksFirst[0], SubLoopBlocksLast[0],
				SubLoopBlocksLast.back());

				for (unsigned It = 1; It < Count; It++) {
				// Replace the conditional branch of the previous iteration subloop
				// with an unconditional one to this one
				BranchInst *SubTerm =
				cast<BranchInst>(SubLoopBlocksLast[It - 1]->getTerminator());
				BranchInst::Create(SubLoopBlocksFirst[It], SubTerm);
				SubTerm->eraseFromParent();

				updatePHIBlocks(SubLoopBlocksFirst[It], ForeBlocksLast[It],
				ForeBlocksLast.back());
				updatePHIBlocks(SubLoopBlocksFirst[It], SubLoopBlocksLast[It],
				SubLoopBlocksLast.back());
				movePHIs(SubLoopBlocksFirst[It], SubLoopBlocksFirst[0]);
				}
				[Inline comment, xbolva00, Done] This line could be placed above the condition.

				// Aft blocks successors and phis
				if (CompletelyUnroll) {
				BranchInst *Term = cast<BranchInst>(AftBlocksLast.back()->getTerminator());
				BranchInst::Create(LoopExit, Term);
				Term->eraseFromParent();
				} else {
				BranchInst *Term = cast<BranchInst>(AftBlocksLast.back()->getTerminator());
				Term->setSuccessor(!ContinueOnTrue, ForeBlocksFirst[0]);
				}
				updatePHIBlocks(AftBlocksFirst[0], SubLoopBlocksLast[0],
				SubLoopBlocksLast.back());

				for (unsigned It = 1; It < Count; It++) {
				// Replace the conditional branch of the previous iteration's aft block
				// with an unconditional one to this one
				BranchInst *AftTerm =
				cast<BranchInst>(AftBlocksLast[It - 1]->getTerminator());
				BranchInst::Create(AftBlocksFirst[It], AftTerm);
				AftTerm->eraseFromParent();

				updatePHIBlocks(AftBlocksFirst[It], SubLoopBlocksLast[It],
				SubLoopBlocksLast.back());
				movePHIs(AftBlocksFirst[It], AftBlocksFirst[0]);
				}

				// Dominator Tree. Remove the old links between Fore, Sub and Aft, adding the
				// new ones required.
				if (Count != 1) {
				SmallVector<DominatorTree::UpdateType, 4> DTUpdates;
				DTUpdates.emplace_back(DominatorTree::UpdateKind::Delete, ForeBlocksLast[0],
				SubLoopBlocksFirst[0]);
				DTUpdates.emplace_back(DominatorTree::UpdateKind::Delete,
				SubLoopBlocksLast[0], AftBlocksFirst[0]);

				DTUpdates.emplace_back(DominatorTree::UpdateKind::Insert,
				ForeBlocksLast.back(), SubLoopBlocksFirst[0]);
				DTUpdates.emplace_back(DominatorTree::UpdateKind::Insert,
				SubLoopBlocksLast.back(), AftBlocksFirst[0]);
				DT->applyUpdates(DTUpdates);
				}

				if (UnrollVerifyDomtree)
				DT->verifyDomTree();

				// Merge adjacent basic blocks, if possible.
				SmallPtrSet<BasicBlock *, 16> MergeBlocks;
				MergeBlocks.insert(ForeBlocksLast.begin(), ForeBlocksLast.end());
				MergeBlocks.insert(SubLoopBlocksLast.begin(), SubLoopBlocksLast.end());
				MergeBlocks.insert(AftBlocksLast.begin(), AftBlocksLast.end());
				SmallPtrSet<Loop *, 4> ForgottenLoops;
				while (!MergeBlocks.empty()) {
				BasicBlock *BB = *MergeBlocks.begin();
				BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
				if (Term && Term->isUnconditional() && L->contains(Term->getSuccessor(0))) {
				BasicBlock *Dest = Term->getSuccessor(0);
				if (BasicBlock *Fold =
				foldBlockIntoPredecessor(Dest, LI, SE, ForgottenLoops, DT)) {
				// Don't remove BB and add Fold as they are the same BB
				assert(Fold == BB);
				MergeBlocks.erase(Dest);
				} else
				MergeBlocks.erase(BB);
				} else
				MergeBlocks.erase(BB);
				}

				[Inline comment, SjoerdMeijer, Done] typo: shouldnt
				[Inline comment, SjoerdMeijer, Not Done] If you move this to after the asserts (the other NDEBUG block), can you then move this statement: Loop *OuterL = L->getParentLoop(); to inside the other #ifndef NDEBUG block? Thus we avoid having just one statement hash defined here.
				// At this point, the code is well formed. We now do a quick sweep over the
				// inserted code, doing constant propagation and dead code elimination as we
				// go.
				simplifyLoopAfterUnroll(SubLoop, true, LI, SE, DT, AC);
				simplifyLoopAfterUnroll(L, !CompletelyUnroll && Count > 1, LI, SE, DT, AC);

				NumCompletelyUnrolledAndJammed += CompletelyUnroll;
				++NumUnrolledAndJammed;

				Loop *OuterL = L->getParentLoop();
				// Update LoopInfo if the loop is completely removed.
				if (CompletelyUnroll)
				LI->erase(L);

				// We shouldn't have done anything to break loop simplify form or LCSSA.
				Loop *OutestLoop = OuterL ? OuterL : (!CompletelyUnroll ? L : SubLoop);
				assert(OutestLoop->isRecursivelyLCSSAForm(DT, LI));
				if (!CompletelyUnroll)
				assert(L->isLoopSimplifyForm());
				assert(SubLoop->isLoopSimplifyForm());
				DT->verifyDomTree();

				return CompletelyUnroll ? LoopUnrollResult::FullyUnrolled
				: LoopUnrollResult::PartiallyUnrolled;
				}

				static bool getLoadsAndStores(std::vector<BasicBlock *> &Blocks,
				SmallVector<Value *, 4> &MemInstr) {
				// Scan the BBs and collect legal loads and stores.
				// Returns false if non-simple loads/stores are found.
				for (BasicBlock *BB : Blocks) {
				for (Instruction &I : *BB) {
				if (auto *Ld = dyn_cast<LoadInst>(&I)) {
				if (!Ld->isSimple())
				return false;
				MemInstr.push_back(&I);
				} else if (auto *St = dyn_cast<StoreInst>(&I)) {
				if (!St->isSimple())
				return false;
				MemInstr.push_back(&I);
				} else if (I.mayReadOrWriteMemory()) {
				return false;
				}
				}
				}
				return true;
				}

				static bool checkDependencies(SmallVector<Value *, 4> &Earlier,
				SmallVector<Value *, 4> &Later,
				unsigned LoopDepth, DependenceInfo &DI) {
				// Use DA to check for dependencies between loads and
				// stores that make unroll and jam invalid
				for (Value *I : Earlier) {
				for (Value *J : Later) {
				Instruction *Src = cast<Instruction>(I);
				Instruction *Dst = cast<Instruction>(J);
				if (Src == Dst)
				continue;
				// Ignore Input dependencies.
				if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
				continue;

				// Track dependencies, and if we find them take a conservative approach
				// by allowing only = or > (not <), although some < would be safe
				// (depending upon unroll width).
				// FIXME: Allow < so long as distance is less than unroll width
				if (auto D = DI.depends(Src, Dst, true)) {
				assert(D->isOrdered() && "Expected an output, flow or anti dep.");
				unsigned Levels = D->getLevels();

				if (D->isConfused())
				return false;
				for (unsigned II = LoopDepth; II <= Levels; ++II) {
				const SCEV *Distance = D->getDistance(II);
				const SCEVConstant *SCEVConst =
				dyn_cast_or_null<SCEVConstant>(Distance);
				if (SCEVConst) {
				const ConstantInt *CI = SCEVConst->getValue();
				if (CI->isNegative())
				return false;
				[Inline comment, SjoerdMeijer, Done] Nit: TODOD
				} else
				return false;
				}
				}
				}
				}
				return true;
				}
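
The conservative rule in the pairwise checkDependencies above can be illustrated without any LLVM types. The following is only a sketch under stated assumptions: `Distance` and `distancesAllowJam` are hypothetical names invented for this illustration, a plain struct stands in for the SCEV distance returned by the dependence analysis, and loop depths are taken to be 1-based as in the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one entry of a dependence distance vector.
// known == false models a distance that is not a compile-time constant.
struct Distance {
  bool known;
  long value;
};

// Mirrors the rule used in the patch: from loopDepth (1-based) down to the
// innermost common level, every distance must be a known constant that is
// not negative. Anything else is treated as unsafe.
bool distancesAllowJam(const std::vector<Distance> &distances,
                       unsigned loopDepth) {
  assert(loopDepth >= 1 && "loop depths are 1-based");
  for (unsigned level = loopDepth; level <= distances.size(); ++level) {
    const Distance &d = distances[level - 1];
    if (!d.known)
      return false; // unknown distance: be conservative
    if (d.value < 0)
      return false; // '<' direction: rejected for now (see the FIXME)
  }
  return true;
}
```

Starting the scan at a deeper level, as the Sub-Sub check does with LoopDepth + 1, simply skips the outer entries of the vector.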

				static bool checkDependencies(Loop *L, std::vector<BasicBlock *> &ForeBlocks,
				std::vector<BasicBlock *> &SubLoopBlocks,
				std::vector<BasicBlock *> &AftBlocks,
				DependenceInfo &DI) {
				// TODO: This needs someone who knows what they are talking about to take a look.
				// Get all loads and stores for each group of blocks
				SmallVector<Value *, 4> ForeMemInstr;
				SmallVector<Value *, 4> SubLoopMemInstr;
				SmallVector<Value *, 4> AftMemInstr;
				if (!getLoadsAndStores(ForeBlocks, ForeMemInstr) ||
				!getLoadsAndStores(SubLoopBlocks, SubLoopMemInstr) ||
				!getLoadsAndStores(AftBlocks, AftMemInstr))
				return false;

				// Check for dependencies between any blocks that may change order
				unsigned LoopDepth = L->getLoopDepth();
				return checkDependencies(ForeMemInstr, SubLoopMemInstr, LoopDepth, DI) &&
				checkDependencies(ForeMemInstr, AftMemInstr, LoopDepth, DI) &&
				checkDependencies(SubLoopMemInstr, AftMemInstr, LoopDepth, DI) &&
				checkDependencies(SubLoopMemInstr, SubLoopMemInstr, LoopDepth + 1, DI);
				}

				bool llvm::isSafeToUnrollAndJam(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
				DependenceInfo &DI) {
				/* We currently handle outer loops like this:
				        |
				    ForeFirst    <----\    }
				     Blocks           |    } ForeBlocks
				    ForeLast          |    }
				        |             |
				    SubLoopFirst  <\  |    }
				     Blocks        |  |    } SubLoopBlocks
				    SubLoopLast   -/  |    }
				        |             |
				    AftFirst          |    }
				     Blocks           |    } AftBlocks
				    AftLast     ------/    }
				        |

				There are (theoretically) any number of blocks in ForeBlocks, SubLoopBlocks
				and AftBlocks, providing that there is one edge from Fores to
				SubLoops, one edge from SubLoops to Afts and a single outer loop exit (from
				Afts). In practice we currently limit Aft blocks to a single block, and
				limit things further in isProfitableToUnrollAndJam.

				Because of the way we rearrange basic blocks, we also require that
				the Fore blocks on all unrolled iterations are safe to move before the
				SubLoop blocks of all iterations. So we require that the phi node looping
				operands of ForeHeader can be moved to at least the end of ForeEnd, so that
				we can arrange cloned Fore Blocks before the subloop and match up Phi's
				correctly.

				i.e. The old order of blocks used to be F1 S1 S1 S1 A1 F2 S2 S2 S2 A2.
				It needs to be safe to transform this to F1 F2 S1 S2 S1 S2 S1 S2 A1 A2.

				There are then a number of checks along the lines of no calls, no
				exceptions, inner loop IV is consistent, etc.
				*/

				if (!AllowUnrollAndJam)
				return false;
				if (!L->isLoopSimplifyForm() || L->getSubLoops().size() != 1)
				return false;
				Loop *SubLoop = L->getSubLoops()[0];
				if (!SubLoop->isLoopSimplifyForm())
				return false;

				BasicBlock *PreHeader = L->getLoopPreheader();
				BasicBlock *Header = L->getHeader();
				BasicBlock *Latch = L->getLoopLatch();
				BasicBlock *Exit = L->getExitingBlock();
				BasicBlock *SubLoopHeader = SubLoop->getHeader();
				BasicBlock *SubLoopLatch = SubLoop->getLoopLatch();
				BasicBlock *SubLoopExit = SubLoop->getExitingBlock();

				if (Latch != Exit)
				return false;
				if (SubLoopLatch != SubLoopExit)
				return false;

				if (Header->hasAddressTaken() || SubLoopHeader->hasAddressTaken())
				return false;

				// Split blocks into Fore/SubLoop/Aft based on dominators
				std::vector<BasicBlock *> SubLoopBlocks;
				std::vector<BasicBlock *> ForeBlocks;
				std::vector<BasicBlock *> AftBlocks;
				if (!partitionOuterLoopBlocks(L, SubLoop, ForeBlocks, SubLoopBlocks,
				AftBlocks, &DT))
				return false;

				// Aft blocks may need to move instructions to fore blocks, which
				// becomes more difficult if there are multiple (potentially conditionally
				// executed) blocks. For now we just exclude loops with multiple aft blocks.
				if (AftBlocks.size() != 1)
				return false;

				// Check that the outer loop trip count is easily computable
				const SCEV *BECountSC = SE.getExitCount(L, Latch);
				if (isa<SCEVCouldNotCompute>(BECountSC) ||
				!BECountSC->getType()->isIntegerTy())
				return false;
				// Add 1 since the backedge count doesn't include the first loop iteration.
				const SCEV *TripCountSC =
				SE.getAddExpr(BECountSC, SE.getConstant(BECountSC->getType(), 1));
				if (isa<SCEVCouldNotCompute>(TripCountSC))
				return false;
				BranchInst *PreHeaderBR = cast<BranchInst>(PreHeader->getTerminator());
				const DataLayout &DL = Header->getModule()->getDataLayout();
				SCEVExpander Expander(SE, DL, "loop-unroll");
				if (Expander.isHighCostExpansion(TripCountSC, L, PreHeaderBR))
				return false;

				// Check inner loop IV is consistent between all iterations
				const SCEV *SubLoopBECountSC = SE.getExitCount(SubLoop, SubLoopLatch);
				if (isa<SCEVCouldNotCompute>(SubLoopBECountSC) ||
				!SubLoopBECountSC->getType()->isIntegerTy())
				return false;
				ScalarEvolution::LoopDisposition LD =
				SE.getLoopDisposition(SubLoopBECountSC, L);
				if (LD != ScalarEvolution::LoopInvariant)
				return false;

				// Check the loop safety info for exceptions.
				LoopSafetyInfo LSI;
				computeLoopSafetyInfo(&LSI, L);
				if (LSI.MayThrow)
				return false;

				// We've ruled out the easy stuff, and need to check that there
				// are no interdependencies which may prevent us from moving
				// the:
				// ForeBlocks before Subloop and AftBlocks.
				// Subloop before AftBlocks.
				// ForeBlock phi operands before the subloop

				// Make sure we can move all instructions we need to before the subloop
				SmallVector<Instruction *, 8> Worklist;
				SmallPtrSet<Instruction *, 8> Visited;
				for (auto &Phi : Header->phis()) {
				Value *V = Phi.getIncomingValueForBlock(Latch);
				if (Instruction *I = dyn_cast<Instruction>(V))
				Worklist.push_back(I);
				}
				while (!Worklist.empty()) {
				Instruction *I = Worklist.back();
				Worklist.pop_back();
				if (Visited.insert(I).second) {
				if (SubLoop->contains(I->getParent()))
				return false;
				if (containsBB(AftBlocks, I->getParent())) {
				// If we hit a phi node in afts we know we are done (probably LCSSA)
				if (isa<PHINode>(I))
				return false;
				if (I->mayHaveSideEffects() || I->mayReadOrWriteMemory())
				return false;
				for (auto &U : I->operands())
				if (Instruction *II = dyn_cast<Instruction>(U))
				Worklist.push_back(II);
				}
				}
				}

				// Check for memory dependencies which prohibit the unrolling
				// we are doing. Because of the way we are unrolling Fore/Sub/Aft
				// blocks, we need to check there are no dependencies between
				// Fore-Sub, Fore-Aft, Sub-Aft and Sub-Sub.
				if (!checkDependencies(L, ForeBlocks, SubLoopBlocks, AftBlocks, DI))
				return false;

				return true;
				}
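
The block reordering described in the comment at the top of isSafeToUnrollAndJam (F1 S1 S1 S1 A1 F2 ... becoming F1 F2 S1 S2 ... A1 A2) can be reproduced with a small standalone sketch. The function names here are hypothetical and only model the execution order of Fore (F), SubLoop (S) and Aft (A) blocks; the jammed version assumes the outer loop is unrolled by its full trip count, so there is no remainder loop.

```cpp
#include <cassert>
#include <string>

// Execution order of the original nest: for each outer iteration i, run
// F_i, then the inner body S_i innerTrips times, then A_i.
std::string originalOrder(unsigned outerTrips, unsigned innerTrips) {
  std::string out;
  for (unsigned i = 1; i <= outerTrips; ++i) {
    out += "F" + std::to_string(i) + " ";
    for (unsigned j = 0; j < innerTrips; ++j)
      out += "S" + std::to_string(i) + " ";
    out += "A" + std::to_string(i) + " ";
  }
  if (!out.empty())
    out.pop_back(); // drop the trailing space
  return out;
}

// Execution order after unroll and jam by `count` outer iterations: all
// Fore copies first, then one jammed inner loop that interleaves the
// SubLoop copies, then all Aft copies.
std::string jammedOrder(unsigned count, unsigned innerTrips) {
  std::string out;
  for (unsigned i = 1; i <= count; ++i)
    out += "F" + std::to_string(i) + " ";
  for (unsigned j = 0; j < innerTrips; ++j)
    for (unsigned i = 1; i <= count; ++i)
      out += "S" + std::to_string(i) + " ";
  for (unsigned i = 1; i <= count; ++i)
    out += "A" + std::to_string(i) + " ";
  if (!out.empty())
    out.pop_back(); // drop the trailing space
  return out;
}
```

Comparing the two strings makes it visible which blocks change relative order: F2 moves ahead of every S1 and A1, and A1 moves behind every S2, which is why the Fore-Sub, Fore-Aft, Sub-Aft and Sub-Sub dependencies are the ones checked.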

				/*
				Attempts to work out if it will be profitable to unroll-and-jam, as opposed to
				normal unrolling. Also calls isSafeToUnrollAndJam to ensure we can unroll and
				jam. Returns true if unroll-and-jam is expected to be profitable. May alter UP.Count.
				*/
				bool llvm::isProfitableToUnrollAndJam(
				Loop *L, const TargetTransformInfo &TTI, ScalarEvolution &SE,
				DominatorTree &DT, DependenceInfo &DI,
				TargetTransformInfo::UnrollingPreferences &UP) {
				if (AlwaysUnrollAndJam)
				return isSafeToUnrollAndJam(L, SE, DT, DI);

				// First up we do some very quick checks to see if the structure is anything
				// like what we want. Only unroll and jam a two-deep nest: an outer loop with
				// exactly one subloop, which itself has no subloops.
				if (L->getSubLoops().size() != 1)
				return false;
				Loop *SubLoop = L->getSubLoops()[0];
				if (!SubLoop->getSubLoops().empty())
				return false;
				// Also limit to single-basic-block inner loops, as a proxy for simple
				// subloops
				if (SubLoop->getBlocks().size() != 1)
				return false;

				// Check that unroll and jam is safe
				if (!isSafeToUnrollAndJam(L, SE, DT, DI))
				return false;

				// Check if the inner loop count is constant. In this case we should just
				// unroll the inner loop.
				// FIXME: this isn't always true when the inner loop is not completely
				// unrolled
				const SCEV *BECount = SE.getExitCount(SubLoop, SubLoop->getLoopLatch());
				if (isa<SCEVConstant>(BECount))
				return false;

				// We ideally want a load that after unrollandjam is shared between inner
				// loop iterations, i.e. is invariant to the outer loop.
				// We also measure the size of the inner loop and ensure it's below some
				// threshold.
				int numInvariant = 0;
				unsigned Cost = 0;
				unsigned Threshold = UP.UnrollAndJamThreshold;
				for (BasicBlock *BB : SubLoop->getBlocks()) {
				for (Instruction &I : *BB) {
				if (auto *Ld = dyn_cast<LoadInst>(&I)) {
				Value *V = Ld->getPointerOperand();
				const SCEV *LSCEV = SE.getSCEVAtScope(V, L);
				if (SE.isLoopInvariant(LSCEV, L))
				numInvariant++;
				}

				SmallVector<const Value *, 4> Operands(I.value_op_begin(),
				I.value_op_end());
				Cost += TTI.getUserCost(&I, Operands);
				if (Cost >= Threshold)
				return false;
				}
				}
				if (numInvariant == 0)
				return false;

				// Very (very) rough measure of register pressure.
				while (Cost * UP.Count > Threshold)
				UP.Count--;
				if (UP.Count <= 1)
				return false;

				// FIXME: May be useful in more general situations.
				return true;
				}
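
The "very rough measure of register pressure" at the end of isProfitableToUnrollAndJam is a simple clamp of the unroll count against the threshold. A minimal sketch of that arithmetic, using hypothetical function names and plain unsigned values in place of UP.Count and UP.UnrollAndJamThreshold:

```cpp
#include <cassert>

// Lower the unroll count until cost * count fits within the threshold,
// following the same while loop as the patch. The result may reach 0 or 1,
// which the caller treats as "not worth doing".
unsigned clampUnrollCount(unsigned cost, unsigned count, unsigned threshold) {
  while (cost * count > threshold)
    --count;
  return count;
}

// Unroll and jam is only considered when at least two copies fit.
bool countIsWorthJamming(unsigned cost, unsigned count, unsigned threshold) {
  return clampUnrollCount(cost, count, threshold) > 1;
}
```

For example, with an inner loop cost of 10, a requested count of 4 and a threshold of 25, the count is lowered to 2, so the loop would still be jammed, just less aggressively.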

lib/Transforms/Utils/LoopUtils.cpp

#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/LoopPass.h"		#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"		#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"
#include "llvm/Analysis/ScalarEvolutionExpander.h"		#include "llvm/Analysis/ScalarEvolutionExpander.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"		#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Dominators.h"		#include "llvm/IR/Dominators.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/ValueHandle.h"		#include "llvm/IR/ValueHandle.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
		void llvm::propagateIRFlags(Value *I, ArrayRef<Value *> VL, Value *OpValue) {
for (auto *V : VL) {		for (auto *V : VL) {
auto *Instr = dyn_cast<Instruction>(V);		auto *Instr = dyn_cast<Instruction>(V);
if (!Instr)		if (!Instr)
continue;		continue;
if (OpValue == nullptr || Opcode == Instr->getOpcode())		if (OpValue == nullptr || Opcode == Instr->getOpcode())
VecOp->andIRFlags(V);		VecOp->andIRFlags(V);
}		}
}		}

		/// Computes loop safety information, checks loop body & header
		/// for the possibility of may throw exception.
		///
		void llvm::computeLoopSafetyInfo(LoopSafetyInfo *SafetyInfo, Loop *CurLoop) {
		assert(CurLoop != nullptr && "CurLoop cant be null");
		BasicBlock *Header = CurLoop->getHeader();
		// Setting default safety values.
		SafetyInfo->MayThrow = false;
		SafetyInfo->HeaderMayThrow = false;
		// Iterate over header and compute safety info.
		for (BasicBlock::iterator I = Header->begin(), E = Header->end();
		(I != E) && !SafetyInfo->HeaderMayThrow; ++I)
		SafetyInfo->HeaderMayThrow |=
		!isGuaranteedToTransferExecutionToSuccessor(&*I);

		SafetyInfo->MayThrow = SafetyInfo->HeaderMayThrow;
		// Iterate over loop instructions and compute safety info.
		// Skip header as it has been computed and stored in HeaderMayThrow.
		// The first block in loopinfo.Blocks is guaranteed to be the header.
		assert(Header == *CurLoop->getBlocks().begin() &&
		"First block must be header");
		for (Loop::block_iterator BB = std::next(CurLoop->block_begin()),
		BBE = CurLoop->block_end();
		(BB != BBE) && !SafetyInfo->MayThrow; ++BB)
		for (BasicBlock::iterator I = (*BB)->begin(), E = (*BB)->end();
		(I != E) && !SafetyInfo->MayThrow; ++I)
		SafetyInfo->MayThrow |= !isGuaranteedToTransferExecutionToSuccessor(&*I);

		// Compute funclet colors if we might sink/hoist in a function with a funclet
		// personality routine.
		Function *Fn = CurLoop->getHeader()->getParent();
		if (Fn->hasPersonalityFn())
		if (Constant *PersonalityFn = Fn->getPersonalityFn())
		if (isFuncletEHPersonality(classifyEHPersonality(PersonalityFn)))
		SafetyInfo->BlockColors = colorEHFunclets(*Fn);
		}

test/Other/new-pm-defaults.ll

	; CHECK-O-NEXT: Running analysis: BranchProbabilityAnalysis			; CHECK-O-NEXT: Running analysis: BranchProbabilityAnalysis
	; CHECK-O-NEXT: Running pass: LoopLoadEliminationPass			; CHECK-O-NEXT: Running pass: LoopLoadEliminationPass
	; CHECK-O-NEXT: Running analysis: LoopAccessAnalysis			; CHECK-O-NEXT: Running analysis: LoopAccessAnalysis
	; CHECK-O-NEXT: Running pass: InstCombinePass			; CHECK-O-NEXT: Running pass: InstCombinePass
	; CHECK-O-NEXT: Running pass: SimplifyCFGPass			; CHECK-O-NEXT: Running pass: SimplifyCFGPass
	; CHECK-O-NEXT: Running pass: SLPVectorizerPass			; CHECK-O-NEXT: Running pass: SLPVectorizerPass
	; CHECK-O-NEXT: Running pass: InstCombinePass			; CHECK-O-NEXT: Running pass: InstCombinePass
	; CHECK-O-NEXT: Running pass: LoopUnrollPass			; CHECK-O-NEXT: Running pass: LoopUnrollPass
				; CHECK-O-NEXT: Running analysis: DependenceAnalysis
	; CHECK-O-NEXT: Running analysis: OuterAnalysisManagerProxy			; CHECK-O-NEXT: Running analysis: OuterAnalysisManagerProxy
	; CHECK-O-NEXT: Running pass: InstCombinePass			; CHECK-O-NEXT: Running pass: InstCombinePass
	; CHECK-O-NEXT: Running pass: RequireAnalysisPass<{{.*}}OptimizationRemarkEmitterAnalysis			; CHECK-O-NEXT: Running pass: RequireAnalysisPass<{{.*}}OptimizationRemarkEmitterAnalysis
	; CHECK-O-NEXT: Running pass: FunctionToLoopPassAdaptor<{{.*}}LICMPass			; CHECK-O-NEXT: Running pass: FunctionToLoopPassAdaptor<{{.*}}LICMPass
	; CHECK-O-NEXT: Starting llvm::Function pass manager run.			; CHECK-O-NEXT: Starting llvm::Function pass manager run.
	; CHECK-O-NEXT: Running pass: LoopSimplifyPass			; CHECK-O-NEXT: Running pass: LoopSimplifyPass
	; CHECK-O-NEXT: Running pass: LCSSAPass			; CHECK-O-NEXT: Running pass: LCSSAPass
	; CHECK-O-NEXT: Finished llvm::Function pass manager run.			; CHECK-O-NEXT: Finished llvm::Function pass manager run.

test/Other/new-pm-thinlto-defaults.ll

	; CHECK-POSTLINK-O-NEXT: Running analysis: BranchProbabilityAnalysis			; CHECK-POSTLINK-O-NEXT: Running analysis: BranchProbabilityAnalysis
	; CHECK-POSTLINK-O-NEXT: Running pass: LoopLoadEliminationPass			; CHECK-POSTLINK-O-NEXT: Running pass: LoopLoadEliminationPass
	; CHECK-POSTLINK-O-NEXT: Running analysis: LoopAccessAnalysis			; CHECK-POSTLINK-O-NEXT: Running analysis: LoopAccessAnalysis
	; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass			; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass
	; CHECK-POSTLINK-O-NEXT: Running pass: SimplifyCFGPass			; CHECK-POSTLINK-O-NEXT: Running pass: SimplifyCFGPass
	; CHECK-POSTLINK-O-NEXT: Running pass: SLPVectorizerPass			; CHECK-POSTLINK-O-NEXT: Running pass: SLPVectorizerPass
	; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass			; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass
	; CHECK-POSTLINK-O-NEXT: Running pass: LoopUnrollPass			; CHECK-POSTLINK-O-NEXT: Running pass: LoopUnrollPass
				; CHECK-POSTLINK-O-NEXT: Running analysis: DependenceAnalysis
	; CHECK-POSTLINK-O-NEXT: Running analysis: OuterAnalysisManagerProxy			; CHECK-POSTLINK-O-NEXT: Running analysis: OuterAnalysisManagerProxy
	; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass			; CHECK-POSTLINK-O-NEXT: Running pass: InstCombinePass
	; CHECK-POSTLINK-O-NEXT: Running pass: RequireAnalysisPass<{{.*}}OptimizationRemarkEmitterAnalysis			; CHECK-POSTLINK-O-NEXT: Running pass: RequireAnalysisPass<{{.*}}OptimizationRemarkEmitterAnalysis
	; CHECK-POSTLINK-O-NEXT: Running pass: FunctionToLoopPassAdaptor<{{.*}}LICMPass			; CHECK-POSTLINK-O-NEXT: Running pass: FunctionToLoopPassAdaptor<{{.*}}LICMPass
	; CHECK-POSTLINK-O-NEXT: Starting llvm::Function pass manager run			; CHECK-POSTLINK-O-NEXT: Starting llvm::Function pass manager run
	; CHECK-POSTLINK-O-NEXT: Running pass: LoopSimplifyPass			; CHECK-POSTLINK-O-NEXT: Running pass: LoopSimplifyPass
	; CHECK-POSTLINK-O-NEXT: Running pass: LCSSAPass			; CHECK-POSTLINK-O-NEXT: Running pass: LCSSAPass
	; CHECK-POSTLINK-O-NEXT: Finished llvm::Function pass manager run			; CHECK-POSTLINK-O-NEXT: Finished llvm::Function pass manager run

test/Other/pass-pipelines.ll

	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; FIXME: We shouldn't be pulling out to simplify-cfg and instcombine and			; FIXME: We shouldn't be pulling out to simplify-cfg and instcombine and
	; causing new loop pass managers.			; causing new loop pass managers.
	; CHECK-O2: Simplify the CFG			; CHECK-O2: Simplify the CFG
	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; CHECK-O2: Combine redundant instructions			; CHECK-O2: Combine redundant instructions
	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; CHECK-O2: Loop Pass Manager			; CHECK-O2: Loop Pass Manager
				; FIXME: DA is adding another loop pass manager. Fix this.
				; CHECK-O2: Loop Pass Manager
	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; FIXME: It isn't clear that we need yet another loop pass pipeline			; FIXME: It isn't clear that we need yet another loop pass pipeline
	; and run of LICM here.			; and run of LICM here.
	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; CHECK-O2: Loop Pass Manager			; CHECK-O2: Loop Pass Manager
	; CHECK-O2-NEXT: Loop Invariant Code Motion			; CHECK-O2-NEXT: Loop Invariant Code Motion
	; CHECK-O2-NOT: Manager			; CHECK-O2-NOT: Manager
	; Next we break out of the main Function passes inside the CGSCC pipeline with			; Next we break out of the main Function passes inside the CGSCC pipeline with

test/Transforms/LoopUnroll/unroll-and-jam-disabled.ll

This file was added.

				; RUN: opt -loop-unroll -unroll-and-jam -always-unroll-and-jam -pass-remarks=loop-unroll < %s -S 2>&1 | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8m.main-arm-none-eabi"

				;; Common check for all tests. None should be unroll and jammed
				; CHECK-NOT: remark: {{.*}} unroll and jammed


				; CHECK-LABEL: disabled1
				; Tests for(i) { sum = A[i]; for(j) sum += B[j]; A[i+1] = sum; }
				; A[i] to A[i+1] dependency should block unrollandjam
				define void @disabled1(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp127 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp127, %cmp
				br i1 %or.cond, label %for.preheader, label %return

				for.preheader:
				br label %for.outer

				for.outer:
				%i.029 = phi i32 [ %add10, %for.latch ], [ 0, %for.preheader ]
				%b.028 = phi i32 [ %inc8, %for.latch ], [ 1, %for.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.029
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !5
				br label %for.inner

				for.inner:
				%j.026 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%sum1.025 = phi i32 [ %0, %for.outer ], [ %add, %for.inner ]
				%arrayidx6 = getelementptr inbounds i32, i32* %B, i32 %j.026
				%1 = load i32, i32* %arrayidx6, align 4, !tbaa !5
				%add = add i32 %1, %sum1.025
				%inc = add nuw i32 %j.026, 1
				%exitcond = icmp eq i32 %inc, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%arrayidx7 = getelementptr inbounds i32, i32* %A, i32 %b.028
				store i32 %add, i32* %arrayidx7, align 4, !tbaa !5
				%inc8 = add nuw nsw i32 %b.028, 1
				%add10 = add nuw nsw i32 %i.029, 1
				%exitcond30 = icmp eq i32 %add10, %I
				br i1 %exitcond30, label %return, label %for.outer

				return:
				ret void
				}
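
To see why the A[i] to A[i+1] dependency in disabled1 has to block the transform, the loop from the test comment can be written out in plain C++ and compared with a naively jammed version. This is a sketch only: the function names are hypothetical, B.size() stands in for the inner trip count J, A is assumed to hold at least I + 1 elements, and the jammed version assumes I is even so that no remainder loop is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The nest from the test comment:
//   for(i) { sum = A[i]; for(j) sum += B[j]; A[i+1] = sum; }
void originalLoop(std::vector<int> &A, const std::vector<int> &B,
                  std::size_t I) {
  for (std::size_t i = 0; i < I; ++i) {
    int sum = A[i];
    for (std::size_t j = 0; j < B.size(); ++j)
      sum += B[j];
    A[i + 1] = sum;
  }
}

// Naive unroll and jam by 2, ignoring the dependency. The load of A[i+1]
// for the second copy is hoisted above the store to A[i+1] made by the
// first copy, so it reads a stale value.
void naivelyJammedBy2(std::vector<int> &A, const std::vector<int> &B,
                      std::size_t I) {
  for (std::size_t i = 0; i + 1 < I; i += 2) {
    int sum0 = A[i];
    int sum1 = A[i + 1]; // stale: iteration i has not stored A[i+1] yet
    for (std::size_t j = 0; j < B.size(); ++j) {
      sum0 += B[j];
      sum1 += B[j];
    }
    A[i + 1] = sum0;
    A[i + 2] = sum1;
  }
}
```

With A = {1, 0, 0}, B = {1, 2} and I = 2, the original loop produces {1, 4, 7} while the naively jammed loop produces {1, 4, 3}, so the two are not equivalent.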


				; CHECK-LABEL: disabled2
				; Tests an incompatible block layout (for.outer jumps past for.inner)
				; FIXME: Make this work
				define void @disabled2(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp131 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp131, %cmp
				br i1 %or.cond, label %for.preheader, label %for.end14

				for.preheader:
				br label %for.outer

				for.outer:
				%i.032 = phi i32 [ %add13, %for.latch ], [ 0, %for.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i32 %i.032
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !5
				%tobool = icmp eq i32 %0, 0
				br i1 %tobool, label %for.latch, label %for.inner

				for.inner:
				%j.030 = phi i32 [ %inc, %for.inner ], [ 0, %for.outer ]
				%sum1.029 = phi i32 [ %sum1.1, %for.inner ], [ 0, %for.outer ]
				%arrayidx6 = getelementptr inbounds i32, i32* %B, i32 %j.030
				%1 = load i32, i32* %arrayidx6, align 4, !tbaa !5
				%tobool7 = icmp eq i32 %1, 0
				%sub = add i32 %sum1.029, 10
				%add = sub i32 %sub, %1
				%sum1.1 = select i1 %tobool7, i32 %sum1.029, i32 %add
				%inc = add nuw i32 %j.030, 1
				%exitcond = icmp eq i32 %inc, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%sum1.1.lcssa = phi i32 [ 0, %for.outer ], [ %sum1.1, %for.inner ]
				%arrayidx11 = getelementptr inbounds i32, i32* %A, i32 %i.032
				store i32 %sum1.1.lcssa, i32* %arrayidx11, align 4, !tbaa !5
				%add13 = add nuw i32 %i.032, 1
				%exitcond33 = icmp eq i32 %add13, %I
				br i1 %exitcond33, label %for.end14, label %for.outer

				for.end14:
				ret void
				}



				; CHECK-LABEL: disabled3
				; Tests loop-carried dependencies in an array S
				define void @disabled3(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%S = alloca [4 x i32], align 4
				%cmp = icmp eq i32 %J, 0
				br i1 %cmp, label %return, label %if.end

				if.end:
				%0 = bitcast [4 x i32]* %S to i8*
				call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %0) #2
				%cmp128 = icmp eq i32 %I, 0
				br i1 %cmp128, label %for.cond.cleanup, label %for.preheader

				for.preheader:
				%arrayidx9 = getelementptr inbounds [4 x i32], [4 x i32]* %S, i32 0, i32 0
				br label %for.outer

				for.cond.cleanup:
				call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %0) #2
				br label %return

				for.outer:
				%i.029 = phi i32 [ 0, %for.preheader ], [ %add12, %for.latch ]
				br label %for.inner

				for.inner:
				%j.027 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i32 %j.027
				%l2 = load i32, i32* %arrayidx, align 4, !tbaa !5
				%add = add i32 %j.027, %i.029
				%rem = urem i32 %add, %J
				%arrayidx6 = getelementptr inbounds i32, i32* %B, i32 %rem
				%l3 = load i32, i32* %arrayidx6, align 4, !tbaa !5
				%mul = mul i32 %l3, %l2
				%rem7 = urem i32 %j.027, 3
				%arrayidx8 = getelementptr inbounds [4 x i32], [4 x i32]* %S, i32 0, i32 %rem7
				store i32 %mul, i32* %arrayidx8, align 4, !tbaa !5
				%inc = add nuw i32 %j.027, 1
				%exitcond = icmp eq i32 %inc, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%l1 = load i32, i32* %arrayidx9, align 4, !tbaa !5
				%arrayidx10 = getelementptr inbounds i32, i32* %A, i32 %i.029
				store i32 %l1, i32* %arrayidx10, align 4, !tbaa !5
				%add12 = add nuw i32 %i.029, 1
				%exitcond31 = icmp eq i32 %add12, %I
				br i1 %exitcond31, label %for.cond.cleanup, label %for.outer

				return:
				ret void
				}


				; CHECK-LABEL: disabled4
				; Inner loop induction variable is not consistent
				; i.e. for(i = 0..n) for (j = 0..i) sum+=B[j]
				define void @disabled4(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ugt i32 %I, 1
				%or.cond = and i1 %cmp122, %cmp
				br i1 %or.cond, label %for.preheader, label %for.end9

				for.preheader:
				br label %for.outer

				for.outer:
				%indvars.iv = phi i32 [ %indvars.iv.next, %for.latch ], [ 1, %for.preheader ]
				br label %for.inner

				for.inner:
				%j.021 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%sum1.020 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i32 %j.021
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !5
				%add = add i32 %0, %sum1.020
				%inc = add nuw i32 %j.021, 1
				%exitcond = icmp eq i32 %inc, %indvars.iv
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
				store i32 %add, i32* %arrayidx6, align 4, !tbaa !5
				%indvars.iv.next = add nuw i32 %indvars.iv, 1
				%exitcond24 = icmp eq i32 %indvars.iv.next, %I
				br i1 %exitcond24, label %for.end9, label %for.outer

				for.end9:
				ret void
				}


				; CHECK-LABEL: disabled5
				; Test odd uses of phi nodes where the outer IV cannot be moved into Fore as it hits a PHI
				@f = hidden local_unnamed_addr global i32 0, align 4
				define i32 @disabled5() local_unnamed_addr #0 {
				entry:
				%f.promoted10 = load i32, i32* @f, align 4, !tbaa !5
				br label %for.outer

				for.outer:
				%0 = phi i32 [ %f.promoted10, %entry ], [ 2, %for.latch ]
				%d.018 = phi i16 [ 0, %entry ], [ %odd.lcssa, %for.latch ]
				%inc5.sink9 = phi i32 [ 2, %entry ], [ %inc5, %for.latch ]
				br label %for.inner

				for.inner:
				%1 = phi i32 [ %0, %for.outer ], [ 2, %for.inner ]
				%inc.sink8 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%inc = add nuw nsw i32 %inc.sink8, 1
				%exitcond = icmp ne i32 %inc, 7
				br i1 %exitcond, label %for.inner, label %for.latch

				for.latch:
				%.lcssa = phi i32 [ %1, %for.inner ]
				%odd.lcssa = phi i16 [ 1, %for.inner ]
				%inc5 = add nuw nsw i32 %inc5.sink9, 1
				%exitcond11 = icmp ne i32 %inc5, 7
				br i1 %exitcond11, label %for.outer, label %for.end

				for.end:
				%.lcssa.lcssa = phi i32 [ %.lcssa, %for.latch ]
				%inc.lcssa.lcssa = phi i32 [ 7, %for.latch ]
				ret i32 0
				}


				; CHECK-LABEL: disabled6
				; There is a dependency in here, between @d6 and %0 (=@f6)
				@d6 = hidden global i16 5, align 2
				@f6 = hidden local_unnamed_addr global i16* @d6, align 4
				define i32 @disabled6() local_unnamed_addr #1 {
				entry:
				store i16 1, i16* @d6, align 2
				%0 = load i16, i16* @f6, align 4
				br label %for.body.i

				for.body.i:
				%inc8.sink14.i = phi i16 [ 1, %entry ], [ %inc8.i, %for.cond.cleanup.i ]
				%1 = load i16, i16* %0, align 2
				br label %for.body6.i

				for.cond.cleanup.i:
				%inc8.i = add nuw nsw i16 %inc8.sink14.i, 1
				store i16 %inc8.i, i16* @d6, align 2
				%cmp.i = icmp ult i16 %inc8.i, 6
				br i1 %cmp.i, label %for.body.i, label %test.exit

				for.body6.i:
				%c.013.i = phi i32 [ 0, %for.body.i ], [ %inc.i, %for.body6.i ]
				%inc.i = add nuw nsw i32 %c.013.i, 1
				%exitcond.i = icmp eq i32 %inc.i, 7
				br i1 %exitcond.i, label %for.cond.cleanup.i, label %for.body6.i

				test.exit:
				%conv2.i = sext i16 %1 to i32
				ret i32 0
				}



				; CHECK-LABEL: disabled7
				; Has negative output dependency
				define void @disabled7(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp127 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp127, %cmp
				br i1 %or.cond, label %for.body.preheader, label %for.end12

				for.body.preheader:
				br label %for.body

				for.body:
				%i.028 = phi i32 [ %add11, %for.cond3.for.cond.cleanup5_crit_edge ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.028
				store i32 0, i32* %arrayidx, align 4, !tbaa !5
				%sub = add i32 %i.028, -1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i32 %sub
				store i32 2, i32* %arrayidx2, align 4, !tbaa !5
				br label %for.body6

				for.cond3.for.cond.cleanup5_crit_edge:
				store i32 %add, i32* %arrayidx, align 4, !tbaa !5
				%add11 = add nuw i32 %i.028, 1
				%exitcond29 = icmp eq i32 %add11, %I
				br i1 %exitcond29, label %for.end12, label %for.body

				for.body6:
				%0 = phi i32 [ 0, %for.body ], [ %add, %for.body6 ]
				%j.026 = phi i32 [ 0, %for.body ], [ %add9, %for.body6 ]
				%arrayidx7 = getelementptr inbounds i32, i32* %B, i32 %j.026
				%1 = load i32, i32* %arrayidx7, align 4, !tbaa !5
				%add = add i32 %1, %0
				%add9 = add nuw i32 %j.026, 1
				%exitcond = icmp eq i32 %add9, %J
				br i1 %exitcond, label %for.cond3.for.cond.cleanup5_crit_edge, label %for.body6

				for.end12:
				ret void
				}


				; CHECK-LABEL: disabled8
				; Same as above with an extra outer loop nest
				define void @disabled8(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp eq i32 %J, 0
				%cmp335 = icmp eq i32 %I, 0
				%or.cond = or i1 %cmp, %cmp335
				br i1 %or.cond, label %for.end18, label %for.body.preheader

				for.body.preheader:
				br label %for.body

				for.body:
				%x.037 = phi i32 [ %inc, %for.cond.cleanup4 ], [ 0, %for.body.preheader ]
				br label %for.outer

				for.cond.cleanup4:
				%inc = add nuw nsw i32 %x.037, 1
				%exitcond40 = icmp eq i32 %inc, 5
				br i1 %exitcond40, label %for.end18, label %for.body

				for.outer:
				%i.036 = phi i32 [ %add15, %for.latch ], [ 0, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.036
				store i32 0, i32* %arrayidx, align 4, !tbaa !5
				%sub = add i32 %i.036, -1
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i32 %sub
				store i32 2, i32* %arrayidx6, align 4, !tbaa !5
				br label %for.inner

				for.latch:
				store i32 %add, i32* %arrayidx, align 4, !tbaa !5
				%add15 = add nuw i32 %i.036, 1
				%exitcond38 = icmp eq i32 %add15, %I
				br i1 %exitcond38, label %for.cond.cleanup4, label %for.outer

				for.inner:
				%0 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				%j.034 = phi i32 [ 0, %for.outer ], [ %add13, %for.inner ]
				%arrayidx11 = getelementptr inbounds i32, i32* %B, i32 %j.034
				%1 = load i32, i32* %arrayidx11, align 4, !tbaa !5
				%add = add i32 %1, %0
				%add13 = add nuw i32 %j.034, 1
				%exitcond = icmp eq i32 %add13, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.end18:
				ret void
				}


				; CHECK-LABEL: disabled9
				; Can't prove no-alias between A and B
				define void @disabled9(i32 %I, i32 %J, i32* nocapture %A, i32* nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable10
				; Simple call
				declare void @f10(i32, i32) local_unnamed_addr #0
				define void @disable10(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				tail call void @f10(i32 %i.us, i32 %j.us) nounwind
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable11
				; volatile
				define void @disable11(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load volatile i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable12
				; Multiple aft blocks
				define void @disable12(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch3 ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%cmpl = icmp eq i32 %add.us.lcssa, 10
				br i1 %cmpl, label %for.latch2, label %for.latch3

				for.latch2:
				br label %for.latch3

				for.latch3:
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable13
				; Two subloops
				define void @disable13(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.inner2, label %for.inner

				for.inner2:
				%j.us2 = phi i32 [ 0, %for.inner ], [ %inc.us2, %for.inner2 ]
				%sum1.us2 = phi i32 [ 0, %for.inner ], [ %add.us2, %for.inner2 ]
				%arrayidx.us2 = getelementptr inbounds i32, i32* %B, i32 %j.us2
				%l0 = load i32, i32* %arrayidx.us2, align 4, !tbaa !5
				%add.us2 = add i32 %l0, %sum1.us2
				%inc.us2 = add nuw i32 %j.us2, 1
				%exitcond2 = icmp eq i32 %inc.us2, %J
				br i1 %exitcond2, label %for.latch, label %for.inner2

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner2 ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable14
				; Multiple exit blocks
				define void @disable14(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				%add8.us = add nuw i32 %i.us, 1
				%exitcond23 = icmp eq i32 %add8.us, %I
				br i1 %exitcond23, label %for.end.loopexit, label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable15
				; Latch != exit
				define void @disable15(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				br label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: disable16
				; The latch value %other, which feeds a phi in for.outer, cannot be moved before the inner loop
				define void @disable16(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				%otherphi = phi i32 [ %other, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				%loadarr = getelementptr inbounds i32, i32* %A, i32 %i.us
				%load = load i32, i32* %arrayidx6.us, align 4, !tbaa !5
				%other = add i32 %otherphi, %load
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}



				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.start.p0i8(i64, i8* nocapture) #1

				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.end.p0i8(i64, i8* nocapture) #1

				attributes #0 = { noinline norecurse nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "denormal-fp-math"="preserve-sign" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-m33" "target-features"="+d16,+dsp,+fp-armv8,+fp-only-sp,+hwdiv,+thumb-mode,-crc,-dotprod,-hwdiv-arm,-ras" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #1 = { argmemonly nounwind }

				!5 = !{!6, !6, i64 0}
				!6 = !{!"omnipotent char", !7, i64 0}
				!7 = !{!"Simple C/C++ TBAA"}
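
The `disabled7` case above ("Has negative output dependency") can be replayed outside LLVM to see why jamming would be unsafe there. The following is a minimal Python sketch, not part of the patch. The function names `original_order` and `jammed_by_2` are hypothetical, and the outer loop starts at 1 so that the `A[i-1]` store stays in bounds.

```python
def original_order(n, B):
    """Fore(i); inner loop; Aft(i), for i = 1..n, mirroring disabled7."""
    A = [None] * (n + 1)      # slot 0 absorbs the A[i-1] store at i = 1
    for i in range(1, n + 1):
        A[i] = 0              # Fore: store 0 to A[i]
        A[i - 1] = 2          # Fore: store 2 to A[i-1]
        total = 0
        for b in B:           # SubLoop: sum += B[j]
            total += b
        A[i] = total          # Aft: store sum to A[i]
    return A


def jammed_by_2(n, B):
    """Same loop with Fore(i) and Fore(i+1) placed above a shared inner loop.

    Assumes n is even, so no remainder loop is needed.
    """
    A = [None] * (n + 1)
    for i in range(1, n + 1, 2):
        A[i] = 0              # Fore(i)
        A[i - 1] = 2
        A[i + 1] = 0          # Fore(i+1), now ahead of Aft(i)
        A[i] = 2
        t0 = t1 = 0
        for b in B:           # jammed SubLoop bodies
            t0 += b
            t1 += b
        A[i] = t0             # Aft(i) overwrites the 2 stored by Fore(i+1)
        A[i + 1] = t1         # Aft(i+1)
    return A


print(original_order(4, [1, 2, 3]))  # [2, 2, 2, 2, 6]
print(jammed_by_2(4, [1, 2, 3]))     # [2, 6, 2, 6, 6]
```

The two orders disagree because the store to `A[i]` in Fore(i+1) must come after the store in Aft(i), and jamming reverses that order. This is the kind of dependency the safety checks are expected to reject.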

test/Transforms/LoopUnroll/unroll-and-jam-unprofitable.ll

This file was added.

				; RUN: opt -loop-unroll -unroll-and-jam -pass-remarks=loop-unroll < %s -S 2>&1 | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8m.main-arm-none-eabi"

				;; Common check for all tests. None should be unroll and jammed due to profitability
				; CHECK-NOT: remark: {{.*}} unroll and jammed


				; CHECK-LABEL: unprof1
				; Multiple inner loop blocks
				define void @unprof1(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner2 ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner2 ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				br label %for.inner2

				for.inner2:
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner2 ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: unprof2
				; Constant inner loop count
				define void @unprof2(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, 10
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: unprof3
				; Complex inner loop
				define void @unprof3(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%add.us0 = add i32 %0, %sum1.us
				%add.us1 = add i32 %0, %sum1.us
				%add.us2 = add i32 %0, %sum1.us
				%add.us3 = add i32 %0, %sum1.us
				%add.us4 = add i32 %0, %sum1.us
				%add.us5 = add i32 %0, %sum1.us
				%add.us6 = add i32 %0, %sum1.us
				%add.us7 = add i32 %0, %sum1.us
				%add.us8 = add i32 %0, %sum1.us
				%add.us9 = add i32 %0, %sum1.us
				%add.us10 = add i32 %0, %sum1.us
				%add.us11 = add i32 %0, %sum1.us
				%add.us12 = add i32 %0, %sum1.us
				%add.us13 = add i32 %0, %sum1.us
				%add.us14 = add i32 %0, %sum1.us
				%add.us15 = add i32 %0, %sum1.us
				%add.us16 = add i32 %0, %sum1.us
				%add.us17 = add i32 %0, %sum1.us
				%add.us18 = add i32 %0, %sum1.us
				%add.us19 = add i32 %0, %sum1.us
				%add.us20 = add i32 %0, %sum1.us
				%add.us21 = add i32 %0, %sum1.us
				%add.us22 = add i32 %0, %sum1.us
				%add.us23 = add i32 %0, %sum1.us
				%add.us24 = add i32 %0, %sum1.us
				%add.us25 = add i32 %0, %sum1.us
				%add.us26 = add i32 %0, %sum1.us
				%add.us27 = add i32 %0, %sum1.us
				%add.us28 = add i32 %0, %sum1.us
				%add.us29 = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; CHECK-LABEL: unprof4
				; No loop invariant loads
				define void @unprof4(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%j2 = add i32 %j.us, %i.us
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j2
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}


				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.start.p0i8(i64, i8* nocapture) #1

				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.end.p0i8(i64, i8* nocapture) #1

				attributes #0 = { noinline norecurse nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "denormal-fp-math"="preserve-sign" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-m33" "target-features"="+d16,+dsp,+fp-armv8,+fp-only-sp,+hwdiv,+thumb-mode,-crc,-dotprod,-hwdiv-arm,-ras" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #1 = { argmemonly nounwind }

				!5 = !{!6, !6, i64 0}
				!6 = !{!"omnipotent char", !7, i64 0}
				!7 = !{!"Simple C/C++ TBAA"}
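
One reading of `unprof4` ("No loop invariant loads") is that jamming pays off mainly when the copies of the inner loop body load the same addresses, as with `B[j]` in the other tests, and not when each copy loads a different one, as with `B[j + i]`. Below is a small Python sketch of that counting argument, assuming an unroll count of 4. `distinct_loads` is a hypothetical helper and does not model the pass's actual cost heuristic.

```python
def distinct_loads(index_of, i, j, count=4):
    """Count the distinct B indices read by `count` jammed copies of the
    inner loop body, for outer iterations i..i+count-1 at inner iteration j."""
    return len({index_of(i + k, j) for k in range(count)})


# B[j]: the index ignores i, so all four copies read the same element.
print(distinct_loads(lambda i, j: j, i=0, j=5))      # 1
# B[j + i]: each copy reads a different element, so nothing is shared.
print(distinct_loads(lambda i, j: j + i, i=0, j=5))  # 4
```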

test/Transforms/LoopUnroll/unroll-and-jam.ll

This file was added.

				; RUN: opt -basicaa -tbaa -loop-unroll -unroll-and-jam -always-unroll-and-jam < %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8m.main-arm-none-eabi"

				; CHECK-LABEL: test1
				; Tests for(i) { sum = 0; for(j) sum += B[j]; A[i] = sum; }
				; CHECK: entry:
				; CHECK: br i1 %or.cond, label %for.outer.preheader, label %for.end
				; CHECK: for.outer.preheader:
				; CHECK: br i1 %1, label %for.end.loopexit.unr-lcssa, label %for.outer.preheader.new
				; CHECK: for.outer.preheader.new:
				; CHECK: br label %for.outer
				; CHECK: for.outer:
				; CHECK: %i.us = phi i32 [ %add8.us.3, %for.latch ], [ 0, %for.outer.preheader.new ]
				; CHECK: %niter = phi i32 [ %unroll_iter, %for.outer.preheader.new ], [ %niter.nsub.3, %for.latch ]
				; CHECK: %add8.us = add nuw nsw i32 %i.us, 1
				; CHECK: %niter.nsub = sub i32 %niter, 1
				; CHECK: %add8.us.1 = add nuw nsw i32 %add8.us, 1
				; CHECK: %niter.nsub.1 = sub i32 %niter.nsub, 1
				; CHECK: %add8.us.2 = add nuw nsw i32 %add8.us.1, 1
				; CHECK: %niter.nsub.2 = sub i32 %niter.nsub.1, 1
				; CHECK: %add8.us.3 = add nuw i32 %add8.us.2, 1
				; CHECK: %niter.nsub.3 = sub i32 %niter.nsub.2, 1
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				; CHECK: %sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				; CHECK: %j.us.1 = phi i32 [ 0, %for.outer ], [ %inc.us.1, %for.inner ]
				; CHECK: %sum1.us.1 = phi i32 [ 0, %for.outer ], [ %add.us.1, %for.inner ]
				; CHECK: %j.us.2 = phi i32 [ 0, %for.outer ], [ %inc.us.2, %for.inner ]
				; CHECK: %sum1.us.2 = phi i32 [ 0, %for.outer ], [ %add.us.2, %for.inner ]
				; CHECK: %j.us.3 = phi i32 [ 0, %for.outer ], [ %inc.us.3, %for.inner ]
				; CHECK: %sum1.us.3 = phi i32 [ 0, %for.outer ], [ %add.us.3, %for.inner ]
				; CHECK: %arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				; CHECK: %2 = load i32, i32* %arrayidx.us, align 4, !tbaa !0
				; CHECK: %add.us = add i32 %2, %sum1.us
				; CHECK: %inc.us = add nuw i32 %j.us, 1
				; CHECK: %arrayidx.us.1 = getelementptr inbounds i32, i32* %B, i32 %j.us.1
				; CHECK: %3 = load i32, i32* %arrayidx.us.1, align 4, !tbaa !0
				; CHECK: %add.us.1 = add i32 %3, %sum1.us.1
				; CHECK: %inc.us.1 = add nuw i32 %j.us.1, 1
				; CHECK: %arrayidx.us.2 = getelementptr inbounds i32, i32* %B, i32 %j.us.2
				; CHECK: %4 = load i32, i32* %arrayidx.us.2, align 4, !tbaa !0
				; CHECK: %add.us.2 = add i32 %4, %sum1.us.2
				; CHECK: %inc.us.2 = add nuw i32 %j.us.2, 1
				; CHECK: %arrayidx.us.3 = getelementptr inbounds i32, i32* %B, i32 %j.us.3
				; CHECK: %5 = load i32, i32* %arrayidx.us.3, align 4, !tbaa !0
				; CHECK: %add.us.3 = add i32 %5, %sum1.us.3
				; CHECK: %inc.us.3 = add nuw i32 %j.us.3, 1
				; CHECK: %exitcond.3 = icmp eq i32 %inc.us.3, %J
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				; CHECK: %add.us.lcssa.1 = phi i32 [ %add.us.1, %for.inner ]
				; CHECK: %add.us.lcssa.2 = phi i32 [ %add.us.2, %for.inner ]
				; CHECK: %add.us.lcssa.3 = phi i32 [ %add.us.3, %for.inner ]
				; CHECK: %arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				; CHECK: store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !0
				; CHECK: %arrayidx6.us.1 = getelementptr inbounds i32, i32* %A, i32 %add8.us
				; CHECK: store i32 %add.us.lcssa.1, i32* %arrayidx6.us.1, align 4, !tbaa !0
				; CHECK: %arrayidx6.us.2 = getelementptr inbounds i32, i32* %A, i32 %add8.us.1
				; CHECK: store i32 %add.us.lcssa.2, i32* %arrayidx6.us.2, align 4, !tbaa !0
				; CHECK: %arrayidx6.us.3 = getelementptr inbounds i32, i32* %A, i32 %add8.us.2
				; CHECK: store i32 %add.us.lcssa.3, i32* %arrayidx6.us.3, align 4, !tbaa !0
				; CHECK: %niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				; CHECK: br i1 %niter.ncmp.3, label %for.end.loopexit.unr-lcssa.loopexit, label %for.outer
				; CHECK: for.end.loopexit.unr-lcssa.loopexit:
				; CHECK: %i.us.unr.ph = phi i32 [ %add8.us.3, %for.latch ]
				; CHECK: br label %for.end.loopexit.unr-lcssa
				; CHECK: for.end.loopexit.unr-lcssa:
				; CHECK: br i1 %lcmp.mod, label %for.outer.epil.preheader, label %for.end.loopexit
				; CHECK: for.outer.epil.preheader:
				; CHECK: br label %for.outer.epil
				; CHECK: for.outer.epil:
				; CHECK: br label %for.inner.epil
				; CHECK: for.inner.epil:
				; CHECK: br i1 %exitcond.epil, label %for.latch.epil, label %for.inner.epil
				; CHECK: for.latch.epil:
				; CHECK: br i1 %epil.iter.cmp, label %for.outer.epil.1, label %for.end.loopexit.epilog-lcssa
				; CHECK: for.end.loopexit.epilog-lcssa:
				; CHECK: br label %for.end.loopexit
				; CHECK: for.end.loopexit:
				; CHECK: br label %for.end
				; CHECK: for.end:
				; CHECK: ret void
				; CHECK: for.outer.epil.1:
				; CHECK: br label %for.inner.epil.1
				; CHECK: for.inner.epil.1:
				; CHECK: br i1 %exitcond.epil.1, label %for.latch.epil.1, label %for.inner.epil.1
				; CHECK: for.latch.epil.1:
				; CHECK: br i1 %epil.iter.cmp.1, label %for.outer.epil.2, label %for.end.loopexit.epilog-lcssa
				; CHECK: for.outer.epil.2:
				; CHECK: br label %for.inner.epil.2
				; CHECK: for.inner.epil.2:
				; CHECK: br i1 %exitcond.epil.2, label %for.latch.epil.2, label %for.inner.epil.2
				; CHECK: for.latch.epil.2:
				; CHECK: br label %for.end.loopexit.epilog-lcssa
				define void @test1(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				%add.us = add i32 %0, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}

				; CHECK-LABEL: test2
				; Tests for(i) { sum = A[i]; for(j) sum += B[j]; A[i] = sum; }
				; A[i] load/store dependency should not block unroll-and-jam
				; CHECK: for.outer:
				; CHECK: %i.us = phi i32 [ %add9.us.3, %for.latch ], [ 0, %for.outer.preheader.new ]
				; CHECK: %niter = phi i32 [ %unroll_iter, %for.outer.preheader.new ], [ %niter.nsub.3, %for.latch ]
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				; CHECK: %sum1.us = phi i32 [ %2, %for.outer ], [ %add.us, %for.inner ]
				; CHECK: %j.us.1 = phi i32 [ 0, %for.outer ], [ %inc.us.1, %for.inner ]
				; CHECK: %sum1.us.1 = phi i32 [ %3, %for.outer ], [ %add.us.1, %for.inner ]
				; CHECK: %j.us.2 = phi i32 [ 0, %for.outer ], [ %inc.us.2, %for.inner ]
				; CHECK: %sum1.us.2 = phi i32 [ %4, %for.outer ], [ %add.us.2, %for.inner ]
				; CHECK: %j.us.3 = phi i32 [ 0, %for.outer ], [ %inc.us.3, %for.inner ]
				; CHECK: %sum1.us.3 = phi i32 [ %5, %for.outer ], [ %add.us.3, %for.inner ]
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				; CHECK: %add.us.lcssa.1 = phi i32 [ %add.us.1, %for.inner ]
				; CHECK: %add.us.lcssa.2 = phi i32 [ %add.us.2, %for.inner ]
				; CHECK: %add.us.lcssa.3 = phi i32 [ %add.us.3, %for.inner ]
				; CHECK: br i1 %niter.ncmp.3, label %for.end10.loopexit.unr-lcssa.loopexit, label %for.outer
				; CHECK: for.end10.loopexit.unr-lcssa.loopexit:
				define void @test2(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp125 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp125
				br i1 %or.cond, label %for.outer.preheader, label %for.end10

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add9.us, %for.latch ], [ 0, %for.outer.preheader ]
				%arrayidx.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				%0 = load i32, i32* %arrayidx.us, align 4, !tbaa !5
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ %0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %B, i32 %j.us
				%1 = load i32, i32* %arrayidx6.us, align 4, !tbaa !5
				%add.us = add i32 %1, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				store i32 %add.us.lcssa, i32* %arrayidx.us, align 4, !tbaa !5
				%add9.us = add nuw i32 %i.us, 1
				%exitcond28 = icmp eq i32 %add9.us, %I
				br i1 %exitcond28, label %for.end10.loopexit, label %for.outer

				for.end10.loopexit:
				br label %for.end10

				for.end10:
				ret void
				}


				; CHECK-LABEL: test3
				; Tests complete unroll-and-jam of the outer loop
				; CHECK: for.outer:
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %j.021 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				; CHECK: %sum1.020 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				; CHECK: %j.021.1 = phi i32 [ 0, %for.outer ], [ %inc.1, %for.inner ]
				; CHECK: %sum1.020.1 = phi i32 [ 0, %for.outer ], [ %add.1, %for.inner ]
				; CHECK: %j.021.2 = phi i32 [ 0, %for.outer ], [ %inc.2, %for.inner ]
				; CHECK: %sum1.020.2 = phi i32 [ 0, %for.outer ], [ %add.2, %for.inner ]
				; CHECK: %j.021.3 = phi i32 [ 0, %for.outer ], [ %inc.3, %for.inner ]
				; CHECK: %sum1.020.3 = phi i32 [ 0, %for.outer ], [ %add.3, %for.inner ]
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add.lcssa = phi i32 [ %add, %for.inner ]
				; CHECK: %add.lcssa.1 = phi i32 [ %add.1, %for.inner ]
				; CHECK: %add.lcssa.2 = phi i32 [ %add.2, %for.inner ]
				; CHECK: %add.lcssa.3 = phi i32 [ %add.3, %for.inner ]
				; CHECK: br label %for.end
				; CHECK: for.end:
				define void @test3(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp eq i32 %J, 0
				br i1 %cmp, label %for.end, label %for.preheader

				for.preheader:
				br label %for.outer

				for.outer:
				%i.022 = phi i32 [ %add8, %for.latch ], [ 0, %for.preheader ]
				br label %for.inner

				for.inner:
				%j.021 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%sum1.020 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i32 %j.021
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !5
				%sub = add i32 %sum1.020, 10
				%add = sub i32 %sub, %0
				%inc = add nuw i32 %j.021, 1
				%exitcond = icmp eq i32 %inc, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i32 %i.022
				store i32 %add, i32* %arrayidx6, align 4, !tbaa !5
				%add8 = add nuw nsw i32 %i.022, 1
				%exitcond23 = icmp eq i32 %add8, 4
				br i1 %exitcond23, label %for.end, label %for.outer

				for.end:
				ret void
				}

				; CHECK-LABEL: test4
				; Tests complete unroll-and-jam with a trip count of 1
				; CHECK: for.outer:
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %j.021 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				; CHECK: %sum1.020 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				; CHECK: br i1 %exitcond, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add.lcssa = phi i32 [ %add, %for.inner ]
				; CHECK: br label %for.end
				; CHECK: for.end:
				define void @test4(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp eq i32 %J, 0
				br i1 %cmp, label %for.end, label %for.preheader

				for.preheader:
				br label %for.outer

				for.outer:
				%i.022 = phi i32 [ %add8, %for.latch ], [ 0, %for.preheader ]
				br label %for.inner

				for.inner:
				%j.021 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%sum1.020 = phi i32 [ 0, %for.outer ], [ %add, %for.inner ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i32 %j.021
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !5
				%sub = add i32 %sum1.020, 10
				%add = sub i32 %sub, %0
				%inc = add nuw i32 %j.021, 1
				%exitcond = icmp eq i32 %inc, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i32 %i.022
				store i32 %add, i32* %arrayidx6, align 4, !tbaa !5
				%add8 = add nuw nsw i32 %i.022, 1
				%exitcond23 = icmp eq i32 %add8, 1
				br i1 %exitcond23, label %for.end, label %for.outer

				for.end:
				ret void
				}





				; CHECK-LABEL: test5
				; Multiple SubLoopBlocks
				; CHECK: for.outer:
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %inc8.sink15 = phi i32 [ 0, %for.outer ], [ %inc8, %for.inc.1 ]
				; CHECK: %inc8.sink15.1 = phi i32 [ 0, %for.outer ], [ %inc8.1, %for.inc.1 ]
				; CHECK: br label %for.inner2
				; CHECK: for.inner2:
				; CHECK: br i1 %tobool, label %for.cond4, label %for.inc
				; CHECK: for.cond4:
				; CHECK: br i1 %tobool.1, label %for.cond4a, label %for.inc
				; CHECK: for.cond4a:
				; CHECK: br label %for.inc
				; CHECK: for.inc:
				; CHECK: br i1 %tobool.11, label %for.cond4.1, label %for.inc.1
				; CHECK: for.latch:
				; CHECK: br label %for.end
				; CHECK: for.end:
				; CHECK: ret i32 0
				; CHECK: for.cond4.1:
				; CHECK: br i1 %tobool.1.1, label %for.cond4a.1, label %for.inc.1
				; CHECK: for.cond4a.1:
				; CHECK: br label %for.inc.1
				; CHECK: for.inc.1:
				; CHECK: br i1 %exitcond.1, label %for.latch, label %for.inner
				@a = hidden local_unnamed_addr global [1 x i32] zeroinitializer, align 4
				define i32 @test5() local_unnamed_addr #0 {
				entry:
				br label %for.outer

				for.outer:
				%.sink16 = phi i32 [ 0, %entry ], [ %add, %for.latch ]
				br label %for.inner

				for.inner:
				%inc8.sink15 = phi i32 [ 0, %for.outer ], [ %inc8, %for.inc ]
				br label %for.inner2

				for.inner2:
				%l1 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @a, i32 0, i32 0), align 4
				%tobool = icmp eq i32 %l1, 0
				br i1 %tobool, label %for.cond4, label %for.inc

				for.cond4:
				%l0 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @a, i32 1, i32 0), align 4
				%tobool.1 = icmp eq i32 %l0, 0
				br i1 %tobool.1, label %for.cond4a, label %for.inc

				for.cond4a:
				br label %for.inc

				for.inc:
				%l2 = phi i32 [ 0, %for.inner2 ], [ 1, %for.cond4 ], [ 2, %for.cond4a ]
				%inc8 = add nuw nsw i32 %inc8.sink15, 1
				%exitcond = icmp eq i32 %inc8, 3
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%.lcssa = phi i32 [ %l2, %for.inc ]
				%conv11 = and i32 %.sink16, 255
				%add = add nuw nsw i32 %conv11, 4
				%cmp = icmp eq i32 %add, 8
				br i1 %cmp, label %for.end, label %for.outer

				for.end:
				%.lcssa.lcssa = phi i32 [ %.lcssa, %for.latch ]
				ret i32 0
				}




				; CHECK-LABEL: test6
				; Test odd uses of phi nodes
				; CHECK: for.outer:
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: br i1 %exitcond.4, label %for.inner, label %for.latch
				; CHECK: for.latch:
				; CHECK: br label %for.end
				; CHECK: for.end:
				; CHECK: ret i32 0
				@f = hidden local_unnamed_addr global i32 0, align 4
				define i32 @test6() local_unnamed_addr #0 {
				entry:
				%f.promoted10 = load i32, i32* @f, align 4, !tbaa !5
				br label %for.outer

				for.outer:
				%p0 = phi i32 [ %f.promoted10, %entry ], [ 2, %for.latch ]
				%inc5.sink9 = phi i32 [ 2, %entry ], [ %inc5, %for.latch ]
				br label %for.inner

				for.inner:
				%p1 = phi i32 [ %p0, %for.outer ], [ 2, %for.inner ]
				%inc.sink8 = phi i32 [ 0, %for.outer ], [ %inc, %for.inner ]
				%inc = add nuw nsw i32 %inc.sink8, 1
				%exitcond = icmp ne i32 %inc, 7
				br i1 %exitcond, label %for.inner, label %for.latch, !llvm.loop !1

				for.latch:
				%.lcssa = phi i32 [ %p1, %for.inner ]
				%inc5 = add nuw nsw i32 %inc5.sink9, 1
				%exitcond11 = icmp ne i32 %inc5, 7
				br i1 %exitcond11, label %for.outer, label %for.end

				for.end:
				%.lcssa.lcssa = phi i32 [ %.lcssa, %for.latch ]
				%inc.lcssa.lcssa = phi i32 [ 7, %for.latch ]
				ret i32 0
				}



				; CHECK-LABEL: test7
				; Has a positive dependency between two stores. Still valid.
				; The negative dependency is in unroll-and-jam-disabled.ll
				; CHECK: for.outer:
				; CHECK: %i = phi i32 [ %add.3, %for.latch ], [ 0, %for.preheader.new ]
				; CHECK: %niter = phi i32 [ %unroll_iter, %for.preheader.new ], [ %niter.nsub.3, %for.latch ]
				; CHECK: br label %for.inner
				; CHECK: for.latch:
				; CHECK: %add9.lcssa = phi i32 [ %add9, %for.inner ]
				; CHECK: %add9.lcssa.1 = phi i32 [ %add9.1, %for.inner ]
				; CHECK: %add9.lcssa.2 = phi i32 [ %add9.2, %for.inner ]
				; CHECK: %add9.lcssa.3 = phi i32 [ %add9.3, %for.inner ]
				; CHECK: br i1 %niter.ncmp.3, label %for.end.loopexit.unr-lcssa.loopexit, label %for.outer
				; CHECK: for.inner:
				; CHECK: %sum = phi i32 [ 0, %for.outer ], [ %add9, %for.inner ]
				; CHECK: %j = phi i32 [ 0, %for.outer ], [ %add10, %for.inner ]
				; CHECK: %sum.1 = phi i32 [ 0, %for.outer ], [ %add9.1, %for.inner ]
				; CHECK: %j.1 = phi i32 [ 0, %for.outer ], [ %add10.1, %for.inner ]
				; CHECK: %sum.2 = phi i32 [ 0, %for.outer ], [ %add9.2, %for.inner ]
				; CHECK: %j.2 = phi i32 [ 0, %for.outer ], [ %add10.2, %for.inner ]
				; CHECK: %sum.3 = phi i32 [ 0, %for.outer ], [ %add9.3, %for.inner ]
				; CHECK: %j.3 = phi i32 [ 0, %for.outer ], [ %add10.3, %for.inner ]
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.end.loopexit.unr-lcssa.loopexit:
				define void @test7(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp128 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp128, %cmp
				br i1 %or.cond, label %for.preheader, label %for.end

				for.preheader:
				br label %for.outer

				for.outer:
				%i = phi i32 [ %add, %for.latch ], [ 0, %for.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i
				store i32 0, i32* %arrayidx, align 4, !tbaa !5
				%add = add nuw i32 %i, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i32 %add
				store i32 2, i32* %arrayidx2, align 4, !tbaa !5
				br label %for.inner

				for.latch:
				store i32 %add9, i32* %arrayidx, align 4, !tbaa !5
				%exitcond30 = icmp eq i32 %add, %I
				br i1 %exitcond30, label %for.end, label %for.outer

				for.inner:
				%sum = phi i32 [ 0, %for.outer ], [ %add9, %for.inner ]
				%j = phi i32 [ 0, %for.outer ], [ %add10, %for.inner ]
				%arrayidx7 = getelementptr inbounds i32, i32* %B, i32 %j
				%l1 = load i32, i32* %arrayidx7, align 4, !tbaa !5
				%add9 = add i32 %l1, %sum
				%add10 = add nuw i32 %j, 1
				%exitcond = icmp eq i32 %add10, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.end:
				ret void
				}



				; CHECK-LABEL: test8
				; Same as test7 with an extra outer loop nest
				; CHECK: for.outest:
				; CHECK: br label %for.outer
				; CHECK: for.outer:
				; CHECK: %i = phi i32 [ %add.3, %for.latch ], [ 0, %for.outest.new ]
				; CHECK: %niter = phi i32 [ %unroll_iter, %for.outest.new ], [ %niter.nsub.3, %for.latch ]
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %sum = phi i32 [ 0, %for.outer ], [ %add9, %for.inner ]
				; CHECK: %j = phi i32 [ 0, %for.outer ], [ %add10, %for.inner ]
				; CHECK: %sum.1 = phi i32 [ 0, %for.outer ], [ %add9.1, %for.inner ]
				; CHECK: %j.1 = phi i32 [ 0, %for.outer ], [ %add10.1, %for.inner ]
				; CHECK: %sum.2 = phi i32 [ 0, %for.outer ], [ %add9.2, %for.inner ]
				; CHECK: %j.2 = phi i32 [ 0, %for.outer ], [ %add10.2, %for.inner ]
				; CHECK: %sum.3 = phi i32 [ 0, %for.outer ], [ %add9.3, %for.inner ]
				; CHECK: %j.3 = phi i32 [ 0, %for.outer ], [ %add10.3, %for.inner ]
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add9.lcssa = phi i32 [ %add9, %for.inner ]
				; CHECK: %add9.lcssa.1 = phi i32 [ %add9.1, %for.inner ]
				; CHECK: %add9.lcssa.2 = phi i32 [ %add9.2, %for.inner ]
				; CHECK: %add9.lcssa.3 = phi i32 [ %add9.3, %for.inner ]
				; CHECK: br i1 %niter.ncmp.3, label %for.cleanup.unr-lcssa.loopexit, label %for.outer
				; CHECK: for.cleanup.epilog-lcssa:
				; CHECK: br label %for.cleanup
				; CHECK: for.cleanup:
				; CHECK: br i1 %exitcond41, label %for.end.loopexit, label %for.outest
				; CHECK: for.end.loopexit:
				; CHECK: br label %for.end
				define void @test8(i32 %I, i32 %J, i32* noalias nocapture %A, i32* noalias nocapture readonly %B) local_unnamed_addr #0 {
				entry:
				%cmp = icmp eq i32 %J, 0
				%cmp336 = icmp eq i32 %I, 0
				%or.cond = or i1 %cmp, %cmp336
				br i1 %or.cond, label %for.end, label %for.preheader

				for.preheader:
				br label %for.outest

				for.outest:
				%x.038 = phi i32 [ %inc, %for.cleanup ], [ 0, %for.preheader ]
				br label %for.outer

				for.outer:
				%i = phi i32 [ %add, %for.latch ], [ 0, %for.outest ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i
				store i32 0, i32* %arrayidx, align 4, !tbaa !5
				%add = add nuw i32 %i, 1
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i32 %add
				store i32 2, i32* %arrayidx6, align 4, !tbaa !5
				br label %for.inner

				for.inner:
				%sum = phi i32 [ 0, %for.outer ], [ %add9, %for.inner ]
				%j = phi i32 [ 0, %for.outer ], [ %add10, %for.inner ]
				%arrayidx11 = getelementptr inbounds i32, i32* %B, i32 %j
				%l1 = load i32, i32* %arrayidx11, align 4, !tbaa !5
				%add9 = add i32 %l1, %sum
				%add10 = add nuw i32 %j, 1
				%exitcond = icmp eq i32 %add10, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				store i32 %add9, i32* %arrayidx, align 4, !tbaa !5
				%exitcond39 = icmp eq i32 %add, %I
				br i1 %exitcond39, label %for.cleanup, label %for.outer

				for.cleanup:
				%inc = add nuw nsw i32 %x.038, 1
				%exitcond41 = icmp eq i32 %inc, 5
				br i1 %exitcond41, label %for.end, label %for.outest


				for.end:
				ret void
				}

				; CHECK-LABEL: test9
				; Same as test1 with tbaa, not noalias
				; CHECK: for.outer:
				; CHECK: %i.us = phi i32 [ %add8.us.3, %for.latch ], [ 0, %for.outer.preheader.new ]
				; CHECK: %niter = phi i32 [ %unroll_iter, %for.outer.preheader.new ], [ %niter.nsub.3, %for.latch ]
				; CHECK: br label %for.inner
				; CHECK: for.inner:
				; CHECK: %j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				; CHECK: %sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				; CHECK: %j.us.1 = phi i32 [ 0, %for.outer ], [ %inc.us.1, %for.inner ]
				; CHECK: %sum1.us.1 = phi i32 [ 0, %for.outer ], [ %add.us.1, %for.inner ]
				; CHECK: %j.us.2 = phi i32 [ 0, %for.outer ], [ %inc.us.2, %for.inner ]
				; CHECK: %sum1.us.2 = phi i32 [ 0, %for.outer ], [ %add.us.2, %for.inner ]
				; CHECK: %j.us.3 = phi i32 [ 0, %for.outer ], [ %inc.us.3, %for.inner ]
				; CHECK: %sum1.us.3 = phi i32 [ 0, %for.outer ], [ %add.us.3, %for.inner ]
				; CHECK: br i1 %exitcond.3, label %for.latch, label %for.inner
				; CHECK: for.latch:
				; CHECK: %add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				; CHECK: %add.us.lcssa.1 = phi i32 [ %add.us.1, %for.inner ]
				; CHECK: %add.us.lcssa.2 = phi i32 [ %add.us.2, %for.inner ]
				; CHECK: %add.us.lcssa.3 = phi i32 [ %add.us.3, %for.inner ]
				; CHECK: br i1 %niter.ncmp.3, label %for.end.loopexit.unr-lcssa.loopexit, label %for.outer
				; CHECK: for.end.loopexit.unr-lcssa.loopexit:
				define void @test9(i32 %I, i32 %J, i32* nocapture %A, i16* nocapture readonly %B) #0 {
				entry:
				%cmp = icmp ne i32 %J, 0
				%cmp122 = icmp ne i32 %I, 0
				%or.cond = and i1 %cmp, %cmp122
				br i1 %or.cond, label %for.outer.preheader, label %for.end

				for.outer.preheader:
				br label %for.outer

				for.outer:
				%i.us = phi i32 [ %add8.us, %for.latch ], [ 0, %for.outer.preheader ]
				br label %for.inner

				for.inner:
				%j.us = phi i32 [ 0, %for.outer ], [ %inc.us, %for.inner ]
				%sum1.us = phi i32 [ 0, %for.outer ], [ %add.us, %for.inner ]
				%arrayidx.us = getelementptr inbounds i16, i16* %B, i32 %j.us
				%0 = load i16, i16* %arrayidx.us, align 4, !tbaa !9
				%sext = sext i16 %0 to i32
				%add.us = add i32 %sext, %sum1.us
				%inc.us = add nuw i32 %j.us, 1
				%exitcond = icmp eq i32 %inc.us, %J
				br i1 %exitcond, label %for.latch, label %for.inner, !llvm.loop !1

				for.latch:
				%add.us.lcssa = phi i32 [ %add.us, %for.inner ]
				%arrayidx6.us = getelementptr inbounds i32, i32* %A, i32 %i.us
				store i32 %add.us.lcssa, i32* %arrayidx6.us, align 4, !tbaa !5
				%add8.us = add nuw i32 %i.us, 1
				%exitcond25 = icmp eq i32 %add8.us, %I
				br i1 %exitcond25, label %for.end.loopexit, label %for.outer

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}



				; CHECK-LABEL: test10
				; Be careful not to incorrectly update the exit phi nodes
				; CHECK: %dec19.lcssa.lcssa.lcssa = phi i64 [ 0, %for.inc24 ]
				%struct.a = type { i64 }
				@g = common local_unnamed_addr global %struct.a zeroinitializer, align 8
				@c = common local_unnamed_addr global [1 x i8] zeroinitializer, align 1
				; Function Attrs: noinline norecurse nounwind uwtable
				define signext i16 @test10(i32 %k) local_unnamed_addr #0 {
				entry:
				%0 = load i8, i8* getelementptr inbounds ([1 x i8], [1 x i8]* @c, i64 0, i64 0), align 1
				%tobool9 = icmp eq i8 %0, 0
				%tobool13 = icmp ne i32 %k, 0
				br label %for.body

				for.body: ; preds = %entry, %for.inc24
				%storemerge82 = phi i64 [ 0, %entry ], [ %inc25, %for.inc24 ]
				br label %for.body2

				for.body2: ; preds = %for.body, %for.inc21
				%storemerge2881 = phi i64 [ 4, %for.body ], [ %dec22, %for.inc21 ]
				br i1 %tobool9, label %for.body2.split, label %for.body2.split.us

				for.body2.split.us: ; preds = %for.body2
				br i1 %tobool13, label %for.inc21, label %for.inc21.loopexit83

				for.body2.split: ; preds = %for.body2
				br i1 %tobool13, label %for.inc21, label %for.inc21.loopexit85

				for.inc21.loopexit83: ; preds = %for.body2.split.us
				%storemerge31.us37.lcssa.lcssa = phi i64 [ 0, %for.body2.split.us ]
				br label %for.inc21

				for.inc21.loopexit85: ; preds = %for.body2.split
				%storemerge31.lcssa.lcssa87 = phi i64 [ 0, %for.body2.split ]
				%storemerge30.lcssa.lcssa86 = phi i32 [ 0, %for.body2.split ]
				br label %for.inc21

				for.inc21: ; preds = %for.body2.split, %for.body2.split.us, %for.inc21.loopexit85, %for.inc21.loopexit83
				%storemerge31.lcssa.lcssa = phi i64 [ %storemerge31.us37.lcssa.lcssa, %for.inc21.loopexit83 ], [ %storemerge31.lcssa.lcssa87, %for.inc21.loopexit85 ], [ 4, %for.body2.split.us ], [ 4, %for.body2.split ]
				%storemerge30.lcssa.lcssa = phi i32 [ 0, %for.inc21.loopexit83 ], [ %storemerge30.lcssa.lcssa86, %for.inc21.loopexit85 ], [ 0, %for.body2.split.us ], [ 0, %for.body2.split ]
				%dec22 = add nsw i64 %storemerge2881, -1
				%tobool = icmp eq i64 %dec22, 0
				br i1 %tobool, label %for.inc24, label %for.body2, !llvm.loop !1

				for.inc24: ; preds = %for.inc21
				%storemerge31.lcssa.lcssa.lcssa = phi i64 [ %storemerge31.lcssa.lcssa, %for.inc21 ]
				%storemerge30.lcssa.lcssa.lcssa = phi i32 [ %storemerge30.lcssa.lcssa, %for.inc21 ]
				%inc25 = add nuw nsw i64 %storemerge82, 1
				%exitcond = icmp ne i64 %inc25, 5
				br i1 %exitcond, label %for.body, label %for.end26

				for.end26: ; preds = %for.inc24
				%dec19.lcssa.lcssa.lcssa = phi i64 [ 0, %for.inc24 ]
				%storemerge31.lcssa.lcssa.lcssa.lcssa = phi i64 [ %storemerge31.lcssa.lcssa.lcssa, %for.inc24 ]
				%storemerge30.lcssa.lcssa.lcssa.lcssa = phi i32 [ %storemerge30.lcssa.lcssa.lcssa, %for.inc24 ]
				store i64 %dec19.lcssa.lcssa.lcssa, i64* getelementptr inbounds (%struct.a, %struct.a* @g, i64 0, i32 0), align 8
				ret i16 0
				}


				attributes #0 = { noinline norecurse nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "denormal-fp-math"="preserve-sign" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-m33" "target-features"="+d16,+dsp,+fp-armv8,+fp-only-sp,+hwdiv,+thumb-mode,-crc,-dotprod,-hwdiv-arm,-ras" "unsafe-fp-math"="false" "use-soft-float"="false" }

				!1 = !{!1, !2}
				!2 = !{!"llvm.loop.unroll.disable"}
				!5 = !{!6, !6, i64 0}
				!6 = !{!"int", !7, i64 0}
				!7 = !{!"omnipotent char", !8, i64 0}
				!8 = !{!"Simple C/C++ TBAA"}
				!9 = !{!10, !10, i64 0}
				!10 = !{!"short", !7, i64 0}
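
As a rough illustration of what test1 and its CHECK lines exercise, the sketch below writes the same loop nest shape in plain C++ and compares it with a hand unrolled-and-jammed version. It is not part of the patch. The factor of 4 is taken from the `.1`, `.2`, `.3` suffixes in the CHECK lines; the function names `reference` and `unrollAndJam4`, and the use of `std::vector<uint32_t>`, are assumptions made only for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reference loop nest with the shape of @test1:
//   for (i) { sum = 0; for (j) sum += B[j]; A[i] = sum; }
// Unsigned arithmetic is used so that wrap-around is well defined,
// matching the plain `add` (no nsw) in the IR.
void reference(uint32_t I, uint32_t J, std::vector<uint32_t> &A,
               const std::vector<uint32_t> &B) {
  if (I == 0 || J == 0)
    return; // same guard as the entry block of @test1
  for (uint32_t i = 0; i < I; ++i) {
    uint32_t sum = 0;              // fore block
    for (uint32_t j = 0; j < J; ++j)
      sum += B[j];                 // sub-loop block
    A[i] = sum;                    // aft block
  }
}

// Hand-written unroll-and-jam by 4 (hypothetical helper): four outer
// iterations are unrolled and their inner loops fused into one.
void unrollAndJam4(uint32_t I, uint32_t J, std::vector<uint32_t> &A,
                   const std::vector<uint32_t> &B) {
  if (I == 0 || J == 0)
    return;
  uint32_t i = 0;
  // Main loop: runs while at least four outer iterations remain.
  for (; I - i >= 4; i += 4) {
    // Fore blocks of iterations i .. i+3.
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    // Jammed inner loop: one trip over j serves all four copies.
    for (uint32_t j = 0; j < J; ++j) {
      s0 += B[j];
      s1 += B[j];
      s2 += B[j];
      s3 += B[j];
    }
    // Aft blocks of iterations i .. i+3.
    A[i] = s0;
    A[i + 1] = s1;
    A[i + 2] = s2;
    A[i + 3] = s3;
  }
  // Remainder (epilogue): at most three leftover outer iterations.
  for (; i < I; ++i) {
    uint32_t sum = 0;
    for (uint32_t j = 0; j < J; ++j)
      sum += B[j];
    A[i] = sum;
  }
}
```

The four accumulators correspond loosely to `%sum1.us` through `%sum1.us.3` in the CHECK lines, and the trailing loop to the `.epil` blocks. The real pass works on IR blocks and must also prove that reordering the fore, sub-loop and aft blocks is safe, which this sketch simply assumes.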

Revision Contents

Diff 129294

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Transforms/Utils/UnrollLoop.h

lib/Analysis/DependenceAnalysis.cpp

lib/Target/ARM/ARMTargetTransformInfo.cpp

lib/Transforms/Scalar/LICM.cpp

lib/Transforms/Scalar/LoopUnrollPass.cpp

lib/Transforms/Utils/CMakeLists.txt

lib/Transforms/Utils/LoopUnroll.cpp

lib/Transforms/Utils/LoopUnrollAndJam.cpp

lib/Transforms/Utils/LoopUtils.cpp

test/Other/new-pm-defaults.ll

test/Other/new-pm-thinlto-defaults.ll

test/Other/pass-pipelines.ll

test/Transforms/LoopUnroll/unroll-and-jam-disabled.ll

test/Transforms/LoopUnroll/unroll-and-jam-unprofitable.ll

test/Transforms/LoopUnroll/unroll-and-jam.ll
