This is an archive of the discontinued LLVM Phabricator instance.

[LoopFlatten] Add a loop-flattening pass
ClosedPublic

Authored by SjoerdMeijer on Jan 22 2018, 3:45 AM.

Details

Summary

This is a simple pass that flattens nested loops.

The intention is to optimise loop nests like this, which together
access an array linearly:

for (int i = 0; i < N; ++i)
  for (int j = 0; j < M; ++j)
    f(A[i*M+j]);

into one loop:

for (int i = 0; i < (N*M); ++i)
  f(A[i]);

It can also flatten loops where the induction variables are not used
in the loop. This can help with codesize and runtime, especially on
simple cpus without advanced branch prediction.

This is only worth flattening if the induction variables are only used
in an expression like i*M+j. If they had any other uses, we would have
to insert a div/mod to reconstruct the original values, so this
wouldn't be profitable.

This pass was originally written by Oliver Stannard, but he had to
go do other super important things. This have been changed a little
since then. There is a patch to enable loop versioning that will
follow.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 22 2018, 3:45 AM
fhahn added a subscriber: fhahn.Jan 22 2018, 3:47 AM

Hi Dave,

Do you have any performance and compile time overhead numbers?

cheers,
sam

samparker added inline comments.Jan 22 2018, 5:35 AM
lib/Transforms/Scalar/LoopFlatten.cpp
115

Why are these predicates valid and not any others?

samparker added inline comments.Jan 23 2018, 1:35 AM
lib/Transforms/Scalar/LoopFlatten.cpp
423

Couldn't you just use 'OuterLoop->isLoopInvariant(InnerLimit)?

Hello.

So when this comes up, it helps to simplify loops a little. The fourinarow test in the testsuite is helped out by this, decreasing the total time at -Os on a M23 by 1.5% (at -O3 the whole loop is just unrolled either way). It also helps knock off 32bytes off the total filesize. Both these results are a little higher than I would expect, considering it's only flattening a single loop.

For an M33, something like this loop takes 880 cycles when flattened, compared to 1350 cycles from before. Both LSR and the registry allocator seem to have a better time when they don't have to deal with outer loops, along with the obvious benefits of removing an outer level of compare/branches/IV etc.

for (int i=0; i<N; i++)
  for (int j=0; j<N; j++)
    V[i*N+j] += val;

I don't have compile time numbers yet, but I don't expect this pass to be very heavy with the checks we are doing. I'll look into it to be sure.

When it comes up, it helps a little. Not much, but hopefully not an expensive pass to do so :)

lib/Transforms/Scalar/LoopFlatten.cpp
115

My understanding is that indvars simplify will canonicalise this check to NE. I'll make sure this is actually correct and update things if I can get other forms to come up.

423

Nice. Like it.

dmgreen edited the summary of this revision. (Show Details)Jan 24 2018, 9:35 AM
dmgreen updated this revision to Diff 131970.Jan 30 2018, 8:12 AM

Added some isLoopInvariant's and updated the checks for compare predicates a little, as per review comments.

bkramer added a subscriber: bkramer.Feb 7 2020, 4:56 AM

Was this ever committed? Loop flattening looks like something that's useful to have.

Was this ever committed?

No

Loop flattening looks like something that's useful to have.

+1, https://bugs.llvm.org/show_bug.cgi?id=40581.

dmgreen marked an inline comment as done.Feb 7 2020, 6:33 AM

Nope. Reviews welcome :)

I guess this is super old and could do with a rebase. I'll try and give that a go.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2020, 7:21 AM
samparker added inline comments.Feb 10 2020, 2:11 AM
llvm/lib/Transforms/Scalar/LoopFlatten.cpp
141 ↗(On Diff #243166)

How are the limits decided for the number of users? My counting seems to be off-by-one for both the compare and the increment.

150 ↗(On Diff #243166)

Doesn't the pattern matching make this check redundant?

213 ↗(On Diff #243166)

Do we check for LCSSA? Or is this covered by it being a 'simple' loop?

260 ↗(On Diff #243166)

reduce nesting here?

390 ↗(On Diff #243166)

Is inbounds not enough? Also, why do we return on the first valid gep, don't we have to check all the users?

dmgreen marked 5 inline comments as done.May 3 2020, 1:32 AM

Sorry for the delay here. I asked for review when I got some suddenly realized it had been years since I actually looked at this (and didn't write it to begin with!) And I apparently didn't have the time right now to look into the details again.

On a conceptual level, it would also be good if it wasn't just me and "the person sat opposite me" (current situation withstanding) that looked at this. I believe large(ish) target independent changes like this should preferably get looked at by more than just one group.

If there is anyone else out there that is interested in it, please feel free to take a look or take it off my hands :)

llvm/lib/Transforms/Scalar/LoopFlatten.cpp
141 ↗(On Diff #243166)

Increment will be used by Phi and Cmp, Cmp will be used by Br I think.

150 ↗(On Diff #243166)

You are likely correct, but safer to keep it around?

213 ↗(On Diff #243166)

This being a LoopPass implies LCSSA form.

260 ↗(On Diff #243166)

Sounds good.

390 ↗(On Diff #243166)

I believe the logic is that if we find a GEP that is inbounds, from that we can prove that we don't overflow.

hintonda removed a subscriber: hintonda.May 4 2020, 12:34 PM
zzheng added a subscriber: zzheng.Sep 23 2020, 12:20 PM
reames added a subscriber: reames.Sep 28 2020, 2:10 PM

Just some random thoughts after glancing at this for a few moments. Don't take them too seriously.

You mention the alternate urem/udiv form and reject it as non-profitable. While I see your point, I also see code that is structured in a way which ties legality and profitability. I'd suggest considering whether you could separate cost model and mechanism. As one for instance, could you have a transform which produced the urem/udiv form and then simplified them away if possible? And then a default cost model which only allowed the transform in that case? The advantage of structuring it that way would be ease of testing, and later extension.

Another alternative to the urem/udiv form would be to introduce 3 IVs - one overall, one current outer, and one current inner. Instead of using an explicit udiv/urem, you could generate code which looks something like:
int inner = 0, outer = 0;
for (int i = 0; i < M * N; i++) {

{ /* body of inner */}
inner++;
if (inner == N) {
  inner = 0;
  outer++;
}

}

This is, of course, just a weird way of writing a nested loop, and LoopSimplify will probably try to turn it back into a nested loop. However, if you knew part of inner was going to simplify, the transformation might be profitable. The advantage to this form is that since it just a weirdly written nested loop, the cost modelling could be slightly fuzzier as there's little danger of getting horrible performance either way.

Of course, an ideal world, LSR would reliably produce the later form from the former. So, that might be worth exploring too. :)

If you can, please make sure that at least the inner loop allows multiple exiting blocks. Multiple exits are rather common in some workloads, so supporting them is appreciated if it's straight forward. (For this transform, it should be, at least in theory.)

Finally, make sure you consider the nest formation code in LoopSimplify. We need to make sure that the two places agree on what profitable nests are. We certainly wouldn't want them fighting each other!

SjoerdMeijer commandeered this revision.Sep 30 2020, 1:45 AM
SjoerdMeijer added a reviewer: dmgreen.

With Dave's and Oliver's permission I am commandeering this because I really would like to see this getting committed soonish and I have some bandwidth to progress this.

I will address the minor, earlier comments.
Thanks @reames for your thoughts. I will have a look at passes potentially fighting each other. Agreed about separating analysis and transforming, will look at this as a follow up once we've go something in.

Rebased

You seem to have lost the pass itself during rebase.

Arg, silly! Thanks for letting me know.

That's rebased, am now addressing minor issues.

This addresses minor issues from @samparker.
@dmgreen had already answered other question.

I am not aware of any other outstanding comments/concerns, but will double check.

I've added the test cases from PR40581. Test v0 does not trigger yet, test v1 triggers. I propose adding support for v0 once we've got something in-tree.

samparker accepted this revision.Oct 1 2020, 3:04 AM

Maybe we could use SCEV for the overflow checks? ;P
Code looks good to me and we've been running this downstream for a few years.

This revision is now accepted and ready to land.Oct 1 2020, 3:04 AM

Thanks Sam. And in addition to what you wrote, this is not enabled by default.
I plan to iterate on this in-tree, and then we can start thinking about the default, but that's TBD I guess.

This revision was automatically updated to reflect the committed changes.