Page MenuHomePhabricator

[LoopFlatten] Enable it by default
Needs ReviewPublic

Authored by SjoerdMeijer on Sep 17 2021, 3:31 AM.

Details

Summary

LoopFlatten improves a well known embedded benchmark with highly-popular industry applications with a few percentage points. But it is not restricted to just optimise a single benchmark case. Find below results for the llvm test suite and the number of loops it flattened:

Test                                                                       # Loops flattened
--------------------------------------------------------------------------------------------
MultiSource/Applications/JM/lencod/lencod                                  3
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg                       1
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg                 3
MultiSource/Applications/JM/ldecod/ldecod                                  1       
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000                         3       
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4                              17      
SingleSource/Benchmarks/Misc/himenobmtxpa                                  2       
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion  2
MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter         2
MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG                    20
MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder                       1
MicroBenchmarks/ImageProcessing/Blur/blur                                  2
MicroBenchmarks/ImageProcessing/Dither/Dither                              2
MicroBenchmarks/ImageProcessing/Dilate/Dilate                              2
MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG                       1
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC                 1
MicroBenchmarks/ImageProcessing/Interpolation/Interpolation                2
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk                             2
MultiSource/Benchmarks/Rodinia/backprop/backprop                           1
----------------------------------------------------------------------------------------------
Total                                                                     68

While the implementation of LoopFlatten recognises a few patterns and could be made more generic, I believe these numbers show that it's generic enough to trigger on a wide variety of code bases, making it worthwile to enable it by default.

LoopFlatten is a relatively simple pass, it e.g. doesn't implement a computationally expensive algorithm, and doesn't require more analysis than a
typical loop pass. Compile-times for the llvm test suite (ClamAV, 7zip, tramp3d-v4, kimwitu++, sqlite3, mafft, SPASS, lencod, Bullet) show a very minor increase of ~0.04% to 0.28%. There are cases that improve compile times, but I haven't analysed that and don't want to claim of course that in general it will improve compile-times.

We have LoopFlatten enable by default downstream for many years now, thus it should have had a lot of exposure and usage and we are not aware of any problems.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Sep 17 2021, 3:31 AM
SjoerdMeijer requested review of this revision.Sep 17 2021, 3:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2021, 3:31 AM

Compile-time: https://llvm-compile-time-tracker.com/compare.php?from=0fc624f029f568e91caf74d90abc5d8d971151c2&to=60d33cf095caea659c46e6a026ecf8899b3d60be&stat=instructions

I wasn't able to get numbers for the legacy PM, because this seems to either cause or expose a crash in IndVarSimplify on z51.c from consumer-typeset: https://llvm-compile-time-tracker.com/show_error.php?commit=60d33cf095caea659c46e6a026ecf8899b3d60be That should be resolved before landing this.

llvm/test/Other/new-pm-defaults.ll
181

As a drive-by comment for @asbirlea and @aeubanks, it looks like there's an opportunity here to not rerun LoopSimplify/LCSSA if we run multiple Loop/LoopNest pass managers back to back?

Compile-time: https://llvm-compile-time-tracker.com/compare.php?from=0fc624f029f568e91caf74d90abc5d8d971151c2&to=60d33cf095caea659c46e6a026ecf8899b3d60be&stat=instructions

I wasn't able to get numbers for the legacy PM, because this seems to either cause or expose a crash in IndVarSimplify on z51.c from consumer-typeset: https://llvm-compile-time-tracker.com/show_error.php?commit=60d33cf095caea659c46e6a026ecf8899b3d60be That should be resolved before landing this.

Many thanks for running and confirming those numbers @nikic!
And yes, I will have a look to see what's going on with that crash.

Nice! I think this is a good thing.

I have checked this on some code i care about and as far as i can tell
it pretty much always fails with Loop::isCanonical(), e.g. because
the loops produced by OpenMP lowering are non-canonical by definition.
This is probably not a blocker, but it does mean that the true impact
of this change (aka, does it regress cases and require costmodel tuning?)
will not be apparent until later.

SjoerdMeijer added a comment.EditedSep 17 2021, 5:54 AM

Nice! I think this is a good thing.

I have checked this on some code i care about and as far as i can tell
it pretty much always fails with Loop::isCanonical(), e.g. because
the loops produced by OpenMP lowering are non-canonical by definition.
This is probably not a blocker, but it does mean that the true impact
of this change (aka, does it regress cases and require costmodel tuning?)
will not be apparent until later.

Yeah, by design it is fairly specific in the cases it supports (but still triggers a lot). For example, PR40581 is about a case we don't yet support. If you have a different case, I would be happy if you can raise a PR, then I can have a look after this lands.

aeubanks added inline comments.Sep 17 2021, 9:20 AM
llvm/test/Other/new-pm-defaults.ll
181

we need to fix where we add LoopFlattenPass in PassBuilderPipelines.cpp

instead of

if (EnableLoopFlatten)
  FPM.addPass(createFunctionToLoopPassAdaptor(LoopFlattenPass()));

we should move it earlier to

if (EnableLoopFlatten)
  LPM2.addPass(LoopFlattenPass());

It would be good to have some performance testing for this too.

llvm/test/Other/new-pm-defaults.ll
181

+1 it should be added to a LPM.

nikic added inline comments.Sep 17 2021, 2:12 PM
llvm/test/Other/new-pm-defaults.ll
181

Oh, I wasn't aware that loop passes and loop nest passes can run in the same pass manager. That makes more sense indeed.

fwiw, I'm seeing a crash during bootstrap build, so this needs more testing for sure.

SjoerdMeijer added inline comments.Sep 20 2021, 12:19 AM
llvm/test/Other/new-pm-defaults.ll
181

Thanks for commenting on this and the suggestions! LoopFlatten was intentionally added to where it currently lives for the LPM. LoopFlatten removes an inner-loop, and a loop pass was not able to deal with that very well under the LPM. But this is probably out of date for the NPM, and LoopFlatten was made a LoopNest pass in D102904, so I guess that means we can just add to where you suggested.

I am addressing the crash in D110234.

With the bootstrap failure fixed in D110234 and another recently raised issue D110712, and having tested this more, I would like to pick this up again.

It would be good to have some performance testing for this too.

Like I mentioned in the description, this gives a really good improvement on an embedded benchmark, but is generic enough to trigger a lot in for example the llvm test suite (and other). Because LoopFlatten removes an inner-loop, it is unlikely LoopFlatten makes things worse, and should be a case of "it should give the same or better performance". Supporting this with some data:

Test# flattened loops% diff
MultiSource/Applications/JM/lencod/lencod.test3-0.28
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test1-9.04
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test3-5.75
MultiSource/Applications/JM/ldecod/ldecod.test10.97
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test30.29
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test17-0.37
SingleSource/Benchmarks/Misc/himenobmtxpa.test20.09
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test/322-1.27
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test/642-0.47
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test/1282-0.24
MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test/2562-0.18
MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test20-0.21
MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder.test1-0.84
MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test10.50
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test10.11
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test2-1.39
MultiSource/Benchmarks/Rodinia/backprop/backprop.test1-0.45

Negative numbers are reductions in exec times, so is better.
Take these numbers this with a little bit of salt because the test suite can be a bit noisy. But like I said, I think the take away message is that LoopFlatten is a nice simplification doing some good here and there (I actually haven't paid attention to it, but should help code-size a bit too I guess).

What do we think of this?

Try to reland this?

AMDGPU test fails? Please fix.

Try to reland this?

Thanks for looking again at this. This was never committed, so I am looking for an LGTM to do so. :)
But perhaps I didn't understand.

AMDGPU test fails? Please fix.

Ah yeah, thanks. My mistake, I didn't build with the amdgpu backend enabled, so didn't catch this. Will fix.

Fixed the amdgpu test.

Can you also update release notes?

Updated the ReleaseNotes.

Many thanks @nikic and @aeubanks for sorting out the "preserved analysis" side of things here, really appreciate it! With this fixed, i.e. D111350 and D111328, are we happy for this go in too?

nikic added inline comments.Oct 11 2021, 1:24 AM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

I'm somewhat confused by what is going on here. Why do we now calculate MemorySSA and why does LoopUnroll get split into a separate LPM?

SjoerdMeijer added inline comments.Oct 11 2021, 2:26 AM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

I have no idea and accepted this as something the pass manager decided to do.... I am also confused about both things: I have no idea why we need to rerun Memory SSA, and don't see why LoopUnroll is now run separately. I will look into this, see if I can get any wiser here....

SjoerdMeijer added inline comments.Oct 11 2021, 6:06 AM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

In lib/Passes/PassBuilderPipelines.cpp we have this:

if (EnableLoopFlatten)
  FPM.addPass(createFunctionToLoopPassAdaptor(LoopFlattenPass()));
// The loop passes in LPM2 (LoopFullUnrollPass) do not preserve MemorySSA.
// *All* loop passes must preserve it, in order to be able to use it.
FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM2),
                                            /*UseMemorySSA=*/false,
                                            /*UseBlockFrequencyInfo=*/false));

Since this is talking about MemorySSA this might be related, but there are so many things going on here and I am still looking, so don't know for certain. If e.g. @aeubanks has some tips or suggestions here, I would be happy to receive them. :)

aeubanks added inline comments.Oct 11 2021, 10:31 AM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

This is a legacy PM test.
I'm not super familiar with the legacy PM, but it's probably something to do with the fact that the legacy loop flatten pass is a function pass and perhaps doesn't preserve some analyses?
Could we perhaps just make this change for the new PM?

SjoerdMeijer added inline comments.Oct 11 2021, 11:10 AM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

I'm not super familiar with the legacy PM, but it's probably something to do with the fact that the legacy loop flatten pass is a function pass and perhaps doesn't preserve some analyses?

Yep, that's what I thought too, but didn't get to the bottom of it and can't "prove" it.

Could we perhaps just make this change for the new PM?

Thanks for the suggestion, I am happy to enable LoopFlatten by default only for the new PM, so will look into that. Diverging is probably not ideal, but since the legacy PM is deprecated I am happy to do that. The alternative is to "accept" that this is what the legacy PM does, but looks like we are not happy with that, so again, will do this only for the new PM.

Thanks for the suggestion, I am happy to enable LoopFlatten by default only for the new PM, so will look into that.

My team presently relies on loop flattening with a downstream compiler, and we presently use the LPM and will move to the NPM after the 14.x branch. It would be great if this functionality didn't diverge between the two, although I understand the frustration. At the very least, I would like to understand what is different, and we could perhaps enable by default downstream until we migrate to the NPM.

Thanks for the suggestion, I am happy to enable LoopFlatten by default only for the new PM, so will look into that.

My team presently relies on loop flattening with a downstream compiler, and we presently use the LPM and will move to the NPM after the 14.x branch. It would be great if this functionality didn't diverge between the two, although I understand the frustration. At the very least, I would like to understand what is different, and we could perhaps enable by default downstream until we migrate to the NPM.

There are already a good number of divergences between the two pass managers, I don't see the harm in keeping the LPM pipeline the same. Do you specifically care about this pass?

The changes to the test show that this would affect execution order of passes which could actually affect optimizations. People have requested that the LPM not regress even though it's deprecated, and I'm hesitant to touch LPM pass execution order.
I'd say that somebody cares about this pass and is using the LPM, they can investigate the weirdness and turn it on in a separate change for the LPM.

There are already a good number of divergences between the two pass managers, I don't see the harm in keeping the LPM pipeline the same. Do you specifically care about this pass?

Yes, we asked about loop flattening a year ago along with a jump threading improvement in the context of improving the Coremark, and @SjoerdMeijer indicated that there was a pending code review. I can appreciate that there are already divergences between the two pass managers. If there is a possibility of regression on the LPM in this case, then I can understand not enabling it by default. We can explore enabling it downstream, but any other information about impact to the LPM would be helpful and appreciated. Thanks!

There are already a good number of divergences between the two pass managers, I don't see the harm in keeping the LPM pipeline the same. Do you specifically care about this pass?

Yes, we asked about loop flattening a year ago along with a jump threading improvement in the context of improving the Coremark, and @SjoerdMeijer indicated that there was a pending code review. I can appreciate that there are already divergences between the two pass managers. If there is a possibility of regression on the LPM in this case, then I can understand not enabling it by default. We can explore enabling it downstream, but any other information about impact to the LPM would be helpful and appreciated. Thanks!

We could make the flag only affect the LPM and forcibly enable loop flatten in the NPM without a flag to turn it off. That way it'd be easier for downstream users to evaluate it with the LPM.

I am not sure I entirely follow, @alanphipps . We also rely on this pass, and for many years we have been using this with the LPM, and since the switch we are now using it under the NPM. This helps a benchmark case (and the other things I mentioned), and both under the LPM and the NPM this will trigger, so I am not sure what regressions we are talking about it. Also, if you're using this downstream, like us, then I assume you have a downstream change to enable it by default, just like us. So if we are going to enable this by default under the NPM, I am not sure if anything will change for you. But either way, I am happy to make this change:

We could make the flag only affect the LPM and forcibly enable loop flatten in the NPM without a flag to turn it off. That way it'd be easier for downstream users to evaluate it with the LPM.

thanks for the suggestion.

asbirlea added inline comments.Oct 11 2021, 1:58 PM
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
164

The reason for this change in pipeline is due to where LoopFlatten is added in legacy pass manager , along with LoopSimplifyCFG (llvm/lib/Transforms/IPO/PassManagerBuilder.cpp:472)

// We resume loop passes creating a second loop pipeline here.
if (EnableLoopFlatten) {
  MPM.add(createLoopFlattenPass()); // Flatten loops
  MPM.add(createLoopSimplifyCFGPass());
}

It looks like LoopSimplifyCFG requires MemorySSA, which explains why that is being computed and why the loop pipeline is split (LoopUnroll does not preserve MemorySSA do it's getting split out)

I don't think LoopSimplifyCFG should require MemorySSA, just preserve it if available. D111578 should simplify these tests.

I assume you have a downstream change to enable it by default, just like us. So if we are going to enable this by default under the NPM, I am not sure if anything will change for you

Yes, fair enough -- rereading the discussion, I was concerned that the MemorySSA issue for the LPM wasn't well understood and perhaps something was being overlooked in the implementation, but it sounds like the concern is more pertinent to performance impact/fluctuation on the LPM and not the implementation itself, which is complete. I think we are good enabling it downstream.

xbolva00 added a comment.EditedDec 29 2021, 8:39 AM

Reping.

Enable only for npm? lpm is dead and will be removed anyway.

nikic requested changes to this revision.Dec 29 2021, 9:39 AM

Marking this as blocked on D110057 for now.

This revision now requires changes to proceed.Dec 29 2021, 9:39 AM

Thanks for the ping. I really do want to finish this, but need to find some courage to deal with the pass manager business in D110057. ;-) But am going to take a look now.

With D110057 committed, is this now good to go too?

(I will check the test changes here after landing D110057, haven’t done that yet)

nikic added a comment.Dec 31 2021, 2:25 AM

New compile-time: https://llvm-compile-time-tracker.com/compare.php?from=6f45fe9851c673883b3a258351ee4997aa2c028c&to=439d1c9613828db24ad03eb415baad2e76b8913c&stat=instructions Looks ok to me.

We may want to drop the LoopSimplifyCFG run from the LegacyPM pipeline, as that doesn't match NewPM and seems to account for the actual codegen impact of this change (LoopFlatten itself doesn't seem to be doing much in practice).

Update of the tests after D110057.

Nice, thank you for sharing!
That indeed looks good I think.

We may want to drop the LoopSimplifyCFG run from the LegacyPM pipeline, as that doesn't match NewPM and seems to account for the actual codegen impact of this change (LoopFlatten itself doesn't seem to be doing much in practice).

I will look into that. I think I can do that separately.

I think all points have been addressed, so am looking for an LGTM .... anyone interested? :-)

Matt added a subscriber: Matt.Jan 5 2022, 3:21 PM

With your help, we have fixed quite a few things, like updating the MemorySSA state and moving it a LoopManager (and fixing a performance regression related to that).

With these things fixed, are we happy to enable LoopFlatten by default?

Thank you for resolving all the issues raised! Could you rebase this patch on ToT? (to include the MSSA update and the move to LPM1?)

I'm planning to run a performance check on this, to see what the expectation is for enabling LoopFlatten by default. Kindly allow this week for me to do the testing.

Thank you for resolving all the issues raised! Could you rebase this patch on ToT? (to include the MSSA update and the move to LPM1?)

Done.

I'm planning to run a performance check on this, to see what the expectation is for enabling LoopFlatten by default. Kindly allow this week for me to do the testing.

Okay, excellent, thank you very much for this @asbirlea!

ormris removed a subscriber: ormris.Jan 24 2022, 11:10 AM

Quick update: I'm not seeing any major run time regressions, but there are a few unrelated issues that may make the results I'm seeing incomplete.
I'm seeing some compile-time regressions, I'd be curious to see what the updated compiler-tracker results look like.

Hi @asbirlea, thanks for checking, much appreciated! About:

I'm seeing some compile-time regressions, I'd be curious to see what the updated compiler-tracker results look like.

Nikic ran compile time numbers, and they should still be the relevant/representative numbers as I don't think anything (relevant) has changed since then:

New compile-time: https://llvm-compile-time-tracker.com/compare.php?from=6f45fe9851c673883b3a258351ee4997aa2c028c&to=439d1c9613828db24ad03eb415baad2e76b8913c&stat=instructions Looks ok to me.

We may want to drop the LoopSimplifyCFG run from the LegacyPM pipeline, as that doesn't match NewPM and seems to account for the actual codegen impact of this change (LoopFlatten itself doesn't seem to be doing much in practice).

But I will double check.

Fresh numbers: https://llvm-compile-time-tracker.com/compare.php?from=0776f6e04d8c3823c31b8fa6fa7d1376a7009af1&to=9522ad62a3f461bc989d2cca3c2bf09b6dab4632&stat=instructions

I am a bit surprised it looks like to be a bit higher than @nikic last measurements as I don't think anything functionally has changed (https://llvm-compile-time-tracker.com/compare.php?from=6f45fe9851c673883b3a258351ee4997aa2c028c&to=439d1c9613828db24ad03eb415baad2e76b8913c&stat=instructions).

We made it a LPM and it is preserving MSSA since Nikita's measurements, but again, was not expecting this to change things. Will do some more experiments.

Gentle ping. Any further thoughts on this?