This is an archive of the discontinued LLVM Phabricator instance.

[OldPM] Pass manager: run SROA after (simple) loop unrolling
ClosedPublic

Authored by lebedev.ri on Sep 19 2020, 11:25 AM.

Details

Summary

I have stumbled into this pretty accidentally, when rewriting some spaghetti-like code
into something more structured, which involved using some std::array<>s.
And to my surprise, the allocas remained, causing about +160% perf regression.

https://llvm-compile-time-tracker.com/compare.php?from=bb6f4d32aac3eecb51909f4facc625219307ee68&to=d563e66f40f9d4d145cb2050e41cb961e2b37785&stat=instructions
suggests that this has geomean compile-time cost of +0.08%.

This fixes PR40011, PR42794 and probably some other reports.

Diff Detail

Event Timeline

lebedev.ri created this revision.Sep 19 2020, 11:25 AM
lebedev.ri requested review of this revision.Sep 19 2020, 11:25 AM

Resolves also https://bugs.llvm.org/show_bug.cgi?id=42794?

But +1 for this change, we should have late sroa.

Also check llvm/test/Other/unroll-sroa.ll

Looks like it does, yes.

But +1 for this change, we should have late sroa.

Also check llvm/test/Other/unroll-sroa.ll

check-llvm passes, what should i look for specifically?
(didn't look at clang tests yet)

I meant added run line for OLDPM in unroll-sroa.ll.

lebedev.ri edited the summary of this revision. (Show Details)Sep 19 2020, 11:54 AM

geomean compile-time cost of +0.08%

Thanks. This is totally fine.

lebedev.ri edited the summary of this revision. (Show Details)Sep 19 2020, 11:58 AM

I'm afraid it doesn't seem to help that one, no.

Fixing a few clang tests and updating one more llvm test to check this also.

Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2020, 12:37 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
fhahn added a comment.Sep 20 2020, 3:50 AM

Surprising this causes only such a small perf regression. I guess it should be OK given that, but here probably are some pathological cases out there where this may cause some noticeable compile-time regressions.

IIUC the additional cases this catches come mainly from fully unrolled loops. If only we had a better way to conditionally run passes :) Then we would ideally only run SROA (and other additional simplification passes) late on functions that had loops fully unrolled. One lightweight way to do so would be to have loop unroll add an 'additional-simplification' attribute to functions that contain loops which it full unrolled and have SROA just run again late if the attribute is present. A similar approach may be helpful in other places too (e.g. the off-by-default -extra-vectorizer-passes option, https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L774)

lebedev.ri edited the summary of this revision. (Show Details)Sep 20 2020, 3:56 AM
lebedev.ri edited the summary of this revision. (Show Details)Sep 20 2020, 4:01 AM

https://reviews.llvm.org/D68593 added late SROA to NPM so it would be good to enable it for LPM as well.

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

Should be same story for NPM since NPM also enables SROA after unrolling.

A) Commit this patch and start working on general solution for LPM and NPM.

B) Ignore this patch. But after LLVM switches to NPM, you have same issue to solve anyway.

(I'm guessing that we are talking about run-time performance here.)

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression.

Yep, as usual.

Perhaps we do need a conditional extra SROA run.

I think i don't understand the gist

If we don't run it in the cases we expect it wouldn't do anything,
it should still be run in the cases where it *does* do something,
so i'm not sure how conditioning it's run helps with anything.
(well, other than compile-time)

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

Snappy - you mean public https://github.com/google/snappy?

Well, it should be possible to analyze it...

@lebedev.ri any perf data from testsuite/rawspeed?

spatel added a subscriber: spatel.Sep 20 2020, 10:43 AM

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

Snappy - you mean public https://github.com/google/snappy?

Well, it should be possible to analyze it...

@lebedev.ri any perf data from testsuite/rawspeed?

I did look.


This suggests that geomean is -0.8% runtime improvement,
with ups&downs.

But as i have said in the patch's description, i stumbled into this when writing new code, where the effect is much larger.

I assume this makes 1f4e7463b5e3ff654c84371527767830e51db10d redundant?

Yes, see llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp change.

xbolva00 added inline comments.Sep 21 2020, 5:55 AM
clang/test/Misc/loop-opt-setup.c
2

OLDPM?

hliao added a comment.Sep 21 2020, 7:44 AM

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

Snappy - you mean public https://github.com/google/snappy?

Well, it should be possible to analyze it...

@lebedev.ri any perf data from testsuite/rawspeed?

I did look.


This suggests that geomean is -0.8% runtime improvement,
with ups&downs.

But as i have said in the patch's description, i stumbled into this when writing new code, where the effect is much larger.

We probably need to collect more performance data of more benchmarks on more platforms (different targets) to understand the impact. I hesitate to add https://reviews.llvm.org/rG1f4e7463b5e3ff654c84371527767830e51db10d as a generic one as some targets may have regressions due to potentially very different memory access patterns.

This is obviously LGTM from the AMDGPU BE point of view, we did it ourselves.

xbolva00 added a comment.EditedSep 21 2020, 12:02 PM

X86 data collected by @lebedev.ri looks good as well.

@dmgreen for arm?:)

fhahn added a comment.Sep 21 2020, 1:21 PM

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

That's a bit of a surprise to me! It would be great to know how/why running SROA later makes things worse in some cases.

@dmgreen for arm?:)

This would seem more like a good general codegen cleanup than something that would be target dependent. It would probably be more dependent on the code that is being run, than the exact target. But yeah, I ran some baremetal tests. Only one changed (including in all the codesize tests), which was a nasty complicated state machine. It changed between -5% and +3.5%, depend on the cpu it ran on. (Unfortunately it went down on the cores I was more interested in).

I have run into this exact problem recently and very nearly put up a very similar patch for it. In that case it was making intrinsic MVE code much easier to write, as you could rely on loops not clogging up stack array uses after they were unrolled. The differences were not quite in the 160% range, but they would be nice improvements.

So, a little reluctantly, this does sound like a good idea to me. I asked Sanne to run Spec too if he has the time.

SPEC 2017 on AArch64 is neutral on the geomean. The only slight worry is omnetpp with a 1% regression, but this is balanced by a .8% improvement on mcf. Other changes are in the noise.

RKSimon resigned from this revision.Sep 23 2020, 2:18 AM

@MaskRay, @dmgreen & @sanwou01 thank you for running perf experiment!

I think all the results are consistent along the lines of "this sounds
generally reasonable (esp. given that new-pm does it already),
as usual results in ups&downs, but seems to be a (small) geomean win overall".

Does that sound reasonable?
What are the next suggested steps?

Does that sound reasonable?

Yes IMHO.

What are the next suggested steps?

It would be great to isolate and check the cases which regressed a bit.

Does that sound reasonable?

Yes IMHO.

What are the next suggested steps?

It would be great to isolate and check the cases which regressed a bit.

I've rerun my benchmark, and while the results are still the same (runtime geomean -0.53%/-0.40%,
but that obviously depends on the benchmarks), there are some obvious outliers:



I'll try to take a look at that, assuming it's not noise.

lebedev.ri updated this revision to Diff 295817.Oct 2 2020, 6:51 AM

Rebased, NFC

lebedev.ri marked an inline comment as done.Oct 2 2020, 6:53 AM

Does that sound reasonable?

Yes IMHO.

What are the next suggested steps?

It would be great to isolate and check the cases which regressed a bit.

I've rerun my benchmark, and while the results are still the same (runtime geomean -0.53%/-0.40%,
but that obviously depends on the benchmarks), there are some obvious outliers:



I'll try to take a look at that, assuming it's not noise.

Hmm. So i did just take a look, manually re-benchmarking each of these, and while i still see a few small improvements,
the regressions there are all appear to be basically noise. Not what i was hoping for :/

I have tested this patch internally and seen gains and losses. On one document search related benchmark 3~5% improvement. One zippy (snappy) there is 3~5% regression. Perhaps we do need a conditional extra SROA run.

Does it look like one of the scary "branch predictor got confused"/"code layout changed causing different alignment"?

I'm not really sure what are my potential next steps here.

I'm not really sure what are my potential next steps here.

Maybe just add option to disable late SROA?

lebedev.ri updated this revision to Diff 296005.Oct 3 2020, 1:53 PM

Re-fix clang tests

nikic accepted this revision.Oct 4 2020, 1:10 AM

I'll just say this LGTM as it establishes parity with what NewPM has been doing for a while already.

Reviewers, in the future, please reject any patches that only change the NewPM pipeline or only change the LegacyPM pipeline, unless there is some good technical reason to do so. If there was one here, it was not mentioned in the original patch.

clang/test/CodeGenCXX/union-tbaa2.cpp
1

Remove -fno-experimental-new-pass-manager ? It was added to work around the NewPM/LegacyPM discrepancy.

clang/test/Misc/loop-opt-setup.c
2

Remove the NewPM/OldPM tests now that behavior is the same?

This revision is now accepted and ready to land.Oct 4 2020, 1:10 AM
xbolva00 accepted this revision.Oct 4 2020, 1:15 AM

I'll just say this LGTM as it establishes parity with what NewPM has been doing for a while already.

+1

! In D87972#2310595, @nikic wrote:

I'll just say this LGTM as it establishes parity with what NewPM has been doing for a while already.

+1

Thank you.
I'm gonna just land this as is then.