This is an archive of the discontinued LLVM Phabricator instance.

[Pipelines] Introduce SROA after (final, full) loop unrolling
ClosedPublic

Authored by lebedev.ri on Oct 26 2022, 5:53 PM.

Details

Summary

I am surprised that we didn't do this already.

While it would be preferable not to introduce yet another SROA invocation,
and instead move the one from PassBuilder::buildFunctionSimplificationPipeline(),
the existing test coverage says that is a bad idea,
even though it would be fine compile-time-wise: https://llvm-compile-time-tracker.com/compare.php?from=b150d34c47efbd8fa09604bce805c0920360f8d7&to=5a9a5c855158b482552be8c7af3e73d67fa44805&stat=instructions

So instead, I add yet another SROA run.
I have checked, and it needs to come at least after said full loop unrolling,
but I suppose placing it after InstCombine does not hurt.
Surprisingly, this is still fine compile-time-wise: https://llvm-compile-time-tracker.com/compare.php?from=70324cd88328c0924e605fa81b696572560aa5c9&to=fb489bbef687ad821c3173a931709f9cad9aee8a&stat=instructions
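
To make the placement concrete, here is a rough sketch of the kind of change being discussed, assuming the usual new-pass-manager idioms of the time; the pass-manager name, headers, and (lack of) SROA constructor arguments are illustrative rather than copied from the actual diff, and have changed across LLVM versions:

  #include "llvm/IR/PassManager.h"
  #include "llvm/Transforms/InstCombine/InstCombine.h"
  #include "llvm/Transforms/Scalar/SROA.h"
  using namespace llvm;

  // The module optimization pipeline already ends with the final (runtime)
  // LoopUnrollPass followed by a cleanup InstCombinePass, roughly:
  //   OptimizePM.addPass(LoopUnrollPass(...));
  //   OptimizePM.addPass(InstCombinePass());
  // The proposal adds one more SROA run after that point, so that allocas
  // whose indices became constant during the final unroll still get split
  // and promoted.
  FunctionPassManager OptimizePM; // stands in for the pipeline being built
  OptimizePM.addPass(SROAPass());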

Now, something in that link makes me cringe.
Compare it with https://llvm-compile-time-tracker.com/compare.php?from=70324cd88328c0924e605fa81b696572560aa5c9&to=d534377c0324a63ac26b990967577562e28a148f&stat=instructions
Those two commits share the same base and do not differ functionally,
yet that +0.23% 'regression' shows up in one of them.

I've encountered this in real code; SROA-after-final-loop-unrolling.ll has been reduced from https://godbolt.org/z/fsdMhETh3

Diff Detail

Event Timeline

lebedev.ri created this revision. Oct 26 2022, 5:53 PM
Herald added a project: Restricted Project. Oct 26 2022, 5:53 PM
lebedev.ri requested review of this revision. Oct 26 2022, 5:53 PM

Some context here: LLVM performs unrolling in two places. There is a full unroll as part of the function simplification pipeline (i.e. interleaved with inlining) and a runtime unroll at the end of the module simplification pipeline. The general expectation is that if a loop is going to be fully unrolled, then this should happen during the full unroll pass. We run SROA after full unroll, as well as a significant further optimization pipeline, and do this pre-inlining, so the inlining cost model is correct. Conversely, the final runtime unroll is not supposed to expose significant further optimization opportunities -- it is a late pre-codegen pass.

Of course, runtime unrolling can end up performing full unrolling, for example if the trip count could not yet be determined at the time of full unrolling, but can be determined at the time of runtime unrolling, and this patch can help in such situations. I have encountered cases like this quite a few times with Rust iterator adaptors, but my conclusion from analyzing them was generally that the proper way to address this is to make the trip count computable at full unroll, because this integrates properly with the remaining pipeline (and is the reason why we have that separate full unroll pass in the first place). For example, I've had cases where the code after unrolling reduced to the moral equivalent of "a + b", which is something we want to happen before inlining, not after. The patch at https://reviews.llvm.org/D133192 is motivated by one such case (it allows LICM to do more scalar promotion, which allows SCEV to compute the trip count before full unroll, which ends up reducing the loop to something trivial).
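
To illustrate the kind of case being described (a hypothetical C++ analogue, not taken from this review or from the Rust code in question): once the trip count is a known constant at full-unroll time, the loop below collapses, via full unroll plus SROA/mem2reg, to the moral equivalent of xs[0] + xs[1], and the inliner then costs an accurately tiny function:

  #include <array>

  // The trip count (2) is known before the full-unroll pass runs, so the
  // loop is unrolled early, the accumulator is trivially promoted, and what
  // the inliner sees is essentially "a + b".
  int sum_pair(const std::array<int, 2> &xs) {
    int acc = 0;
    for (int x : xs)
      acc += x;
    return acc;
  }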

As such, I'm not convinced that this is the right way to go about it. I would suggest to at least analyze the pre-full-unroll IR in your motivating cases and check whether there is anything obvious that can be done to already enable unrolling at that stage, rather than in the late pipeline.

nikic added a comment. Oct 27 2022, 1:02 AM

Those two commits share the same base and do not differ functionally, yet that +0.23% 'regression' shows up in one of them.

Yes, the instructions metric is more noisy than I would like nowadays. The consumer-typeset, ClamAV and SPASS benchmarks in particular have this kind of bimodal noise, for whatever reason. I should probably switch the default metric to instructions:u, which doesn't have this problem.

As such, I'm not convinced that this is the right way to go about it. I would suggest to at least analyze the pre-full-unroll IR in your motivating cases and check whether there is anything obvious that can be done to already enable unrolling at that stage, rather than in the late pipeline.

I did. This is not a SCEV/LoopUnroll limitation, but the usual genius of our arbitrarily random cut-offs.
The alternative is to bump "unroll-max-iteration-count-to-analyze" to at least 17 and "unroll-threshold-aggressive" to at least 337.
https://godbolt.org/z/PMe6qhaKs
I can't imagine this won't have worse implications for the compile time. Is that preferred?

lebedev.ri added a comment. Edited Oct 27 2022, 7:23 AM


Also, why "unroll-max-iteration-count-to-analyze" is iteration-based?
After a brief look, i'm not seeing a complexity/instruction count cut-off (other than MaxUnrolledLoopSize)
which means it would discard 11-iteration 2-instruction loop, but accept 10-iteration 1000-instruction loop?
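
As a hypothetical illustration of that asymmetry (the loops below are made up, not from the review): a purely iteration-based gate (say, 10) refuses to even analyze the first loop despite its tiny body, while the second one, with ten large iterations, is still considered and only bounded later by the unrolled-size threshold:

  // 11 iterations, roughly 2 instructions of body: rejected up front by an
  // iteration-count cut-off, before any cost analysis runs.
  int tiny_body_many_iters(const int *p) {
    int s = 0;
    for (int i = 0; i < 11; ++i)
      s += p[i];
    return s;
  }

  // 10 iterations, but a much bigger body: passes the iteration gate and is
  // only limited afterwards by the total unrolled-size threshold.
  int big_body_few_iters(const int *p) {
    int s = 0;
    for (int i = 0; i < 10; ++i) {
      s += p[i] * 3;
      s ^= (p[i] >> 1) + i;
      s -= p[i] & 7;
      s *= 2;
      // ...imagine many more instructions here...
    }
    return s;
  }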

Conversely, the final runtime unroll is not supposed to expose significant further optimization opportunities -- it is a late pre-codegen pass.

Also, this reasoning is not correct: full unrolling does not perform partial unrolling,
and yet partial unrolling absolutely can expose further SROA opportunities;
see SROA-after-final-loop-unrolling-2.ll.
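
For a concrete shape of code where this can happen (hypothetical, not the reduced test case): the loop-variant index into the small local array below blocks SROA, but once the loop is partially unrolled by the array size, each copy of the body indexes a constant slot, which is exactly what SROA needs to split the aggregate into scalars:

  // Whether the real pipeline catches this depends on the chosen unroll
  // factor and cost model; the point is only that after unrolling the
  // indices can become constants.
  int two_bucket_sum(const int *p, int n) {
    int buckets[2] = {0, 0};      // small local aggregate
    for (int i = 0; i < n; ++i)
      buckets[i & 1] += p[i];     // variable index defeats SROA here
    return buckets[0] - buckets[1];
  }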


Post-meeting ping.
It would seem nobody has strong feelings on this.
Would anyone like to block this based on the compile time impact?

arsenm added a subscriber: arsenm. Nov 14 2022, 3:53 PM

I think SROA after unroll is important. It's practically the main reason to unroll.

lebedev.ri retitled this revision from [Pipelines] Introduce SROA after (full) loop unrolling to [Pipelines] Introduce SROA after (final, full) loop unrolling.
lebedev.ri added a reviewer: arsenm.

I think SROA after unroll is important. It's practically the main reason to unroll.

Yay, finally some recognition.
Would you like to stamp? :)

Herald added a project: Restricted Project. Nov 14 2022, 4:53 PM


llvm/lib/Passes/PassBuilderPipelines.cpp
1201–1202

As the cross-revision diff shows, running SROA before InstCombine, not unexpectedly, makes more sense.

IIUC the compile-time impact of adding another SROA run (the one outside LTO) is negligible?

Regarding the principle of adding another pass, and where in the pipeline, we're still on a case-by-case basis. We had a discussion/round table at the LLVM Dev Meeting on documenting the current new-pass-manager pipeline, precisely for understanding pass dependencies and being able to get resolutions on cases such as this. I'm still putting together the summary for that, but it should be up on Discourse this week.

Testing this change in parallel - I'll follow up tomorrow.

IIUC the compile-time impact of adding another SROA run (the one outside LTO) is negligible?

Yup.

Regarding the principle of adding another pass, and where in the pipeline, we're still on a case-by-case basis.

We really can't achieve the same effect here in any other reasonable way without penalizing the inliner pipeline.

Testing this change in parallel - I'll follow up tomorrow.

Thanks.

spatel accepted this revision. Nov 16 2022, 1:55 PM

LGTM based on the timing, results, and alternatives discussed.
There's some testing in progress according to previous comments, so best to wait for that to finish in case it turns up anything new.

llvm/lib/Passes/PassBuilderPipelines.cpp
1123

Is there a reason to put this down here vs. tacking it on the end of the previous IsFullLTO block?
If LoopUnroll is the reason for adding SROA, then mention that specifically in the comment?

IIRC, all of the FullLTO predicates in this set of passes were questionable (see TODO comment above this function). They just accumulated because the code was duplicated and diverged over time.

This revision is now accepted and ready to land. Nov 16 2022, 1:55 PM

LGTM based on the timing, results, and alternatives discussed.

Thank you for the review.

There's some testing in progress according to previous comments, so best to wait for that to finish in case it turns up anything new.

llvm/lib/Passes/PassBuilderPipelines.cpp
1123

Is there a reason to put this down here vs. tacking it on the end of the previous IsFullLTO block?

I don't have any particular reason for this. Will change.

If LoopUnroll is the reason for adding SROA, then mention that specifically in the comment?

Will do.

IIRC, all of the FullLTO predicates in this set of passes were questionable (see TODO comment above this function). They just accumulated because the code was duplicated and diverged over time.

Yeah, I remember all that. This is indeed ugly.

lebedev.ri marked an inline comment as done.

Adjust the full LTO pipeline, for symmetry with the non-full-LTO pipeline.
Looks like there is a test-coverage shortage.

As far as performance, this looks fine. Not seeing measurable gains.

As far as performance, this looks fine. Not seeing measurable gains.

Thank you for checking!
Is the evaluation still ongoing, or is this the green light to go ahead?

Green light perf-wise.
I cannot comment on whether the position is "the right" one though. I'm deferring to the other reviewers.

Green light perf-wise.
I cannot comment on whether the position is "the right" one though. I'm deferring to the other reviewers.

Thank you!

This revision was landed with ongoing or failed builds. Nov 17 2022, 10:35 AM
This revision was automatically updated to reflect the committed changes.
dyung added a subscriber: dyung. Nov 21 2022, 9:03 PM

Hi @lebedev.ri, we recently noticed a failure in one of our internal tests which I bisected back to your change. I have filed the details in Issue #59122. Can you take a look?