This is an archive of the discontinued LLVM Phabricator instance.

Don't disable loop unroll for vectorized loops on AMDGPU target
ClosedPublic

Authored by alex-t on Apr 26 2023, 12:10 PM.

Details

Summary

We've got a performance regression after https://reviews.llvm.org/D115261.
Despite the loop being vectorized, unrolling is still required.

Diff Detail

Event Timeline

alex-t created this revision.Apr 26 2023, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2023, 12:10 PM
alex-t requested review of this revision.Apr 26 2023, 12:10 PM

Add a test?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7748

The original code had a CanonicalIVStartValue check. To restore it, this should look like: if (CanonicalIVStartValue || !UP.unrollLoopEpilogues).
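For illustration, the suggested guard can be modeled as a tiny predicate. This is a hedged sketch, not the actual LLVM code: the function name and parameter names are hypothetical, chosen to mirror the identifiers in the quoted condition.

```python
def disable_runtime_unroll(canonical_iv_start_value: bool,
                           unroll_loop_epilogues: bool) -> bool:
    """Hypothetical model of the suggested condition:
    if (CanonicalIVStartValue || !UP.unrollLoopEpilogues).

    The runtime-unroll-disable metadata is emitted either when the loop
    has a canonical IV start value (i.e. it is an epilogue loop) or when
    the target asked not to unroll epilogue loops.
    """
    return canonical_iv_start_value or not unroll_loop_epilogues
```

Under this model, the metadata is added to the scalar epilogue even when the target allows unrolling of vectorized loops, which matches the behavior the test below verifies.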

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

I actually meant a simpler formal test to check that unrolling works on AMDGPU. I am not sure @alex-t will be able to produce a reasonable test showing the performance problem. The loop in question that triggered the regression is quite complex to begin with.

alex-t retitled this revision from Must unroll epilogue loops after vectorization on AMDGPU target to Not disable loop unroll for vectorized loops on AMDGPU target.May 3 2023, 11:27 AM

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

As far as I understand, before https://reviews.llvm.org/D115261, unrolling was disabled only for the epilogue loops.
https://reviews.llvm.org/D115261 disables unrolling for any loop that was vectorized.

The regression is caused by the nature of the GPU target architecture: it has massively parallel hardware and very costly branches.
Loop vectorization on high-level IR yields almost nothing, but loop unrolling is always crucial for performance.

alex-t updated this revision to Diff 519844.May 5 2023, 7:11 AM

The review title was changed to better reflect the changes being submitted.
The variable name was changed accordingly.
A test was added to check that the

llvm.loop.unroll.runtime.disable

metadata is not added to the main loop but is added to the scalar epilogue.
This exactly reflects the desired behavior.

nikic added a subscriber: nikic.May 5 2023, 7:25 AM

@alex-t The premise of the change was that if a loop is vectorized, the vectorizer is also responsible for unrolling it via interleaving. The question is why the vectorizer failed to unroll the loop in your workload. We should understand that before adding additional options.

bcahoon added a subscriber: bcahoon.May 6 2023, 8:13 AM
alex-t added a comment.EditedMay 8 2023, 6:14 PM
In D149281#4322079, @nikic wrote: The question is why the vectorizer failed to unroll the loop in your workload.

In fact, it did not fail. "Unrolling via interleaving" was deliberately disabled for the AMDGPU target since it led to uncontrolled register pressure (RP) increases.
The corresponding change was made in https://reviews.llvm.org/D122850.

For a CPU, you decide on the interleave count by subtracting the number of loop invariants from the number of available registers and dividing the result by the RP for the given class. This allows us to estimate the number of computation flows that may run simultaneously.

For a GPU, which is natively a SIMT machine, this high-level estimation simply does not make sense.
The LoopUnroll pass is controllable and lets us reasonably trade off between the unroll size and the RP.
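The CPU-side heuristic described above can be sketched roughly as follows. This is a simplified model, not LLVM's actual logic; all names are hypothetical.

```python
def estimate_interleave_count(available_regs: int,
                              num_loop_invariants: int,
                              reg_pressure_per_flow: int) -> int:
    """Rough model of the CPU interleaving estimate described above:
    registers left over after loop invariants are accounted for,
    divided by the register pressure of one computation flow.
    The real LLVM cost model is considerably more involved.
    """
    headroom = available_regs - num_loop_invariants
    if headroom <= 0 or reg_pressure_per_flow <= 0:
        return 1  # no headroom: run a single flow, do not interleave
    return max(1, headroom // reg_pressure_per_flow)
```

For example, with 32 available registers, 4 loop invariants, and a per-flow pressure of 7, roughly four flows fit, so an interleave count of 4 would be estimated. On a SIMT GPU there is no analogous per-flow register budget at this level, which is the point being made above.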

fhahn added a comment.May 9 2023, 4:43 AM
In D149281#4322079, @nikic wrote: The question is why the vectorizer failed to unroll the loop in your workload.

In fact, it did not fail. "Unrolling via interleaving" was deliberately disabled for the AMDGPU target since it led to uncontrolled register pressure (RP) increases.
The corresponding change was made in https://reviews.llvm.org/D122850.

For a CPU, you decide on the interleave count by subtracting the number of loop invariants from the number of available registers and dividing the result by the RP for the given class. This allows us to estimate the number of computation flows that may run simultaneously.

For a GPU, which is natively a SIMT machine, this high-level estimation simply does not make sense.
The LoopUnroll pass is controllable and lets us reasonably trade off between the unroll size and the RP.

Thanks for sharing. It still sounds to me that the underlying issue is proper interleaving cost-modeling in LV for those cases and that should be the long-term fix. Allowing unrolling again seems mostly like a short-term workaround, rather than a proper fix. Which is probably fine, but we should aim to improve the cost-modeling. This would need someone familiar with AMDGPU to drive.

Here are some observations about the performance regression that we see once the unrolling is disabled by the vectorizer.

In a specific case, the best performance occurs when the loop is unrolled by 8 iterations (I haven't tried more than 8). It's a small loop, and after unrolling, the loads in the unrolled loop are all moved to the top of the loop, which increases the time between the defs and their uses.
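The shape described above can be illustrated with a hypothetical scalar reduction unrolled by 8: the grouped loads at the top of the body lengthen the distance between each load and its use, which can help hide memory latency, and the leftover iterations run in a scalar epilogue.

```python
def sum_unrolled_by_8(a: list) -> int:
    """Hypothetical illustration of an 8x-unrolled reduction loop.
    Not the actual loop from the regression, which is far more complex.
    """
    n = len(a)
    total = 0
    i = 0
    while i + 8 <= n:
        # Grouped "loads": all eight reads happen before any use,
        # increasing the def-to-use distance within the body.
        v0, v1, v2, v3 = a[i], a[i + 1], a[i + 2], a[i + 3]
        v4, v5, v6, v7 = a[i + 4], a[i + 5], a[i + 6], a[i + 7]
        # "Uses" of the loaded values.
        total += v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7
        i += 8
    while i < n:  # scalar epilogue for the remaining iterations
        total += a[i]
        i += 1
    return total
```
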

AMDGPU enables run-time loop unrolling only in specific cases, as enabling it unconditionally is often not profitable.

Increasing interleaving doesn't improve vectorization for this loop. The only effect is to perform unrolling, which would be done by the run-time loop unrolling pass if it weren't disabled. Since a target hook is used to control run-time unrolling, AMDGPU can be very specific about determining the unrolling profitability.

In order to regain performance through interleaving, getNumberOfRegisters() needs to be at least 20, as the best performance is achieved by unrolling 8 iterations. In the past, AMDGPU changed getNumberOfRegisters() to a smaller value because of performance regressions caused by register pressure. The challenge with increasing getNumberOfRegisters() to enable interleaving is that it will introduce other regressions, as more loops will be unrolled and register pressure will increase.

Thanks for sharing. It still sounds to me that the underlying issue is proper interleaving cost-modeling in LV for those cases and that should be the long-term fix. Allowing unrolling again seems mostly like a short-term workaround, rather than a proper fix. Which is probably fine, but we should aim to improve the cost-modeling. This would need someone familiar with AMDGPU to drive.

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Right, we do not have vector operations in the same sense as a SIMD machine, nor do we have vector registers in the same sense. Well, almost: we have vector (meaning wide, multicomponent) loads and stores, and we have SOME v2 16-bit packed and v2 32-bit packed operations on some subtargets. But this is not the full set of packed ALU operations needed for vectorization and may not warrant a full loop vectorizer (the SLP vectorizer is still in use). In any case, there are no SIMD-style vector registers to control the interleave. Even though we use register tuples for wide operations, these are not separate registers like, for example, XMMs on x86, and they cut into the same register budget. So increasing the 'number of vector registers' not only does not mean much for our target but also inevitably leads to performance regressions.

All in all, the loop vectorizer does not do much for AMDGPU and cannot be the pass driving unrolling. It still does something, as said above, so whether to keep it on or off is a completely separate question, but it is certainly not an unroll driver here.

llvm/include/llvm/Analysis/TargetTransformInfo.h
572

"Don't disable"...

573

Capitalize 'unroll'.

llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
16

I'd expect the test to show no unroll disable metadata?

alex-t updated this revision to Diff 524654.May 23 2023, 4:49 AM

Minor changes: a comment was corrected and a class member name was capitalized.

alex-t added inline comments.May 23 2023, 5:00 AM
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
16

I'd expect the test to show no unroll disable metadata?

This checks that epilogue loop unrolling is still disabled, as it was before. As I remember, you insisted on keeping this behavior.

"Original code had CanonicalIVStartValue check. So to restore it this should look: if (CanonicalIVStartValue || !UP.unrollLoopEpilogues)"

So we check not the presence/absence of the metadata but its structure. We also check that "!0" (without unroll disable) is set on the middle block (the vectorized main loop) but "!2" (with unroll disable) is set on the scalar epilogue loop.

alex-t retitled this revision from Not disable loop unroll for vectorized loops on AMDGPU target to Don't disable loop unroll for vectorized loops on AMDGPU target.May 23 2023, 5:01 AM
alex-t updated this revision to Diff 524671.May 23 2023, 5:54 AM

clang-format error corrected

This revision is now accepted and ready to land.May 23 2023, 12:17 PM

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Right, we do not have vector operations in the same sense as a SIMD machine, nor do we have vector registers in the same sense. Well, almost: we have vector (meaning wide, multicomponent) loads and stores, and we have SOME v2 16-bit packed and v2 32-bit packed operations on some subtargets. But this is not the full set of packed ALU operations needed for vectorization and may not warrant a full loop vectorizer (the SLP vectorizer is still in use). In any case, there are no SIMD-style vector registers to control the interleave. Even though we use register tuples for wide operations, these are not separate registers like, for example, XMMs on x86, and they cut into the same register budget. So increasing the 'number of vector registers' not only does not mean much for our target but also inevitably leads to performance regressions.

All in all, the loop vectorizer does not do much for AMDGPU and cannot be the pass driving unrolling. It still does something, as said above, so whether to keep it on or off is a completely separate question, but it is certainly not an unroll driver here.

Interesting, thanks for elaborating. It seems like the role/benefit of LV here is a bit unclear, and in the long term it would be good to disable it entirely if it does more harm than good.

llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
2

Can you precommit the test, so we have only the functional diff in the actual patch here?

25

Is it possible to fold the whole loop body into a single block to simplify the test?

alex-t added inline comments.May 24 2023, 6:58 AM
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
2

Can you precommit the test, so we have only the functional diff in the actual patch here?

I am not sure I understand correctly. Do you want me to split this into two parts:
commit the test in a separate commit, leaving only the diff in LoopVectorize.cpp for one more commit?

alex-t updated this revision to Diff 525167.May 24 2023, 7:12 AM

Test simplified

alex-t marked 3 inline comments as done.May 24 2023, 7:13 AM