We've got a performance regression after https://reviews.llvm.org/D115261.
Despite the loop being vectorized, unrolling is still required.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Add a test?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
7748 | Original code had a CanonicalIVStartValue check, so to restore it this should look like: `if (CanonicalIVStartValue || !UP.unrollLoopEpilogues)`. |
That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?
I actually meant a simpler formal test to check that unrolling works on AMDGPU. I am not sure @alex-t will be able to produce a reasonable test showing the performance problem. The loop that triggered the regression is quite complex to begin with.
As far as I understand, before https://reviews.llvm.org/D115261 unrolling was disabled for the epilogue loops only.
https://reviews.llvm.org/D115261 disables unrolling for any loop that was vectorized.
The regression is caused by the nature of the GPU target architecture: it has massively parallel hardware and very costly branches.
Loop vectorization on the high-level IR yields almost nothing, but loop unrolling is always crucial for performance.
The review title was changed to better reflect the changes being submitted.
The variable name was changed accordingly.
A test was added to check that the
llvm.loop.unroll.runtime.disable
metadata is not added to the main loop but is added to the scalar epilogue.
This exactly reflects the desired behavior.
@alex-t The premise of the change was that if a loop is vectorized, the vectorizer is also responsible for unrolling it via interleaving. The question is why the vectorizer failed to unroll the loop in your workload. We should understand that before adding additional options.
It did not fail, in fact. "Unrolling via interleaving" was deliberately disabled for the AMDGPU target since it led to an uncontrolled register pressure (RP) increase.
The corresponding change was made in https://reviews.llvm.org/D122850.
For CPUs, you decide on the interleave count by subtracting the number of loop invariants from the number of available registers and dividing the result by the register pressure per iteration for the given register class. This lets us estimate the number of computation flows that may run simultaneously.
For a GPU, which is natively a SIMT machine, this high-level estimation simply does not make sense.
LoopUnroll is controllable and lets us reasonably trade off between the unroll size and the register pressure.
Thanks for sharing. It still sounds to me like the underlying issue is proper interleaving cost-modeling in LV for those cases, and that should be the long-term fix. Allowing unrolling again seems mostly like a short-term workaround rather than a proper fix. That is probably fine, but we should aim to improve the cost modeling. This would need someone familiar with AMDGPU to drive.
Here are some observations about the performance regression that we see once the unrolling is disabled by the vectorizer.
In a specific case, the best performance occurs when the loop is unrolled by 8 iterations (I haven't tried more than 8). It's a small loop, and after unrolling, the loads in the unrolled loop are all moved to the top of the loop, which increases the distance between the defs and their uses.
AMDGPU enables run-time loop unrolling only in specific cases, as enabling it unconditionally is often not profitable.
Increasing interleaving doesn't improve vectorization for this loop. The only effect is to perform unrolling, which would be done by the run-time loop unrolling pass if it weren't disabled. Since a target hook is used to control run-time unrolling, AMDGPU can be very specific about determining the unrolling profitability.
To regain performance through interleaving, getNumberOfRegisters() needs to return at least 20, as the best performance is achieved by unrolling 8 iterations. In the past, AMDGPU changed getNumberOfRegisters() to a smaller value due to performance regressions caused by register pressure. The challenge with increasing getNumberOfRegisters() to enable interleaving is that it will introduce other regressions, as more loops will be unrolled and register pressure will increase.
Frankly speaking, the whole idea of loop vectorization for GPUs seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and thereby unroll the loop by the vector factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could not get any more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I have little experience with the LV.
Right, we do not have vector operations in the same sense as a SIMD machine, nor do we have vector registers in the same sense. Well, almost: we do have vector (meaning wide, multi-component) loads and stores, and we have some v2 16-bit packed and v2 32-bit packed operations on some subtargets. However, this is not the full set of packed ALU operations needed for vectorization and may not warrant a full loop vectorizer (the SLP vectorizer is still in use). In any case, there are no SIMD-style vector registers to control the interleave. Even though we use register tuples for wide operations, these are not separate registers like, for example, the XMMs on x86; they cut into the same register budget. So increasing the 'number of vector registers' not only means little for our target but also inevitably leads to performance regressions.
All in all, the loop vectorizer does not do much for AMDGPU and cannot be the pass driving unrolling. It still does something, as said above, so whether to keep it on or off is a completely separate question, but it is certainly not an unroll driver here.
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
572 | "Don't disable"... | |
573 | Capitalize 'unroll'. | |
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll | ||
16 | I'd expect the test to show no unroll disable metadata? |
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll | ||
---|---|---|
16 |
This checks that epilogue loop unrolling is still disabled, as it was before. As I remember, you insisted on keeping this behavior: "Original code had CanonicalIVStartValue check. So to restore it this should look: if (CanonicalIVStartValue || !UP.unrollLoopEpilogues)". So we check not the absence/presence of metadata but its structure. We also check that "!0" (without unroll-disable) is set on the middle block (the vectorized main loop) but "!2" (with unroll-disable) is set on the scalar epilogue loop. |
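For illustration, the metadata shape being checked looks roughly like this (a hand-written sketch with assumed value numbers, not the actual contents of the test file):

```llvm
; Vectorized main loop: its !llvm.loop node (!0) carries no
; llvm.loop.unroll.runtime.disable entry, so runtime unrolling stays allowed.
  br i1 %cmp, label %middle.block, label %vector.body, !llvm.loop !0
; Scalar epilogue loop: its !llvm.loop node (!2) does carry the disable entry.
  br i1 %cmp.ep, label %exit, label %scalar.body, !llvm.loop !2

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.isvectorized", i32 1}
!2 = distinct !{!2, !1, !3}
!3 = !{!"llvm.loop.unroll.runtime.disable"}
```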
Interesting, thanks for elaborating. It seems like the role/benefit of LV here is a bit unclear, and in the long term it would be good to disable it entirely if it does more harm than good.
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll | ||
---|---|---|
2 | Can you precommit the test, so that only the functional diff remains in the actual patch here? | |
25 | Is it possible to fold the whole loop body into a single block to simplify the test? |
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll | ||
---|---|---|
2 |
I am not sure if I understand correctly. Do you want me to split the commit into two parts? |