This is an archive of the discontinued LLVM Phabricator instance.

Don't disable loop unroll for vectorized loops on AMDGPU target
ClosedPublic

Authored by alex-t on Apr 26 2023, 12:10 PM.

Details

Summary

We've got a performance regression after https://reviews.llvm.org/D115261.
Despite the loop being vectorized, unrolling is still required.

Diff Detail

Event Timeline

alex-t created this revision.Apr 26 2023, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2023, 12:10 PM
alex-t requested review of this revision.Apr 26 2023, 12:10 PM

Add a test?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7748

The original code had a CanonicalIVStartValue check. To restore it, this should look like: if (CanonicalIVStartValue || !UP.unrollLoopEpilogues).
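For illustration, the suggested guard can be modeled as a tiny predicate. This is a hedged sketch, not the actual LLVM code: the function name and parameter names are hypothetical, chosen to mirror the identifiers in the quoted condition.

```python
def disable_runtime_unroll(canonical_iv_start_value: bool,
                           unroll_loop_epilogues: bool) -> bool:
    """Hypothetical model of the suggested condition:
    if (CanonicalIVStartValue || !UP.unrollLoopEpilogues).

    The runtime-unroll-disable metadata is emitted either when the loop
    has a canonical IV start value (i.e. it is an epilogue loop) or when
    the target asked not to unroll epilogue loops.
    """
    return canonical_iv_start_value or not unroll_loop_epilogues
```

Under this model, the metadata is added to the scalar epilogue even when the target allows unrolling of vectorized loops, which matches the behavior the test below verifies.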

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

I actually meant a simpler formal test to check that unrolling works on AMDGPU. I am not sure @alex-t will be able to produce a reasonable test showing the performance problem. The loop in question that triggered the regression is quite complex to begin with.

alex-t retitled this revision from Must unroll epilogue loops after vectorization on AMDGPU target to Not disable loop unroll for vectorized loops on AMDGPU target.May 3 2023, 11:27 AM

Add a test?

That would be helpful. It would be good to understand why runtime unrolling is needed here. Does the interleave heuristic not kick in?

As far as I understand, before https://reviews.llvm.org/D115261, unrolling was disabled only for the epilogue loops.
https://reviews.llvm.org/D115261 disables unrolling for any loop that was vectorized.

The regression is caused by the nature of the GPU target architecture: it has massively parallel hardware and very costly branches.
Loop vectorization on high-level IR yields almost nothing, but loop unrolling is always crucial for performance.

alex-t updated this revision to Diff 519844.May 5 2023, 7:11 AM

The review title was changed to better reflect the changes being submitted.
The variable name was changed accordingly.
A test was added to check that the

llvm.loop.unroll.runtime.disable

metadata is not added to the main loop but is added to the scalar epilogue.
This exactly reflects the desired behavior.

nikic added a subscriber: nikic.May 5 2023, 7:25 AM

@alex-t The premise of the change was that if a loop is vectorized, the vectorizer is also responsible for unrolling it via interleaving. The question is why the vectorizer failed to unroll the loop in your workload. We should understand that before adding additional options.

bcahoon added a subscriber: bcahoon.May 6 2023, 8:13 AM
alex-t added a comment.EditedMay 8 2023, 6:14 PM
In D149281#4322079, @nikic wrote: The question is why the vectorizer failed to unroll the loop in your workload.

In fact, it did not fail. "Unrolling via interleaving" was deliberately disabled for the AMDGPU target since it led to uncontrolled register pressure (RP) increases.
The corresponding change was made in https://reviews.llvm.org/D122850.

For a CPU, you decide on the interleave count by subtracting the number of loop invariants from the number of available registers and dividing the result by the RP for the given class. This allows us to estimate the number of computation flows that may run simultaneously.

For a GPU, which is natively a SIMT machine, this high-level estimation simply does not make sense.
The LoopUnroll pass is controllable and lets us reasonably trade off between the unroll size and the RP.
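The CPU-side heuristic described above can be sketched roughly as follows. This is a simplified model, not LLVM's actual logic; all names are hypothetical.

```python
def estimate_interleave_count(available_regs: int,
                              num_loop_invariants: int,
                              reg_pressure_per_flow: int) -> int:
    """Rough model of the CPU interleaving estimate described above:
    registers left over after loop invariants are accounted for,
    divided by the register pressure of one computation flow.
    The real LLVM cost model is considerably more involved.
    """
    headroom = available_regs - num_loop_invariants
    if headroom <= 0 or reg_pressure_per_flow <= 0:
        return 1  # no headroom: run a single flow, do not interleave
    return max(1, headroom // reg_pressure_per_flow)
```

For example, with 32 available registers, 4 loop invariants, and a per-flow pressure of 7, roughly four flows fit, so an interleave count of 4 would be estimated. On a SIMT GPU there is no analogous per-flow register budget at this level, which is the point being made above.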

fhahn added a comment.May 9 2023, 4:43 AM
In D149281#4322079, @nikic wrote: The question is why the vectorizer failed to unroll the loop in your workload.

In fact, it did not fail. "Unrolling via interleaving" was deliberately disabled for the AMDGPU target since it led to uncontrolled register pressure (RP) increases.
The corresponding change was made in https://reviews.llvm.org/D122850.

For a CPU, you decide on the interleave count by subtracting the number of loop invariants from the number of available registers and dividing the result by the RP for the given class. This allows us to estimate the number of computation flows that may run simultaneously.

For a GPU, which is natively a SIMT machine, this high-level estimation simply does not make sense.
The LoopUnroll pass is controllable and lets us reasonably trade off between the unroll size and the RP.

Thanks for sharing. It still sounds to me that the underlying issue is proper interleaving cost-modeling in LV for those cases and that should be the long-term fix. Allowing unrolling again seems mostly like a short-term workaround, rather than a proper fix. Which is probably fine, but we should aim to improve the cost-modeling. This would need someone familiar with AMDGPU to drive.

Here are some observations about the performance regression that we see once the unrolling is disabled by the vectorizer.

In a specific case, the best performance occurs when the loop is unrolled by 8 iterations (I haven't tried more than 8). It's a small loop, and after unrolling, the loads in the unrolled loop are all moved to the top of the loop, which increases the time between the defs and their uses.
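The shape described above can be illustrated with a hypothetical scalar reduction unrolled by 8: the grouped loads at the top of the body lengthen the distance between each load and its use, which can help hide memory latency, and the leftover iterations run in a scalar epilogue.

```python
def sum_unrolled_by_8(a: list) -> int:
    """Hypothetical illustration of an 8x-unrolled reduction loop.
    Not the actual loop from the regression, which is far more complex.
    """
    n = len(a)
    total = 0
    i = 0
    while i + 8 <= n:
        # Grouped "loads": all eight reads happen before any use,
        # increasing the def-to-use distance within the body.
        v0, v1, v2, v3 = a[i], a[i + 1], a[i + 2], a[i + 3]
        v4, v5, v6, v7 = a[i + 4], a[i + 5], a[i + 6], a[i + 7]
        # "Uses" of the loaded values.
        total += v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7
        i += 8
    while i < n:  # scalar epilogue for the remaining iterations
        total += a[i]
        i += 1
    return total
```
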

AMDGPU enables run-time loop unrolling only in specific cases, as enabling it unconditionally is often not profitable.

Increasing interleaving doesn't improve vectorization for this loop. The only effect is to perform unrolling, which would be done by the run-time loop unrolling pass if it weren't disabled. Since a target hook is used to control run-time unrolling, AMDGPU can be very specific about determining the unrolling profitability.

In order to regain performance through interleaving, getNumberOfRegisters() needs to be at least 20, as the best performance is achieved by unrolling 8 iterations. In the past, AMDGPU changed getNumberOfRegisters() to a smaller value because of performance regressions caused by register pressure. The challenge with increasing getNumberOfRegisters() to enable interleaving is that it will introduce other regressions, as more loops will be unrolled and register pressure will increase.

Thanks for sharing. It still sounds to me that the underlying issue is proper interleaving cost-modeling in LV for those cases and that should be the long-term fix. Allowing unrolling again seems mostly like a short-term workaround, rather than a proper fix. Which is probably fine, but we should aim to improve the cost-modeling. This would need someone familiar with AMDGPU to drive.

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Right, we do not have vector operations in the same sense as a SIMD machine, nor do we have vector registers in the same sense. Well, almost: we have vector (meaning wide, multicomponent) loads and stores, and we have SOME v2 16-bit packed and v2 32-bit packed operations on some subtargets. But this is not the full set of packed ALU operations needed for vectorization and may not warrant a full loop vectorizer (the SLP vectorizer is still in use). In any case, there are no SIMD-style vector registers to control the interleave. Even though we use register tuples for wide operations, these are not separate registers like, for example, XMMs on x86, and they cut into the same register budget. So increasing the 'number of vector registers' not only does not mean much for our target but also inevitably leads to performance regressions.

All in all, the loop vectorizer does not do much for AMDGPU and cannot be the pass driving unrolling. It still does something, as said above, so whether to keep it on or off is a completely separate question, but it is certainly not an unroll driver here.

llvm/include/llvm/Analysis/TargetTransformInfo.h
572

"Don't disable"...

573

Capitalize 'unroll'.

llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
16

I'd expect the test to show no unroll disable metadata?

alex-t updated this revision to Diff 524654.May 23 2023, 4:49 AM

Minor changes: a comment was corrected and a class member name was capitalized.

alex-t added inline comments.May 23 2023, 5:00 AM
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
16

I'd expect the test to show no unroll disable metadata?

This checks that epilogue loop unrolling is still disabled, as it was before. As I remember, you insisted on keeping this behavior.

"Original code had CanonicalIVStartValue check. So to restore it this should look: if (CanonicalIVStartValue || !UP.unrollLoopEpilogues)"

So we check not the presence/absence of the metadata but its structure. We also check that "!0" (without unroll disable) is set on the middle block (the vectorized main loop) but "!2" (with unroll disable) is set on the scalar epilogue loop.

alex-t retitled this revision from Not disable loop unroll for vectorized loops on AMDGPU target to Don't disable loop unroll for vectorized loops on AMDGPU target.May 23 2023, 5:01 AM
alex-t updated this revision to Diff 524671.May 23 2023, 5:54 AM

clang-format error corrected

This revision is now accepted and ready to land.May 23 2023, 12:17 PM

Frankly speaking, the whole idea of loop vectorization for a GPU seems like nonsense to me, although I am not an expert in loop optimizations.
A GPU has no wide vector registers that could be used to process several scalar values in one HW cycle and, by this, unroll the loop by the vectorization factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for divergent values, and all threads operate on the same shared scalar value in case it is uniform.
If we have a completely uniform input program (no dependence on the thread ID), we could get no more benefit than from the usual unrolling performed by the loop unroll pass.
So, IMO, the LV is just a complicated and error-prone way to do loop unrolling.
Once again, I may not understand some subtle matters, as I do not have much experience with the LV.

Right, we do not have vector operations in the same sense as a SIMD machine, nor do we have vector registers in the same sense. Well, almost: we have vector (meaning wide, multicomponent) loads and stores, and we have SOME v2 16-bit packed and v2 32-bit packed operations on some subtargets. But this is not the full set of packed ALU operations needed for vectorization and may not warrant a full loop vectorizer (the SLP vectorizer is still in use). In any case, there are no SIMD-style vector registers to control the interleave. Even though we use register tuples for wide operations, these are not separate registers like, for example, XMMs on x86, and they cut into the same register budget. So increasing the 'number of vector registers' not only does not mean much for our target but also inevitably leads to performance regressions.

All in all, the loop vectorizer does not do much for AMDGPU and cannot be the pass driving unrolling. It still does something, as said above, so whether to keep it on or off is a completely separate question, but it is certainly not an unroll driver here.

Interesting, thanks for elaborating. It seems like the role/benefit of LV here is a bit unclear, and in the long term it would be good to disable it entirely if it does more harm than good.

llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
2

Can you precommit the test, so we have only the functional diff in the actual patch here?

25

Is it possible to fold the whole loop body into a single block to simplify the test?

alex-t added inline comments.May 24 2023, 6:58 AM
llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll
2

Can you precommit the test, so we have only the functional diff in the actual patch here?

I am not sure I understand correctly. Do you want me to split this into two parts:
commit the test in a separate commit, leaving only the diff in LoopVectorize.cpp for one more commit?

alex-t updated this revision to Diff 525167.May 24 2023, 7:12 AM

Test simplified

alex-t marked 3 inline comments as done.May 24 2023, 7:13 AM