Details

Reviewers

Ayal
gilr
dmgreen

Commits

rGa0fcf84a8c4d: [LV] Consider if scalar epilogue is required in getMaximizedVFForTarget.

Summary

When a scalar epilogue is required, at least one iteration of the scalar loop
has to execute. Adjust ConstTripCount accordingly to avoid picking a max VF
that results in a dead vector loop.
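
To make the adjustment concrete, here is a minimal standalone sketch of the clamping step. It is not the LLVM code itself: `clampMaxVF`, `bitFloor` and the `widestLanes` parameter are hypothetical names introduced for illustration, and tail folding and scalable vectors are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that does not exceed x (0 for x == 0).
static std::uint32_t bitFloor(std::uint32_t x) {
  if (x == 0)
    return 0;
  std::uint32_t p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

// Hypothetical helper (not LLVM API): upper bound on a fixed VF for a loop
// with a compile-time trip count. `widestLanes` stands in for
// WidestRegisterMinEC; a trip count of 0 means "unknown".
std::uint32_t clampMaxVF(std::uint32_t constTripCount,
                         std::uint32_t widestLanes,
                         bool requiresScalarEpilogue) {
  assert(widestLanes > 0 && "need at least one lane");
  // Reserve one iteration for the mandatory scalar epilogue.
  if (constTripCount > 0 && requiresScalarEpilogue)
    constTripCount -= 1;
  // Unknown or large trip count: only the register width limits the VF.
  if (constTripCount == 0 || constTripCount > widestLanes)
    return widestLanes;
  // Otherwise there is no point in a VF greater than the trip count.
  return bitFloor(constTripCount);
}
```

Under these assumptions, a trip count of 32 with 64 available lanes gives an upper bound of 32 when no scalar epilogue is required and 16 when one is. This is only the bound; the cost model may still pick a smaller VF, as in the updated test, which ends up with VF=8.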

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Jun 30 2023, 2:15 PM

Herald added a project: Restricted Project. Jun 30 2023, 2:15 PM

Herald added subscribers: artagnon, StephenFan, hiraditya.

fhahn requested review of this revision. Jun 30 2023, 2:15 PM

Harbormaster completed remote builds in B242555: Diff 536424. Jun 30 2023, 3:55 PM

Ayal added inline comments. Jul 2 2023, 8:38 AM

llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll
7–9	Comment above deserves updating? It refers to skipping over the main loop vectorized to VF=32 and jumping to the epilog loop vectorized to VF=8 (with original trip count of 32 and requires scalar epilog), whereas now the main loop is vectorized to VF=8 and the epilog loop is not vectorized at all.

fhahn mentioned this in rGe561edaaa56c: [LV] Prepare tests for D154261. Jul 3 2023, 9:49 AM

Rebase on top of e561edaaa56c9a8818d546774b141dead7224b50 which updates the test in pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll so it keeps testing for the original issue.

fhahn marked an inline comment as done. Jul 3 2023, 9:52 AM

fhahn added inline comments.

llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll
7–9	Thanks, I moved the test to `limit-vf-by-tripcount.ll` in e561edaaa56c9a8818d546774b141dead7224b50 and updated the trip counts in the test so that there's an iteration of both the main loop and the epilogue loop, so it keeps testing for the original issue.

Harbormaster completed remote builds in B242829: Diff 536809. Jul 3 2023, 12:12 PM

In D154261#4468894, @fhahn wrote:

Rebase on top of e561edaaa56c9a8818d546774b141dead7224b50 which updates the test in pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll so it keeps testing for the original issue.

Ah, the original issue of pr56319 was to ensure that the epilog loop leaves at least one iteration for the scalar loop (or actually epilogVF iterations) rather than vectorizing all remaining iterations? An odd trip count of 39 will always end up with a last scalar iteration, due to the main and epilog VFs being even. Perhaps a trip count of 48 with (forced?) main-loop VF of 32 and epilog-loop VF of 8 should ensure the epilog loop runs only once.

llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll
285–286	nit: name of test is a bit misleading, the VF of the main loop is limited to avoid excessive values rather than avoiding epilog vectorization.

Herald added a subscriber: wangpc. Jul 4 2023, 5:58 AM

In D154261#4471308, @Ayal wrote:

Ah, the original issue of pr56319 was to ensure that the epilog loop leaves at least one iteration for the scalar loop (or actually epilogVF iterations) rather than vectorizing all remaining iterations? An odd trip count of 39 will always end up with a last scalar iteration, due to the main and epilog VFs being even. Perhaps a trip count of 48 with (forced?) main-loop VF of 32 and epilog-loop VF of 8 should ensure the epilog loop runs only once.

The original issue was that we simplified the branch in the main vector loop to exit in the first iteration (VF == vector TC) and, because the epilogue and main vector loops share the same VPlan, it was also simplified for the epilogue vector loop, even though there was more than 1 vector iteration (fix was https://github.com/llvm/llvm-project/commit/0dddf04caba55a64f8534518d65311bdac05cf39).

Now, the test still checks that we simplify neither the main nor the epilogue vector loop branch. But in this case we could simplify both, because the epilogue vector loop only executes a single iteration. We should also have a variant where the epilogue vector loop executes multiple times (e.g. main VF=32, epilogue VF=8, TC = 49), but unfortunately my attempts so far weren't successful, because there's some code in isMoreProfitable that estimates the cost of the vector loop + scalar tail for the given VFs, which ignores possible epilogue vectorization (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5384). For all cases I tried, this leads to a small main VF being chosen (8 in that case) if there would be more than a single iteration of the epilogue vector loop.

fhahn marked an inline comment as done. Jul 4 2023, 7:41 AM

fhahn added inline comments.

llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll
285–286	Adjusted the name, thanks!

Harbormaster completed remote builds in B243034: Diff 537095. Jul 4 2023, 8:34 AM

In D154261#4471533, @fhahn wrote:

The original issue was that we simplified the branch in the main vector loop to exit in the first iteration (VF == vector TC) and, because the epilogue and main vector loops share the same VPlan, it was also simplified for the epilogue vector loop, even though there was more than 1 vector iteration (fix was https://github.com/llvm/llvm-project/commit/0dddf04caba55a64f8534518d65311bdac05cf39).

Now, the test still checks that we simplify neither the main nor the epilogue vector loop branch. But in this case we could simplify both, because the epilogue vector loop only executes a single iteration. We should also have a variant where the epilogue vector loop executes multiple times (e.g. main VF=32, epilogue VF=8, TC = 49), but unfortunately my attempts so far weren't successful, because there's some code in isMoreProfitable that estimates the cost of the vector loop + scalar tail for the given VFs, which ignores possible epilogue vectorization (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5384). For all cases I tried, this leads to a small main VF being chosen (8 in that case) if there would be more than a single iteration of the epilogue vector loop.

Ah, ok, that optimization was later outlined by having LVP::execute_plan() do `if (!IsEpilogueVectorization) VPlanTransforms::optimizeForVFAndUF()`.
(And may explain the clean-up behavior observed in https://reviews.llvm.org/D154264#inline-1491906 :-)

The explanation for test @pr56319 may be misleading, though, as it specifically deals with having the epilog loop vectorization consider requiresScalarEpilog:
; Test case where the exit condition in the main vector loop can be optimized
; to true, but not in the epilogue vector loop. In the test the interleave
; group requires to execute at least one scalar iteration, meaning the last
; vector iteration of the epilogue vector loop cannot be executed.

Could a test with trip count of, say, 34, work - be vectorized with main VF=32 and epilog VF=2 w/o requiresScalarEpilog - when last member of the interleave-group is accessed, while with requiresScalarEpilog main VF=32 will continue but epilog vectorization will be abandoned?

In D154261#4472242, @Ayal wrote:

Ah, ok, that optimization was later outlined by having LVP::execute_plan() do `if (!IsEpilogueVectorization) VPlanTransforms::optimizeForVFAndUF()`.
(And may explain the clean-up behavior observed in https://reviews.llvm.org/D154264#inline-1491906 :-)

The explanation for test @pr56319 may be misleading, though, as it specifically deals with having the epilog loop vectorization consider requiresScalarEpilog:
; Test case where the exit condition in the main vector loop can be optimized
; to true, but not in the epilogue vector loop. In the test the interleave
; group requires to execute at least one scalar iteration, meaning the last
; vector iteration of the epilogue vector loop cannot be executed.

Agreed, I think the interleave group requiring one scalar iteration is a red herring with respect to the original issue. Tried to clarify it in this patch.

Could a test with trip count of, say, 34, work - be vectorized with main VF=32 and epilog VF=2 w/o requiresScalarEpilog - when last member of the interleave-group is accessed, while with requiresScalarEpilog main VF=32 will continue but epilog vectorization will be abandoned?

With 34, we could simplify the branch in the epilogue vector loop, as the epilogue vector loop would be dead (or shouldn't get generated, as in follow-on patches). I *think* TC = 37 with forced epilogue VF=2 should work: a single main loop iteration with VF=32, 2 iterations of the epilogue vector loop (so the branch cannot be simplified), and a final scalar iteration. Updated accordingly.
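
The trip-count arithmetic in this exchange can be checked with a small standalone sketch. It assumes an interleave count of 1, no tail folding, and that a required scalar epilogue turns a zero remainder into a full step; `vectorTripCount`, `splitIterations` and `IterationSplit` are hypothetical names, not LLVM API.

```cpp
#include <cassert>
#include <cstdint>

struct IterationSplit {
  std::uint64_t mainVectorIters;     // iterations of the main vector loop
  std::uint64_t epilogueVectorIters; // iterations of the epilogue vector loop
  std::uint64_t scalarIters;         // iterations left for the scalar loop
};

// Number of original iterations covered once a vector loop with the given
// step has finished. With a required scalar epilogue, a zero remainder is
// replaced by a full step so the scalar loop always runs at least once.
static std::uint64_t vectorTripCount(std::uint64_t tc, std::uint64_t step,
                                     bool requiresScalarEpilogue) {
  assert(step > 0 && "step must be positive");
  std::uint64_t remainder = tc % step;
  if (requiresScalarEpilogue && remainder == 0 && tc >= step)
    remainder = step;
  return tc - remainder;
}

// Hypothetical model of how a constant trip count is divided between the
// main vector loop, the epilogue vector loop and the scalar loop.
IterationSplit splitIterations(std::uint64_t tc, std::uint64_t mainVF,
                               std::uint64_t epilogueVF,
                               bool requiresScalarEpilogue) {
  std::uint64_t mainEnd = vectorTripCount(tc, mainVF, requiresScalarEpilogue);
  std::uint64_t epiEnd = vectorTripCount(tc, epilogueVF, requiresScalarEpilogue);
  if (epiEnd < mainEnd) // the epilogue never rewinds past the main loop
    epiEnd = mainEnd;
  return {mainEnd / mainVF, (epiEnd - mainEnd) / epilogueVF, tc - epiEnd};
}
```

Under these assumptions, TC=37 with main VF=32 and epilogue VF=2 gives 1 main iteration, 2 epilogue iterations and 1 scalar iteration, while TC=34 leaves the epilogue vector loop with 0 iterations, matching the numbers discussed above.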

Harbormaster completed remote builds in B243086: Diff 537171. Jul 4 2023, 2:27 PM

Ok, very well, then better drop the last requireScalarEpilog part from the explanation of the test?

llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll
9–10	The scalar loop will get to run the final iteration in any case, due to the trip-count being odd, regardless of any interleave group requiring it (aka red herring).
29	(Just noting that we're refraining from optimizing this single iteration main loop with an unconditional branch exiting to middle block, because that would also cause the double iteration epilog loop to bail out after its first iteration.)

This revision is now accepted and ready to land. Jul 4 2023, 2:37 PM

This revision was landed with ongoing or failed builds. Jul 6 2023, 5:36 AM

Closed by commit rGa0fcf84a8c4d: [LV] Consider if scalar epilogue is required in getMaximizedVFForTarget. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn marked an inline comment as done.

fhahn added a commit: rGa0fcf84a8c4d: [LV] Consider if scalar epilogue is required in getMaximizedVFForTarget..

fhahn marked an inline comment as done. Jul 6 2023, 5:36 AM

fhahn added inline comments.

llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll
9–10	Adjusted in the committed version.
29	Exactly

Diff 537676

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(

unsigned WidestRegisterMinEC = MaxVectorElementCount.getKnownMinValue();		unsigned WidestRegisterMinEC = MaxVectorElementCount.getKnownMinValue();
if (MaxVectorElementCount.isScalable() &&		if (MaxVectorElementCount.isScalable() &&
TheFunction->hasFnAttribute(Attribute::VScaleRange)) {		TheFunction->hasFnAttribute(Attribute::VScaleRange)) {
auto Attr = TheFunction->getFnAttribute(Attribute::VScaleRange);		auto Attr = TheFunction->getFnAttribute(Attribute::VScaleRange);
auto Min = Attr.getVScaleRangeMin();		auto Min = Attr.getVScaleRangeMin();
WidestRegisterMinEC *= Min;		WidestRegisterMinEC *= Min;
}		}

		// When a scalar epilogue is required, at least one iteration of the scalar
		// loop has to execute. Adjust ConstTripCount accordingly to avoid picking a
		// max VF that results in a dead vector loop.
		if (ConstTripCount > 0 && requiresScalarEpilogue(true))
		ConstTripCount -= 1;

if (ConstTripCount && ConstTripCount <= WidestRegisterMinEC &&		if (ConstTripCount && ConstTripCount <= WidestRegisterMinEC &&
(!FoldTailByMasking || isPowerOf2_32(ConstTripCount))) {		(!FoldTailByMasking || isPowerOf2_32(ConstTripCount))) {
// If loop trip count (TC) is known at compile time there is no point in		// If loop trip count (TC) is known at compile time there is no point in
// choosing VF greater than TC (as done in the loop below). Select maximum		// choosing VF greater than TC (as done in the loop below). Select maximum
// power of two which doesn't exceed TC.		// power of two which doesn't exceed TC.
// If MaxVectorElementCount is scalable, we only fall back on a fixed VF		// If MaxVectorElementCount is scalable, we only fall back on a fixed VF
// when the TC is less than or equal to the known number of lanes.		// when the TC is less than or equal to the known number of lanes.
auto ClampedConstTripCount = llvm::bit_floor(ConstTripCount);		auto ClampedConstTripCount = llvm::bit_floor(ConstTripCount);

llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll

loop:
store i8 %val, ptr %staddr, align 64		store i8 %val, ptr %staddr, align 64
%i.next = add i64 %i, 1		%i.next = add i64 %i, 1
%is.next = icmp ult i64 %i.next, 20		%is.next = icmp ult i64 %i.next, 20
br i1 %is.next, label %loop, label %exit		br i1 %is.next, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

define void @limit_main_loop_vf_to_avoid_epilogue_vectorization(ptr noalias %src, ptr noalias %dst) {		define void @limit_main_loop_vf_to_avoid_dead_main_vector_loop(ptr noalias %src, ptr noalias %dst) {
; CHECK-LABEL: @limit_main_loop_vf_to_avoid_epilogue_vectorization(		; CHECK-LABEL: @limit_main_loop_vf_to_avoid_dead_main_vector_loop(
; CHECK-NEXT: iter.check:		; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]		; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: br i1 true, label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0		; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC:%.*]], i64 [[TMP0]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC:%.*]], i64 [[TMP0]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <96 x i8>, ptr [[TMP2]], align 1		; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <24 x i8>, ptr [[TMP2]], align 1
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <96 x i8> [[WIDE_VEC]], <96 x i8> poison, <32 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21, i32 24, i32 27, i32 30, i32 33, i32 36, i32 39, i32 42, i32 45, i32 48, i32 51, i32 54, i32 57, i32 60, i32 63, i32 66, i32 69, i32 72, i32 75, i32 78, i32 81, i32 84, i32 87, i32 90, i32 93>		; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <24 x i8> [[WIDE_VEC]], <24 x i8> poison, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP3]], i32 0		; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP3]], i32 0
; CHECK-NEXT: store <32 x i8> [[STRIDED_VEC]], ptr [[TMP4]], align 1		; CHECK-NEXT: store <8 x i8> [[STRIDED_VEC]], ptr [[TMP4]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 0		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24
; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP13:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP13:![0-9]+]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br label [[VEC_EPILOG_ITER_CHECK:%.*]]		; CHECK-NEXT: br label [[SCALAR_PH]]
; CHECK: vec.epilog.iter.check:		; CHECK: scalar.ph:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 24, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ 0, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX1]], 0
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[TMP6]], i64 0
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
; CHECK-NEXT: [[WIDE_VEC2:%.*]] = load <24 x i8>, ptr [[TMP8]], align 1
; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <24 x i8> [[WIDE_VEC2]], <24 x i8> poison, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0
; CHECK-NEXT: store <8 x i8> [[STRIDED_VEC3]], ptr [[TMP10]], align 1
; CHECK-NEXT: [[INDEX_NEXT4]] = add nuw i64 [[INDEX1]], 8
; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT4]], 24
; CHECK-NEXT: br i1 [[TMP11]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
; CHECK: vec.epilog.middle.block:
; CHECK-NEXT: br label [[VEC_EPILOG_SCALAR_PH]]
; CHECK: vec.epilog.scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 24, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]		; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[IV]], i64 0		; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[IV]], i64 0
; CHECK-NEXT: [[L:%.*]] = load i8, ptr [[GEP_SRC]], align 1		; CHECK-NEXT: [[L:%.*]] = load i8, ptr [[GEP_SRC]], align 1
; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]		; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]
; CHECK-NEXT: store i8 [[L]], ptr [[GEP_DST]], align 1		; CHECK-NEXT: store i8 [[L]], ptr [[GEP_DST]], align 1
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IV_NEXT]], 32		; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IV_NEXT]], 32
; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP15:![0-9]+]]		; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP14:![0-9]+]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]

llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py

; RUN: opt -passes=loop-vectorize -mcpu=skylake-avx512 -S %s | FileCheck %s ; RUN: opt -passes=loop-vectorize -mcpu=skylake-avx512 -epilogue-vectorization-force-VF=2 -S %s | FileCheck %s

target triple = "x86_64-apple-macos" target triple = "x86_64-apple-macos"

; Test case where the exit condition in the main vector loop can be optimized ; Test case where the exit condition in the main vector loop can be optimized

; to true, but not in the epilogue vector loop. In the test the interleave ; to true, but not in the epilogue vector loop. A single iteration of the main

; group requires to execute at least one scalar iteration, meaning the last ; vector loop will be executed, followed by 2 iterations of the vector epilogue,

; vector iteration of the epilogue vector loop cannot be executed. ; followed by a final iteration of the scalar loop,

define void @pr56319(ptr noalias %src, ptr noalias %dst) { define void @pr56319(ptr noalias %src, ptr noalias %dst) {

Ayal

; vector loop will be executed, followed by 2 iterations of the vector epilogue,

- ; followed by a final iteration of the scalar loop, because the interleave group

- ; requires to execute at least one scalar iteration.

+ ; followed by a final iteration of the scalar loop.

define void @pr56319(ptr noalias %src, ptr noalias %dst) {

The scalar loop will get to run the final iteration in any case, due to the trip-count being odd, regardless of any interleave group requiring it (aka red herring).

fhahn

Adjusted in the committed version.

; CHECK-LABEL: @pr56319( ; CHECK-LABEL: @pr56319(

; CHECK-NEXT: iter.check: ; CHECK-NEXT: iter.check:

; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]] ; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]

; CHECK: vector.main.loop.iter.check: ; CHECK: vector.main.loop.iter.check:

; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]] ; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]

; CHECK: vector.ph: ; CHECK: vector.ph:

; CHECK-NEXT: br label [[VECTOR_BODY:%.*]] ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]

; CHECK: vector.body: ; CHECK: vector.body:

; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ] ; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]

; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0 ; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0

; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC:%.*]], i64 [[TMP0]], i64 0 ; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC:%.*]], i64 [[TMP0]], i64 0

; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0 ; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0

; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <96 x i8>, ptr [[TMP2]], align 1 ; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <96 x i8>, ptr [[TMP2]], align 1

; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <96 x i8> [[WIDE_VEC]], <96 x i8> poison, <32 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21, i32 24, i32 27, i32 30, i32 33, i32 36, i32 39, i32 42, i32 45, i32 48, i32 51, i32 54, i32 57, i32 60, i32 63, i32 66, i32 69, i32 72, i32 75, i32 78, i32 81, i32 84, i32 87, i32 90, i32 93> ; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <96 x i8> [[WIDE_VEC]], <96 x i8> poison, <32 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21, i32 24, i32 27, i32 30, i32 33, i32 36, i32 39, i32 42, i32 45, i32 48, i32 51, i32 54, i32 57, i32 60, i32 63, i32 66, i32 69, i32 72, i32 75, i32 78, i32 81, i32 84, i32 87, i32 90, i32 93>

; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]] ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]

; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP3]], i32 0 ; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP3]], i32 0

; CHECK-NEXT: store <32 x i8> [[STRIDED_VEC]], ptr [[TMP4]], align 1 ; CHECK-NEXT: store <32 x i8> [[STRIDED_VEC]], ptr [[TMP4]], align 1

; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32 ; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32

; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 32 ; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 32

; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]] ; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]

; CHECK: middle.block: ; CHECK: middle.block:

; CHECK-NEXT: br label [[VEC_EPILOG_ITER_CHECK:%.*]] ; CHECK-NEXT: br label [[VEC_EPILOG_ITER_CHECK:%.*]]

; CHECK: vec.epilog.iter.check: ; CHECK: vec.epilog.iter.check:

; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]] ; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]

; CHECK: vec.epilog.ph: ; CHECK: vec.epilog.ph:

; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ 32, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ] ; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ 32, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]

; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]] ; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]

; CHECK: vec.epilog.vector.body: ; CHECK: vec.epilog.vector.body:

; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ] ; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]

; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX1]], 0 ; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX1]], 0

; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[TMP6]], i64 0 ; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[TMP6]], i64 0

; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0 ; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0

; CHECK-NEXT: [[WIDE_VEC2:%.*]] = load <12 x i8>, ptr [[TMP8]], align 1 ; CHECK-NEXT: [[WIDE_VEC2:%.*]] = load <6 x i8>, ptr [[TMP8]], align 1

; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <12 x i8> [[WIDE_VEC2]], <12 x i8> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9> ; CHECK-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <6 x i8> [[WIDE_VEC2]], <6 x i8> poison, <2 x i32> <i32 0, i32 3>

; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP6]] ; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP6]]

; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0 ; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0

; CHECK-NEXT: store <4 x i8> [[STRIDED_VEC3]], ptr [[TMP10]], align 1 ; CHECK-NEXT: store <2 x i8> [[STRIDED_VEC3]], ptr [[TMP10]], align 1

; CHECK-NEXT: [[INDEX_NEXT4]] = add nuw i64 [[INDEX1]], 4 ; CHECK-NEXT: [[INDEX_NEXT4]] = add nuw i64 [[INDEX1]], 2

; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT4]], 36 ; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT4]], 36

; CHECK-NEXT: br i1 [[TMP11]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]] ; CHECK-NEXT: br i1 [[TMP11]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]

; CHECK: vec.epilog.middle.block: ; CHECK: vec.epilog.middle.block:

; CHECK-NEXT: br label [[VEC_EPILOG_SCALAR_PH]] ; CHECK-NEXT: br label [[VEC_EPILOG_SCALAR_PH]]

; CHECK: vec.epilog.scalar.ph: ; CHECK: vec.epilog.scalar.ph:

; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 36, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 32, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ] ; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 36, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 32, [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ]

; CHECK-NEXT: br label [[LOOP:%.*]] ; CHECK-NEXT: br label [[LOOP:%.*]]

; CHECK: loop: ; CHECK: loop:

; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ] ; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]

; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[IV]], i64 0 ; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr inbounds [3 x i8], ptr [[SRC]], i64 [[IV]], i64 0

; CHECK-NEXT: [[L:%.*]] = load i8, ptr [[GEP_SRC]], align 1 ; CHECK-NEXT: [[L:%.*]] = load i8, ptr [[GEP_SRC]], align 1

; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]] ; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]

; CHECK-NEXT: store i8 [[L]], ptr [[GEP_DST]], align 1 ; CHECK-NEXT: store i8 [[L]], ptr [[GEP_DST]], align 1

; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1 ; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1

; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IV_NEXT]], 39 ; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IV_NEXT]], 37

; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP4:![0-9]+]] ; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]

; CHECK: exit: ; CHECK: exit:

; CHECK-NEXT: ret void ; CHECK-NEXT: ret void

; ;

entry: entry:

br label %loop br label %loop

loop: loop:

%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ] %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]

%gep.src = getelementptr inbounds [3 x i8], ptr %src, i64 %iv, i64 0 %gep.src = getelementptr inbounds [3 x i8], ptr %src, i64 %iv, i64 0

%l = load i8, ptr %gep.src, align 1 %l = load i8, ptr %gep.src, align 1

%gep.dst = getelementptr inbounds i8, ptr %dst, i64 %iv %gep.dst = getelementptr inbounds i8, ptr %dst, i64 %iv

store i8 %l, ptr %gep.dst, align 1 store i8 %l, ptr %gep.dst, align 1

%iv.next = add nuw nsw i64 %iv, 1 %iv.next = add nuw nsw i64 %iv, 1

%cmp = icmp eq i64 %iv.next, 39 %cmp = icmp eq i64 %iv.next, 37

br i1 %cmp, label %exit, label %loop br i1 %cmp, label %exit, label %loop

exit: exit:

ret void ret void

} }

This is an archive of the discontinued LLVM Phabricator instance.

[LV] Consider if scalar epilogue is required in getMaximizedVFForTarget.
ClosedPublic