This is an archive of the discontinued LLVM Phabricator instance.

Differential D96522

[LV] Try larger VFs if VF is unprofitable for small types.
Abandoned · Public

Authored by fhahn on Feb 11 2021, 9:26 AM.

Details

Reviewers

Ayal
hfinkel
anemet
dmgreen

Summary

Currently LV can choose sub-optimal vectorization factors for loops with
memory accesses using different widths. At the moment, the largest type
limits the vectorization factor, but this is overly pessimistic on some
targets, which have memory instructions that require a certain minimum
VF for operations on narrow types.

The motivating example is AArch64, which requires larger VFs for
vectorization to be profitable when narrow types are involved.

Currently, code like the example below is not vectorized on AArch64, because
the chosen max VF of 4 (because the largest type is i32) is not profitable
(due to type extensions).

int foo(unsigned char *len, unsigned size) {
  int maxLen = 0;
  int minLen = 0;
  for (unsigned i = 0; i < size; i++) {
    if (len[i] > maxLen) maxLen = len[i];
    if (len[i] < minLen) minLen = len[i];
  }
  return maxLen + minLen;
}

This patch addresses this issue by detecting cases where memory ops on
the narrowest type are more expensive at the default maximum VF than at
larger VFs. For such cases, it instead considers larger vectorization
factors, limited by estimated register usage. Loops like the above can be
sped up by ~4x on AArch64.
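
The detection step described above can be sketched outside of LLVM as follows. This is a minimal standalone sketch, not the patch itself: `narrowMemOpUnprofitable`, `LoadCostFn` and `toyLoadCost` are hypothetical names, and the toy cost numbers are assumptions chosen only to mimic a target where a 4 x i8 load is more expensive than a 16 x i8 load.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Largest power of two that is <= X (0 for X == 0).
inline uint64_t powerOf2Floor(uint64_t X) {
  if (X == 0)
    return 0;
  uint64_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Hypothetical stand-in for a target cost query: the cost of loading a vector
// of NumElts elements that are EltBits wide each.
using LoadCostFn = std::function<unsigned(unsigned EltBits, unsigned NumElts)>;

// Returns true if a load of the narrowest type at the default maximum VF
// (limited by the widest type) is strictly more expensive than the same load
// at the largest VF that fits the register for the narrowest type.
inline bool narrowMemOpUnprofitable(unsigned RegisterBits,
                                    unsigned SmallestBits, unsigned WidestBits,
                                    const LoadCostFn &LoadCost) {
  assert(SmallestBits > 0 && SmallestBits <= WidestBits &&
         WidestBits <= RegisterBits && "inconsistent type/register widths");
  if (SmallestBits == WidestBits)
    return false;
  unsigned DefaultVF =
      static_cast<unsigned>(powerOf2Floor(RegisterBits / WidestBits));
  unsigned MaxVFForSmallTy =
      static_cast<unsigned>(powerOf2Floor(RegisterBits / SmallestBits));
  return LoadCost(SmallestBits, DefaultVF) >
         LoadCost(SmallestBits, MaxVFForSmallTy);
}

// Toy cost model (an assumption, not real AArch64 numbers): vectors of at
// least 64 bits load in one unit, narrower ones cost one unit per element.
inline unsigned toyLoadCost(unsigned EltBits, unsigned NumElts) {
  return EltBits * NumElts >= 64 ? 1 : NumElts;
}
```

With a 128-bit register, i8 as the narrowest type and i32 as the widest, `narrowMemOpUnprofitable(128, 8, 32, toyLoadCost)` compares a 4 x i8 load against a 16 x i8 load and returns true under the toy model. Because the comparison is strict, equal costs leave the default maximum VF untouched.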

This change should not introduce regressions; we only explore more
vectorization factors, but the cost model still picks the most
profitable one.

The impact on SPEC2000 & SPEC2006 is relatively small:

Tests: 31
Same hash: 18 (filtered out)
Remaining: 13
Metric: loop-vectorize.LoopsVectorized

Program                                       Before   After    Diff
test-suite...T2000/300.twolf/300.twolf.test    18.00   23.00   27.8%
test-suite...T2000/256.bzip2/256.bzip2.test    12.00   14.00   16.7%
test-suite...T2006/401.bzip2/401.bzip2.test    15.00   17.00   13.3%
test-suite...T2006/445.gobmk/445.gobmk.test    25.00   27.00    8.0%
test-suite...0/253.perlbmk/253.perlbmk.test    32.00   34.00    6.2%
test-suite...000/186.crafty/186.crafty.test    19.00   20.00    5.3%
test-suite...0.perlbench/400.perlbench.test    38.00   40.00    5.3%
test-suite...T2006/456.hmmer/456.hmmer.test    63.00   65.00    3.2%
test-suite...6/482.sphinx3/482.sphinx3.test    64.00   66.00    3.1%
test-suite.../CINT2000/176.gcc/176.gcc.test    43.00   44.00    2.3%
test-suite.../CINT2006/403.gcc/403.gcc.test    97.00   98.00    1.0%
test-suite...3.xalancbmk/483.xalancbmk.test   271.00  273.00    0.7%
test-suite...6/464.h264ref/464.h264ref.test    79.00   79.00    0.0%
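
The last column of the table appears to be the relative change between the two LoopsVectorized counts that precede it. A small sketch of that arithmetic, assuming this reading of the columns is right (`percentChange` is a hypothetical helper, not part of the test-suite tooling):

```cpp
#include <cassert>

// Relative change from Before to After, in percent.
inline double percentChange(double Before, double After) {
  assert(Before != 0.0 && "relative change is undefined for a zero baseline");
  return (After - Before) / Before * 100.0;
}
```

For 300.twolf, `percentChange(18.0, 23.0)` is about 27.78, which the table shows to one decimal place as 27.8%.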

There are a few small runtime improvements.

I also verified the changes to the vectorized loops in 300.twolf, 401.bzip2
& 445.gobmk. All changed loops are loops that the patch targets.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time        Test
60,070 ms   x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vloxseg.c
60,160 ms   x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vluxseg.c
60,110 ms   x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-overloaded::vloxseg.c
60,090 ms   x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-overloaded::vluxseg.c
60,340 ms   x64 debian > Clang.Driver::aarch64-cpus.c
(7 Failed in total)

Event Timeline

fhahn created this revision. Feb 11 2021, 9:26 AM

Herald added subscribers: hiraditya, kristof.beyls. Feb 11 2021, 9:26 AM

fhahn requested review of this revision. Feb 11 2021, 9:26 AM

Herald added a project: Restricted Project. Feb 11 2021, 9:26 AM

I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.

Is your only target for this reductions? There is some code at the moment in getSmallestAndWidestTypes that limits the widest type to loads, stores and reductions, using RdxDesc.getRecurrenceType(). We already override that for InLoopReduction as they can sometimes handle larger than legal operations natively. Does that use the original reduction size, or the MinBWs size? Does the MinBWs help here, or does the cmp block that from working, and it doesn't realize the whole loop can be calculated using an i8?

In that example, the entire loop seems like it could become load; smin.16b, smax.16b under AArch64. That might be difficult to make happen though.

In D96522#2557608, @dmgreen wrote:

> I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.

Agreed! Ideally we would be confident enough in the cost-model to more liberally explore larger VFs and trust the cost-model to pick the best one, even in the presence of larger ones. I think this patch is a small first step in this direction, with potential to extend this to more cases in the future.

> Is your only target for this reductions? There is some code at the moment in getSmallestAndWidestTypes that limits the widest type to loads, stores and reductions, using RdxDesc.getRecurrenceType(). We already override that for InLoopReduction as they can sometimes handle larger than legal operations natively. Does that use the original reduction size, or the MinBWs size? Does the MinBWs help here, or does the cmp block that from working, and it doesn't realize the whole loop can be calculated using an i8?

I just realized the example I chose was more complex than it needed to be. The main initial focus is narrow loads & stores, but it also applies to reductions. I don't think MinBWs helps directly with deciding whether to try bigger vector factors (the IR test cases do not have reductions and have either a load or store to i32).

> In that example, the entire loop seems like it could become load; smin.16b, smax.16b under AArch64. That might be difficult to make happen though.

Yeah, I'm planning to take a look at whether we can use MinBWs more aggressively in the reduction case as a follow-up.
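
The observation that the reductions in the summary's example could be carried in 8 bits can be checked with a small standalone sketch. `fooWide` mirrors the loop from the summary; `fooNarrow` is a hypothetical rewrite that keeps both accumulators in `unsigned char` and only widens at the end. It is a scalar illustration of the idea under that assumption, not what the vectorizer or MinBWs produce today.

```cpp
#include <algorithm>
#include <cassert>

// The loop from the summary: both reductions are carried in int, so every
// loaded byte has to be extended before the compare.
inline int fooWide(const unsigned char *Len, unsigned Size) {
  assert((Len != nullptr || Size == 0) && "non-empty input needs a buffer");
  int MaxLen = 0;
  int MinLen = 0;
  for (unsigned I = 0; I < Size; ++I) {
    if (Len[I] > MaxLen) MaxLen = Len[I];
    if (Len[I] < MinLen) MinLen = Len[I];
  }
  return MaxLen + MinLen;
}

// Hypothetical narrow variant: the inputs are in [0, 255] and both
// accumulators start at 0, so they never leave the 8-bit range and the
// extension can be delayed until after the loop.
inline int fooNarrow(const unsigned char *Len, unsigned Size) {
  assert((Len != nullptr || Size == 0) && "non-empty input needs a buffer");
  unsigned char MaxLen = 0;
  unsigned char MinLen = 0;
  for (unsigned I = 0; I < Size; ++I) {
    MaxLen = std::max(MaxLen, Len[I]);
    MinLen = std::min(MinLen, Len[I]);
  }
  return static_cast<int>(MaxLen) + static_cast<int>(MinLen);
}
```

Both functions return the same value for any input, which is what would allow a full-width vector of 16 byte lanes per iteration instead of four 32-bit lanes.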

In D96522#2557790, @fhahn wrote:

> In D96522#2557608, @dmgreen wrote:
>
> > I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.
>
> Agreed! Ideally we would be confident enough in the cost-model to more liberally explore larger VFs and trust the cost-model to pick the best one, even in the presence of larger ones. I think this patch is a small first step in this direction, with potential to extend this to more cases in the future.

The approach seems sensible to me. Is the reason for not exploring larger VFs more liberally that the cost-model doesn't always accurately represent/consider the legalization costs?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5526	nit: Are you planning to handle more special cases like this in the future? If so, then it may be worth moving this to its own `shouldMaximizeBandwidth` function.
5535	nit: this condition is currently always true. Can you change this into an assert?

Harbormaster completed remote builds in B88840: Diff 323050. Feb 11 2021, 3:18 PM

OK, so this is more about the loads being expensive than the reductions. If this is for 4 x i8 loads being expensive, should we just be overriding shouldMaximizeVectorBandwidth on AArch64? It could be based on SmallestType / WidestType if that makes it more precise, but I don't know that it needs to be.
https://godbolt.org/z/xG55W6

That would prevent us from putting this into the vectorizer cost model directly, which may not be correct with the made-up alignments/address space, and no indication of whether the loads are extended or not.

A blast from the past :)

Rebased and updated to work with scalable VFs as well.

In D96522#2559177, @dmgreen wrote:

> OK, so this is more about the loads being expensive than the reductions. If this is for 4 x i8 loads being expensive, should we just be overriding shouldMaximizeVectorBandwidth on AArch64? It could be based on SmallestType / WidestType if that makes it more precise, but I don't know that it needs to be.
> https://godbolt.org/z/xG55W6
>
> That would prevent us from putting this into the vectorizer cost model directly, which may not be correct with the made-up alignments/address space, and no indication of whether the loads are extended or not.

The extended loads case is interesting. To catch that, we effectively would have to look at all the loads. Where we do that wouldn't matter too much, I think.

Herald added a subscriber: ctetreau. Feb 9 2022, 12:25 PM

fhahn added inline comments. Feb 9 2022, 12:26 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5535	I updated the code to support scalable vectors. But it seems like the cost model considers loads of `<vscale x 4 x i8>` as expensive as `<vscale x 16 x i8>`, so the higher VF isn't chosen for scalable vectors. And the maximized fixed VF seems to be more profitable than the scalable one, so we end up choosing `<16 x i8>` (see the scalable tests).

fhahn mentioned this in D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth. Feb 9 2022, 12:31 PM

Harbormaster completed remote builds in B148546: Diff 407245. Feb 9 2022, 3:12 PM

dmgreen mentioned this in D119469: [AArch64] Turn truncating buildvectors into truncates. Feb 10 2022, 12:10 PM

dmgreen mentioned this in D119887: [AArch64] Common patterns between UMULL and int_aarch64_neon_umull. Feb 15 2022, 1:03 PM

dmgreen mentioned this in D120018: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors. Feb 17 2022, 1:09 AM

fhahn mentioned this in D120481: [AArch64] Try to re-use extended operand for SETCC with vector ops.. Feb 24 2022, 6:52 AM

fhahn mentioned this in D120571: [CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl.. Feb 25 2022, 9:30 AM

fhahn abandoned this revision. Jul 1 2022, 7:29 AM

Herald added a project: Restricted Project. Jul 1 2022, 7:29 AM

fhahn mentioned this in rGafdedd405e49: [AArch64] Try to re-use extended operand for SETCC with vector ops.. Jul 7 2022, 4:51 PM

fhahn mentioned this in rG81a11da76257: [CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl.. Sep 15 2022, 11:18 AM

Revision Contents

Path                                                                                                Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                                                     25 lines
llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll   9 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization-cost-tuning.ll                    4 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization.ll                                16 lines

Diff 407245

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[5,517 preceding lines not shown; the hunk below is inside
ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(]

   if (!MaxVectorElementCount) {
     LLVM_DEBUG(dbgs() << "LV: The target has no "
                       << (ComputeScalableMaxVF ? "scalable" : "fixed")
                       << " vector registers.\n");
     return ElementCount::getFixed(1);
   }

   const auto TripCountEC = ElementCount::getFixed(ConstTripCount);
   if (ConstTripCount &&
       ElementCount::isKnownLE(TripCountEC, MaxVectorElementCount) &&
       (!FoldTailByMasking || isPowerOf2_32(ConstTripCount))) {
     // If loop trip count (TC) is known at compile time there is no point in
     // choosing VF greater than TC (as done in the loop below). Select maximum
     // power of two which doesn't exceed TC.
     // If MaxVectorElementCount is scalable, we only fall back on a fixed VF
     // when the TC is less than or equal to the known number of lanes.
     auto ClampedConstTripCount = PowerOf2Floor(ConstTripCount);
     LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
                          "exceeding the constant trip count: "
                       << ClampedConstTripCount << "\n");
     return ElementCount::getFixed(ClampedConstTripCount);
   }

   ElementCount MaxVF = MaxVectorElementCount;
+  // The largest type limits the vectorization factor, but this can be too
+  // limiting when smaller memory operations are present, which are not
+  // legal/profitable with the chosen vectorization factor and are only
+  // profitable with larger vectorization factors.
+  //
+  // Try to detect such cases and try increasing the VF in those cases.
+  LLVMContext &Context = TheLoop->getHeader()->getContext();
+  bool NarrowMemOpUnprofitable = false;
+  if (SmallestType < WidestType) {
+    Type *SmallVT =
+        VectorType::get(IntegerType::get(Context, SmallestType), MaxVF);
+    unsigned MaxVFForSmallTy = PowerOf2Floor(
+        WidestRegister.divideCoefficientBy(SmallestType).getKnownMinValue());
+    ;
+    Type *SmallMaxPossibleVT =
+        VectorType::get(IntegerType::get(Context, SmallestType),
+                        MaxVFForSmallTy, MaxVF.isScalable());
+    NarrowMemOpUnprofitable =
+        TTI.getMemoryOpCost(Instruction::Load, SmallVT, Align(1), 0) >
+        TTI.getMemoryOpCost(Instruction::Load, SmallMaxPossibleVT, Align(1), 0);
+  }
   if (TTI.shouldMaximizeVectorBandwidth() ||
-      (MaximizeBandwidth && isScalarEpilogueAllowed())) {
+      ((MaximizeBandwidth || NarrowMemOpUnprofitable) &&
+       isScalarEpilogueAllowed())) {
     auto MaxVectorElementCountMaxBW = ElementCount::get(
         PowerOf2Floor(WidestRegister.getKnownMinSize() / SmallestType),
         ComputeScalableMaxVF);
     MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);

     // Collect all viable vectorization factors larger than the default MaxVF
     // (i.e. MaxVectorElementCount).
     SmallVector<ElementCount, 8> VFs;
     for (ElementCount VS = MaxVectorElementCount * 2;
          ElementCount::isKnownLE(VS, MaxVectorElementCountMaxBW); VS *= 2)
       VFs.push_back(VS);

     // For each VF calculate its register usage.

[5,215 following lines not shown]
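
The candidate-collection loop at the end of the hunk can be illustrated with plain integers. This is a simplified sketch that models only fixed-width VFs as `unsigned` values; `collectLargerVFs` is a hypothetical name, and the real code works on `ElementCount` and goes on to filter the candidates by register usage.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Collects every power-of-two multiple of DefaultMaxVF that is larger than
// DefaultMaxVF and does not exceed MaxBWVF, in increasing order.
inline std::vector<unsigned> collectLargerVFs(unsigned DefaultMaxVF,
                                              unsigned MaxBWVF) {
  assert(DefaultMaxVF > 0 && "the default maximum VF must be non-zero");
  std::vector<unsigned> VFs;
  // A 64-bit induction variable keeps the doubling from wrapping around.
  for (uint64_t VS = uint64_t(DefaultMaxVF) * 2; VS <= MaxBWVF; VS *= 2)
    VFs.push_back(static_cast<unsigned>(VS));
  return VFs;
}
```

For a 128-bit register with i32 as the widest type and i8 as the narrowest, the default maximum VF is 4 and the bandwidth-maximizing VF is 16, so `collectLargerVFs(4, 16)` yields the candidates 8 and 16.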

llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll

 ; RUN: opt -loop-vectorize -mtriple=arm64-apple-darwin -S %s | FileCheck %s

 ; Test cases for extending the vectorization factor, if small memory operations
 ; are not profitable.

 ; Test with a loop that contains memory accesses of i8 and i32 types. The
 ; default maximum VF for NEON is 4. And while we don't have an instruction to
 ; load 4 x i8, vectorization might still be profitable.
 define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {
 ; CHECK-LABEL: @test_load_i8_store_i32(
-; CHECK: <4 x i8>
+; CHECK: <16 x i8>
+; CHECK: <16 x i32>
 ;
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
   %gep.src = getelementptr inbounds i8, i8* %src, i64 %iv
   %lv = load i8, i8* %gep.src, align 1
   %lv.ext = zext i8 %lv to i32
   %add = add i32 %lv.ext, %off
   %gep.dst = getelementptr inbounds i32, i32* %dst, i64 %iv
   store i32 %add, i32* %gep.dst
   %iv.next = add nuw nsw i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, %N
   br i1 %exitcond.not, label %exit, label %loop

 exit:
   ret void
 }

 ; Same as test_load_i8_store_i32, but with types flipped for load and store.
 define void @test_load_i32_store_i8(i32* noalias %src, i8* noalias %dst, i32 %off, i64 %N) {
 ; CHECK-LABEL: @test_load_i32_store_i8(
-; CHECK: <4 x i8>
+; CHECK: <16 x i32>
+; CHECK: <16 x i8>
 ;
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
   %gep.src = getelementptr inbounds i32, i32* %src, i64 %iv
   %lv = load i32, i32* %gep.src, align 1
... (35 lines not shown) ...
 exit:
   ret void
 }

 ; Test with loop body that requires a large number of vector registers if the
 ; vectorization factor is large. Make sure the register estimates limit the
 ; vectorization factor.
 define void @test_load_i8_store_i64_large(i8* noalias %src, i64* noalias %dst, i64* noalias %dst.2, i64* noalias %dst.3, i64* noalias %dst.4, i64* noalias %dst.5, i64%off, i64 %off.2, i64 %N) {
 ; CHECK-LABEL: @test_load_i8_store_i64_large
-; CHECK: <2 x i64>
+; CHECK: <8 x i8>
+; CHECK: <8 x i64>
 ;
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
   %gep.src = getelementptr inbounds i8, i8* %src, i64 %iv
   %gep.dst.3 = getelementptr inbounds i64, i64* %dst.3, i64 %iv
... (28 lines not shown) ...

llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization-cost-tuning.ll

... (22 lines not shown) ...
 ; GENERIC: LV: Vector loop of width vscale x 4 costs: 1 (assuming a minimum vscale of 2).

 ; NEOVERSE-V1: LV: Vector loop of width vscale x 2 costs: 3 (assuming a minimum vscale of 2).
 ; NEOVERSE-V1: LV: Vector loop of width vscale x 4 costs: 1 (assuming a minimum vscale of 2).

 ; NEOVERSE-N2: LV: Vector loop of width vscale x 2 costs: 6 (assuming a minimum vscale of 1).
 ; NEOVERSE-N2: LV: Vector loop of width vscale x 4 costs: 3 (assuming a minimum vscale of 1).

-; VF-4: <4 x i32>
-; VF-VSCALE4: <vscale x 4 x i32>
+; VF-4: <16 x i32>
+; VF-VSCALE4: <16 x i32>
 define void @test0(i32* %a, i8* %b, i32* %c) #0 {
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv
   %0 = load i32, i32* %arrayidx, align 4
... (14 lines not shown) ...

llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization.ll

 ; REQUIRES: asserts
 ; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=off < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_DISABLED
 ; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON
 ; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -vectorizer-maximize-bandwidth -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON_MAXBW

 ; Test that the MaxVF for the following loop, that has no dependence distances,
 ; is calculated as vscale x 4 (max legal SVE vector size) or vscale x 16
 ; (maximized bandwidth for i8 in the loop).
 define void @test0(i32* %a, i8* %b, i32* %c) #0 {
 ; CHECK: LV: Checking a loop in "test0"
 ; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4
-; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 4
+; CHECK_SCALABLE_ON: LV: Selecting VF: 16
 ; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
-; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4
+; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
 ; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 16
 ; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: vscale x 16
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv
... (12 lines not shown) ...
 exit:
   ret void
 }

 ; Test that the MaxVF for the following loop, with a dependence distance
 ; of 64 elements, is calculated as (maxvscale = 16) * 4.
 define void @test1(i32* %a, i8* %b) #0 {
 ; CHECK: LV: Checking a loop in "test1"
 ; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4
-; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 4
+; CHECK_SCALABLE_ON: LV: Selecting VF: 16
 ; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
-; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4
+; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
 ; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 4
 ; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
... (13 lines not shown) ...
 exit:
   ret void
 }

 ; Test that the MaxVF for the following loop, with a dependence distance
 ; of 32 elements, is calculated as (maxvscale = 16) * 2.
 define void @test2(i32* %a, i8* %b) #0 {
 ; CHECK: LV: Checking a loop in "test2"
 ; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 2
-; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 2
+; CHECK_SCALABLE_ON: LV: Selecting VF: 16
 ; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
-; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4
+; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
 ; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 2
 ; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
... (13 lines not shown) ...
 exit:
   ret void
 }

 ; Test that the MaxVF for the following loop, with a dependence distance
 ; of 16 elements, is calculated as (maxvscale = 16) * 1.
 define void @test3(i32* %a, i8* %b) #0 {
 ; CHECK: LV: Checking a loop in "test3"
 ; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 1
-; CHECK_SCALABLE_ON: LV: Selecting VF: 4
+; CHECK_SCALABLE_ON: LV: Selecting VF: 16
 ; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
-; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4
+; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
 ; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 1
 ; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
 entry:
   br label %loop

 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
... (49 following lines not shown) ...
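
The test comments above relate the dependence distance to the largest feasible scalable VF through the maximum vscale. A hedged sketch of that arithmetic, assuming the bound is simply the distance in elements divided by the maximum vscale (`maxScalableVFForDistance` is a hypothetical helper, not an LLVM function):

```cpp
#include <cassert>

// Returns an upper bound N such that a VF of "vscale x N" never spans more
// elements than the dependence distance for any vscale <= MaxVScale.
// A result of 0 means no scalable VF is feasible under this bound.
inline unsigned maxScalableVFForDistance(unsigned DistanceInElts,
                                         unsigned MaxVScale) {
  assert(MaxVScale > 0 && "maximum vscale must be known and non-zero");
  return DistanceInElts / MaxVScale;
}
```

With a maximum vscale of 16, distances of 64, 32 and 16 elements give bounds of 4, 2 and 1, matching the "Found feasible scalable VF" lines checked in test1, test2 and test3.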