Set the maximum vscale VF on AArch64 to 128 divided by the size of the smallest type in the loop, when there is no register-usage overflow. This is similar to the Neon VF change done in D118979.
For Neon we enabled shouldMaximizeVectorBandwidth so that the backend could make use of instructions like umull/umull2 and the narrowing instructions. Extending into larger types is quite natural for Neon in places, and can lead to fewer instructions overall. SVE has instructions like UMULLB/T that work on the top/bottom lanes of a pair, but I don't believe the backend makes any use of them at the moment.
The description is a bit light on details. What is the reasoning behind enabling this for SVE too? And do you have any benchmark results?
I don't have access to a server with SVE to run the performance of a large benchmark like SPEC2017.
But when I run LAMMPS in Intel mode (https://www.lammps.org/#gsc.tab=0) on an emulator, I find the
hot function PairLJCutCoulLongIntel::eval in file pair_lj_cut_coul_long_intel.cpp:337 has its VF enlarged from 2 to 4.
Because there are both float and double types in the kernel loop body, choosing a wider VF gives
wider parallelism, and the performance gain is about 16% (https://github.com/lammps/lammps/blob/develop/src/INTEL/pair_lj_cut_coul_long_intel.cpp#L337).
For the record - In SVE2 there are a number of instructions that can use top/bottom lanes providing the backend does some sort of lane interleaving. Once that is done this might make a lot of sense but it might be better to address that first.