This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Correct lane zero optimization in insert/extract costs
Abandoned · Public

Authored by mssimpso on May 3 2017, 1:14 PM.

Details

Reviewers

mcrosier
t.p.northover
kristof.beyls
evandro
anemet
sbaranga

Summary

In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero if we are inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the sub-target. However, the lane zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data from vector to general purpose registers, even for lane zero.

This patch modifies the lane zero optimization so that it applies only to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing sub-target specified insert/extract costs are used only for the cross-class moves, which I think was probably the original intent. Since the existing code looks like a bug to me, I checked the X86 target, and it implements something similar to what is in this patch.

I've added a new cost model test case in Analysis/CostModel/AArch64/inserts-extracts.ll. All other test case changes are trivial (e.g., they lower the SLP threshold to ensure tests still vectorize).
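
As a rough illustration of the rule described in this summary, the following standalone C++ sketch models the patched cost logic outside of LLVM. The struct `LaneQuery`, the function `insertExtractCost`, and the `FallbackCost` parameter are hypothetical names introduced here for illustration only; in the actual patch the corresponding values come from type legalization, `ST->getVectorInsertExtractBaseCost()`, and `BaseT::getVectorInstrCost`. The sketch also assumes the lane index is known.

```cpp
// Hypothetical stand-in for the facts the real code derives from the type
// and from legalization. None of these names exist in LLVM.
struct LaneQuery {
  bool IsFloatingPoint;   // scalar element type is floating point
  bool LegalizedToVector; // false if the type is legalized to a scalar
  unsigned Width;         // element count of the legalized vector type
  unsigned Index;         // lane being inserted into or extracted from
};

// Cost of a single insert/extract under the patched rules. BaseCost models
// the sub-target's insert/extract base cost; FallbackCost models whatever
// the base TTI implementation would return (an assumption of this sketch).
int insertExtractCost(const LaneQuery &Q, int BaseCost, int FallbackCost) {
  // A type legalized to a scalar needs no lane move at all.
  if (!Q.LegalizedToVector)
    return 0;

  // The type may be split, so normalize the index to the legal type.
  unsigned Lane = Q.Index % Q.Width;

  // Only a floating-point lane zero is free (a within-class move).
  if (Q.IsFloatingPoint && Lane == 0)
    return 0;

  // Integer and pointer elements cross register classes.
  if (!Q.IsFloatingPoint)
    return BaseCost;

  // Remaining floating-point lanes use the generic cost.
  return FallbackCost;
}
```

With a base cost of 3 and a fallback of 1, a `<4 x double>` that legalizes to 2-element vectors yields 0, 1, 0, 1 for lanes 0 through 3, and every integer lane yields 3, which is what the new inserts-extracts.ll test expects.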

Diff Detail

Build Status

Buildable 6125
Build 6125: arc lint + arc unit

Event Timeline

mssimpso created this revision. May 3 2017, 1:14 PM

Herald added subscribers: rengolin, aemerson. May 3 2017, 1:14 PM

Hi Matt,

> In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero if we are inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the sub-target. However, the lane zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data from vector to general purpose registers, even for lane zero.

Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?

> This patch modifies the lane zero optimization so that it applies only to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing sub-target specified insert/extract costs are used only for the cross-class moves, which I think was probably the original intent. Since the existing code looks like a bug to me, I checked the X86 target, and it implements something similar to what is in this patch.

What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.

Thanks,
Adam

Hi Adam,

> Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?

That's right. That actually should be true for all load/insert sequences of legal types, not just ones that insert into lane zero - we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch? Regarding performance, this is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect may be somewhat different.

> What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.

The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy leaving the floating-point non-zero lanes at 3 if you prefer, and considering them instead in a follow-on patch if necessary.
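
To make the effect on integer types concrete, here is a small C++ sketch that sums the extract cost over every lane of an integer vector, before and after the change. The function names are hypothetical and this is only a model of the per-lane rule, not the vectorizers' actual overhead computation.

```cpp
// Before the patch: lane zero is free, every other lane costs BaseCost.
int extractAllLanesOld(unsigned NumLanes, int BaseCost) {
  int Total = 0;
  for (unsigned Lane = 0; Lane < NumLanes; ++Lane)
    Total += (Lane == 0) ? 0 : BaseCost;
  return Total;
}

// After the patch: every integer lane, including lane zero, costs BaseCost.
int extractAllLanesNew(unsigned NumLanes, int BaseCost) {
  int Total = 0;
  for (unsigned Lane = 0; Lane < NumLanes; ++Lane)
    Total += BaseCost;
  return Total;
}
```

For a 2-lane integer vector with the default base cost of 3, the total goes from 3 to 6, which is consistent with the `extractelement(3)` to `extractelement(6)` change in the updated comments of predication_costs.ll. With a base cost of 2 (Kryo, Falkor), the same vector goes from 2 to 4.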

My priority here is fixing the "extracting from lane zero costs nothing" issue, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the issues/optimizations one-by-one. What do you think?

In D32827#745349, @mssimpso wrote:

> Hi Adam,
>
>> Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?
>
> That's right. That actually should be true for all load/insert sequences of legal types, not just ones that insert into lane zero - we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch?

Sounds good.

> Regarding performance, this is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect may be somewhat different.

For perf changes like this, it would be great if we could have a more detailed analysis of the changed hotspots (like what I did for the 64-bit SLP). Opt-viewer is a great tool for this, especially with opt-diff, which will tell you the changes in SLP vectorization in the hotspots (if you run it with PGO). Unfortunately, I didn't have time to clean up my patches that add SLP vectorization remarks so you can't use it yet :(.

>> What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.
>
> The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy leaving the floating-point non-zero lanes at 3 if you prefer, and considering them instead in a follow-on patch if necessary.

These make sense to me.

I am wondering if a better abstraction would be if we just had a cost for cross-domain moves rather than using these magic values. I guess it doesn't matter if this is the only place where we have this logic.
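
One way to picture the cross-domain abstraction suggested above is the following C++ sketch. Everything in it is hypothetical: `RegDomain`, `MoveCosts`, and `laneMoveCost` are names invented for illustration, and nothing like this is claimed to exist in LLVM. It only shows how named per-domain costs could replace a single base value.

```cpp
// Hypothetical register domains for the scalar side of an insert/extract.
enum class RegDomain { GeneralPurpose, FloatingPoint };

// Hypothetical named costs that a sub-target could supply.
struct MoveCosts {
  int CrossDomain;       // scalar lives in a general-purpose register
  int WithinDomainLane0; // FP scalar already aliases lane zero
  int WithinDomainOther; // FP scalar moved to or from another lane
};

// Cost of moving one scalar into or out of a vector lane.
int laneMoveCost(RegDomain Scalar, unsigned Lane, const MoveCosts &Costs) {
  // Vector lanes live in the FP/SIMD registers, so any general-purpose
  // scalar pays the cross-domain cost regardless of the lane.
  if (Scalar == RegDomain::GeneralPurpose)
    return Costs.CrossDomain;
  return Lane == 0 ? Costs.WithinDomainLane0 : Costs.WithinDomainOther;
}
```

With `MoveCosts{3, 0, 1}` this reproduces the default numbers discussed in this thread: 3 for every integer or pointer lane, 0 for floating-point lane zero, and 1 for the other floating-point lanes.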

> My priority here is fixing the "extracting from lane zero costs nothing" issue, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the issues/optimizations one-by-one. What do you think?

Yes, we should definitely commit this in two pieces.

evandro added a subscriber: az. May 8 2017, 3:41 PM

Hey Adam,

Thanks for the feedback. It sounds like we can start with the "extracting from lane zero" piece and then follow up with patches for inserts fed by loads and then within-domain moves. Hopefully this sequence will minimize the effects of any temporary imprecision. And I will benchmark the changes with some of the hardware I have access to (Kryo, Falkor, Cortex-A57), and provide some analysis.

Thanks. I've added opt-remarks to SLP in rL302811. Hopefully you can use those for your analysis too.

I'm abandoning this change for now. The patch resulted in a few performance regressions that need to be investigated. Also, making the insert/extract cost more precise may require that we re-tune the default cost (3) as well.

Revision Contents

Path                                                              Size
lib/Target/AArch64/AArch64TargetTransformInfo.cpp                 14 lines
test/Analysis/CostModel/AArch64/                                  10 lines, 4 lines, 70 lines, 4 lines
test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll      2 lines
test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll    2 lines
test/Transforms/LoopVectorize/AArch64/interleaved_cost.ll         6 lines
test/Transforms/LoopVectorize/AArch64/predication_costs.ll        32 lines
test/Transforms/SLPVectorizer/AArch64/                            2 lines, 2 lines, 2 lines, 2 lines

Diff 97719

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

   if (Index != -1U) {
     // This type is legalized to a scalar type.
     if (!LT.second.isVector())
       return 0;

     // The type may be split. Normalize the index to the new type.
     unsigned Width = LT.second.getVectorNumElements();
     Index = Index % Width;

-    // The element at index zero is already inside the vector.
-    if (Index == 0)
+    // Floating-point scalars are already located in index #0.
+    if (Val->getScalarType()->isFloatingPointTy() && Index == 0)
       return 0;
   }

-  // All other insert/extracts cost this much.
-  return ST->getVectorInsertExtractBaseCost();
+  // For all other cross-class inserts/extracts, return the cost specified by
+  // the sub-target.
+  if (!Val->getScalarType()->isFloatingPointTy())
+    return ST->getVectorInsertExtractBaseCost();
+
+  // Fall back to the base TTI implementation for floating-point
+  // inserts/extracts.
+  return BaseT::getVectorInstrCost(Opcode, Val, Index);
 }

 int AArch64TTIImpl::getArithmeticInstrCost(
     unsigned Opcode, Type *Ty, TTI::OperandValueKind Opd1Info,
     TTI::OperandValueKind Opd2Info, TTI::OperandValueProperties Opd1PropInfo,
     TTI::OperandValueProperties Opd2PropInfo, ArrayRef<const Value *> Args) {
   // Legalize the type.
   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
[...]

test/Analysis/CostModel/AArch64/bswap.ll

[...]
 ; CHECK: 'Cost Model Analysis' for function 'bswap_i64':
 ; CHECK: Found an estimated cost of 1 for instruction: %bswap
   %bswap = tail call i64 @llvm.bswap.i64(i64 %a)
   ret i64 %bswap
 }

 define <2 x i32> @bswap_v2i32(<2 x i32> %a) {
 ; CHECK: 'Cost Model Analysis' for function 'bswap_v2i32':
-; CHECK: Found an estimated cost of 8 for instruction: %bswap
+; CHECK: Found an estimated cost of 14 for instruction: %bswap
   %bswap = call <2 x i32> @llvm.bswap.v2i32(<2 x i32> %a)
   ret <2 x i32> %bswap
 }

 define <4 x i16> @bswap_v4i16(<4 x i16> %a) {
 ; CHECK: 'Cost Model Analysis' for function 'bswap_v4i16':
-; CHECK: Found an estimated cost of 22 for instruction: %bswap
+; CHECK: Found an estimated cost of 28 for instruction: %bswap
   %bswap = call <4 x i16> @llvm.bswap.v4i16(<4 x i16> %a)
   ret <4 x i16> %bswap
 }

 define <2 x i64> @bswap_v2i64(<2 x i64> %a) {
 ; CHECK: 'Cost Model Analysis' for function 'bswap_v2i64':
-; CHECK: Found an estimated cost of 8 for instruction: %bswap
+; CHECK: Found an estimated cost of 14 for instruction: %bswap
   %bswap = call <2 x i64> @llvm.bswap.v2i64(<2 x i64> %a)
   ret <2 x i64> %bswap
 }

 define <4 x i32> @bswap_v4i32(<4 x i32> %a) {
 ; CHECK: 'Cost Model Analysis' for function 'bswap_v4i32':
-; CHECK: Found an estimated cost of 22 for instruction: %bswap
+; CHECK: Found an estimated cost of 28 for instruction: %bswap
   %bswap = call <4 x i32> @llvm.bswap.v4i32(<4 x i32> %a)
   ret <4 x i32> %bswap
 }

 define <8 x i16> @bswap_v8i16(<8 x i16> %a) {
 ; CHECK: 'Cost Model Analysis' for function 'bswap_v8i16':
-; CHECK: Found an estimated cost of 50 for instruction: %bswap
+; CHECK: Found an estimated cost of 56 for instruction: %bswap
   %bswap = call <8 x i16> @llvm.bswap.v8i16(<8 x i16> %a)
   ret <8 x i16> %bswap
 }

test/Analysis/CostModel/AArch64/falkor.ll

 ; RUN: opt < %s -cost-model -analyze -mcpu=falkor | FileCheck %s

 target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64--linux-gnu"

 ; CHECK-LABEL: vectorInstrCost
 define void @vectorInstrCost() {

   ; Vector extracts - extracting the first element should have a zero cost;
   ; all other elements should have a cost of two.
   ;
-  ; CHECK: cost of 0 {{.*}} extractelement <2 x i64> undef, i32 0
+  ; CHECK: cost of 2 {{.*}} extractelement <2 x i64> undef, i32 0
   ; CHECK: cost of 2 {{.*}} extractelement <2 x i64> undef, i32 1
   %t1 = extractelement <2 x i64> undef, i32 0
   %t2 = extractelement <2 x i64> undef, i32 1

   ; Vector inserts - inserting the first element should have a zero cost; all
   ; other elements should have a cost of two.
   ;
-  ; CHECK: cost of 0 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 0
+  ; CHECK: cost of 2 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 0
   ; CHECK: cost of 2 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 1
   %t3 = insertelement <2 x i64> undef, i64 undef, i32 0
   %t4 = insertelement <2 x i64> undef, i64 undef, i32 1

   ret void
 }

test/Analysis/CostModel/AArch64/inserts-extracts.ll

This file was added.

				; RUN: opt < %s -cost-model -analyze | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				define void @floating_point() {
				; CHECK-LABEL: floating_point
				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = extractelement <4 x double> undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp1 = extractelement <4 x double> undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp2 = extractelement <4 x double> undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp3 = extractelement <4 x double> undef, i32 3
				%tmp0 = extractelement <4 x double> undef, i32 0
				%tmp1 = extractelement <4 x double> undef, i32 1
				%tmp2 = extractelement <4 x double> undef, i32 2
				%tmp3 = extractelement <4 x double> undef, i32 3

				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp4 = insertelement <4 x double> undef, double undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp5 = insertelement <4 x double> undef, double undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp6 = insertelement <4 x double> undef, double undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp7 = insertelement <4 x double> undef, double undef, i32 3
				%tmp4 = insertelement <4 x double> undef, double undef, i32 0
				%tmp5 = insertelement <4 x double> undef, double undef, i32 1
				%tmp6 = insertelement <4 x double> undef, double undef, i32 2
				%tmp7 = insertelement <4 x double> undef, double undef, i32 3
				ret void
				}

				define void @integer() {
				; CHECK-LABEL: integer
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp0 = extractelement <4 x i64> undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp1 = extractelement <4 x i64> undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp2 = extractelement <4 x i64> undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp3 = extractelement <4 x i64> undef, i32 3
				%tmp0 = extractelement <4 x i64> undef, i32 0
				%tmp1 = extractelement <4 x i64> undef, i32 1
				%tmp2 = extractelement <4 x i64> undef, i32 2
				%tmp3 = extractelement <4 x i64> undef, i32 3

				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp4 = insertelement <4 x i64> undef, i64 undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp5 = insertelement <4 x i64> undef, i64 undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp6 = insertelement <4 x i64> undef, i64 undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp7 = insertelement <4 x i64> undef, i64 undef, i32 3
				%tmp4 = insertelement <4 x i64> undef, i64 undef, i32 0
				%tmp5 = insertelement <4 x i64> undef, i64 undef, i32 1
				%tmp6 = insertelement <4 x i64> undef, i64 undef, i32 2
				%tmp7 = insertelement <4 x i64> undef, i64 undef, i32 3
				ret void
				}

				define void @pointer() {
				; CHECK-LABEL: pointer
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp0 = extractelement <4 x i8*> undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp1 = extractelement <4 x i8*> undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp2 = extractelement <4 x i8*> undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp3 = extractelement <4 x i8*> undef, i32 3
				%tmp0 = extractelement <4 x i8*> undef, i32 0
				%tmp1 = extractelement <4 x i8*> undef, i32 1
				%tmp2 = extractelement <4 x i8*> undef, i32 2
				%tmp3 = extractelement <4 x i8*> undef, i32 3

				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp4 = insertelement <4 x i8> undef, i8 undef, i32 0
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp5 = insertelement <4 x i8> undef, i8 undef, i32 1
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp6 = insertelement <4 x i8> undef, i8 undef, i32 2
				; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp7 = insertelement <4 x i8> undef, i8 undef, i32 3
				%tmp4 = insertelement <4 x i8> undef, i8 undef, i32 0
				%tmp5 = insertelement <4 x i8> undef, i8 undef, i32 1
				%tmp6 = insertelement <4 x i8> undef, i8 undef, i32 2
				%tmp7 = insertelement <4 x i8> undef, i8 undef, i32 3
				ret void
				}

test/Analysis/CostModel/AArch64/kryo.ll

 ; RUN: opt < %s -cost-model -analyze -mcpu=kryo | FileCheck %s

 target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64--linux-gnu"

 ; CHECK-LABEL: vectorInstrCost
 define void @vectorInstrCost() {

   ; Vector extracts - extracting the first element should have a zero cost;
   ; all other elements should have a cost of two.
   ;
-  ; CHECK: cost of 0 {{.*}} extractelement <2 x i64> undef, i32 0
+  ; CHECK: cost of 2 {{.*}} extractelement <2 x i64> undef, i32 0
   ; CHECK: cost of 2 {{.*}} extractelement <2 x i64> undef, i32 1
   %t1 = extractelement <2 x i64> undef, i32 0
   %t2 = extractelement <2 x i64> undef, i32 1

   ; Vector inserts - inserting the first element should have a zero cost; all
   ; other elements should have a cost of two.
   ;
-  ; CHECK: cost of 0 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 0
+  ; CHECK: cost of 2 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 0
   ; CHECK: cost of 2 {{.*}} insertelement <2 x i64> undef, i64 undef, i32 1
   %t3 = insertelement <2 x i64> undef, i64 undef, i32 0
   %t4 = insertelement <2 x i64> undef, i64 undef, i32 1

   ret void
 }

test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll

 ; REQUIRES: asserts
 ; RUN: opt < %s -loop-vectorize -disable-output -debug-only=loop-vectorize 2>&1 | FileCheck %s --check-prefix=COST
 ; RUN: opt < %s -loop-vectorize -force-vector-width=2 -instcombine -simplifycfg -S | FileCheck %s

 target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64--linux-gnu"

 ; This test checks that we correctly compute the scalarized operands for a
 ; user-specified vectorization factor when interleaving is disabled. We use the
 ; "optsize" attribute to disable all interleaving calculations. A cost of 4
 ; for %tmp4 indicates that we would scalarize it's operand (%tmp3), giving
 ; %tmp4 a lower scalarization overhead.
 ;
 ; COST-LABEL: predicated_udiv_scalarized_operand
-; COST: LV: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i64 %tmp2, %tmp3
+; COST: LV: Found an estimated cost of 7 for VF 2 For instruction: %tmp4 = udiv i64 %tmp2, %tmp3
 ;
 ; CHECK-LABEL: @predicated_udiv_scalarized_operand(
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %entry ], [ [[INDEX_NEXT:%.*]], %[[PRED_UDIV_CONTINUE2:.*]] ]
 ; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i64> [ zeroinitializer, %entry ], [ [[TMP17:%.*]], %[[PRED_UDIV_CONTINUE2]] ]
 ; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i64, i64* %a, i64 [[INDEX]]
 ; CHECK-NEXT: [[TMP1:%.*]] = bitcast i64* [[TMP0]] to <2 x i64>*
 ; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 4
[...]

test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

 ; REQUIRES: asserts
 ; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

 ; This test shows extremely high interleaving cost that, probably, should be fixed.
 ; Due to the high cost, interleaving is not beneficial and the cost model chooses to scalarize
 ; the load instructions.

 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64--linux-gnu"

 %pair = type { i8, i8 }

 ; CHECK-LABEL: test
-; CHECK: Found an estimated cost of 20 for VF 2 For instruction: {{.*}} load i8
+; CHECK: Found an estimated cost of 32 for VF 2 For instruction: {{.*}} load i8
 ; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8
 ; CHECK: vector.body
 ; CHECK: load i8
 ; CHECK: load i8
 ; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

 define void @test(%pair* %p, i64 %n) {
 entry:
[...]

test/Transforms/LoopVectorize/AArch64/interleaved_cost.ll

[...]

 ; The interleave factor in this test is 8, which is greater than the maximum
 ; allowed factor for AArch64 (4). Thus, we will fall back to the basic TTI
 ; implementation for determining the cost of the interleaved load group. The
 ; stores do not form a legal interleaved group because the group would contain
 ; gaps.
 ;
 ; VF_2-LABEL: Checking a loop in "i64_factor_8"
-; VF_2: Found an estimated cost of 6 for VF 2 For instruction: %tmp2 = load i64, i64* %tmp0, align 8
+; VF_2: Found an estimated cost of 24 for VF 2 For instruction: %tmp2 = load i64, i64* %tmp0, align 8
 ; VF_2-NEXT: Found an estimated cost of 0 for VF 2 For instruction: %tmp3 = load i64, i64* %tmp1, align 8
-; VF_2-NEXT: Found an estimated cost of 7 for VF 2 For instruction: store i64 0, i64* %tmp0, align 8
-; VF_2-NEXT: Found an estimated cost of 7 for VF 2 For instruction: store i64 0, i64* %tmp1, align 8
+; VF_2-NEXT: Found an estimated cost of 10 for VF 2 For instruction: store i64 0, i64* %tmp0, align 8
+; VF_2-NEXT: Found an estimated cost of 10 for VF 2 For instruction: store i64 0, i64* %tmp1, align 8
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %tmp0 = getelementptr inbounds %i64.8, %i64.8* %data, i64 %i, i32 2
   %tmp1 = getelementptr inbounds %i64.8, %i64.8* %data, i64 %i, i32 6
   %tmp2 = load i64, i64* %tmp0, align 8
   %tmp3 = load i64, i64* %tmp1, align 8
   store i64 0, i64* %tmp0, align 8
   store i64 0, i64* %tmp1, align 8
   %i.next = add nuw nsw i64 %i, 1
   %cond = icmp slt i64 %i.next, %n
   br i1 %cond, label %for.body, label %for.end

 for.end:
   ret void
 }

test/Transforms/LoopVectorize/AArch64/predication_costs.ll

	Show All 10 Lines

	; CHECK-LABEL: predicated_udiv			; CHECK-LABEL: predicated_udiv
	;			;
	; This test checks that we correctly compute the cost of the predicated udiv			; This test checks that we correctly compute the cost of the predicated udiv
	; instruction. If we assume the block probability is 50%, we compute the cost			; instruction. If we assume the block probability is 50%, we compute the cost
	; as:			; as:
	;			;
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (udiv(2) + extractelement(12) + insertelement(6)) / 2 = 10
	;			;
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Found an estimated cost of 10 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
	;			;
	define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {			define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	Show All 22 Lines

	; CHECK-LABEL: predicated_store			; CHECK-LABEL: predicated_store
	;			;
	; This test checks that we correctly compute the cost of the predicated store			; This test checks that we correctly compute the cost of the predicated store
	; instruction. If we assume the block probability is 50%, we compute the cost			; instruction. If we assume the block probability is 50%, we compute the cost
	; as:			; as:
	;			;
	; Cost of store:			; Cost of store:
	; (store(4) + extractelement(3)) / 2 = 3			; (store(4) + extractelement(6)) / 2 = 5
	;			;
	; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Found an estimated cost of 5 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
	;			;
	define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	[18 lines not shown]
	; CHECK-LABEL: predicated_udiv_scalarized_operand			; CHECK-LABEL: predicated_udiv_scalarized_operand
	;			;
	; This test checks that we correctly compute the cost of the predicated udiv			; This test checks that we correctly compute the cost of the predicated udiv
	; instruction and the add instruction it uses. The add is scalarized and sunk			; instruction and the add instruction it uses. The add is scalarized and sunk
	; inside the predicated block. If we assume the block probability is 50%, we			; inside the predicated block. If we assume the block probability is 50%, we
	; compute the cost as:			; compute the cost as:
	;			;
	; Cost of add:			; Cost of add:
	; (add(2) + extractelement(3)) / 2 = 2			; (add(2) + extractelement(6)) / 2 = 4
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4			; (udiv(2) + extractelement(6) + insertelement(6)) / 2 = 7
	;			;
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x			; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
	; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Found an estimated cost of 7 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x			; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
	;			;
	define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {			define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	[23 lines not shown]
	; CHECK-LABEL: predicated_store_scalarized_operand			; CHECK-LABEL: predicated_store_scalarized_operand
	;			;
	; This test checks that we correctly compute the cost of the predicated store			; This test checks that we correctly compute the cost of the predicated store
	; instruction and the add instruction it uses. The add is scalarized and sunk			; instruction and the add instruction it uses. The add is scalarized and sunk
	; inside the predicated block. If we assume the block probability is 50%, we			; inside the predicated block. If we assume the block probability is 50%, we
	; compute the cost as:			; compute the cost as:
	;			;
	; Cost of add:			; Cost of add:
	; (add(2) + extractelement(3)) / 2 = 2			; (add(2) + extractelement(6)) / 2 = 4
	; Cost of store:			; Cost of store:
	; store(4) / 2 = 2			; store(4) / 2 = 2
	;			;
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x			; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x			; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
	; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
	;			;
	define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	[24 lines not shown]
	; and predicated. The sub feeding the store is scalarized and sunk inside the			; and predicated. The sub feeding the store is scalarized and sunk inside the
	; store's predicated block. However, the add feeding the sdiv and udiv cannot			; store's predicated block. However, the add feeding the sdiv and udiv cannot
	; be sunk and is not scalarized. If we assume the block probability is 50%, we			; be sunk and is not scalarized. If we assume the block probability is 50%, we
	; compute the cost as:			; compute the cost as:
	;			;
	; Cost of add:			; Cost of add:
	; add(1) = 1			; add(1) = 1
	; Cost of sdiv:			; Cost of sdiv:
	; (sdiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (sdiv(2) + extractelement(12) + insertelement(6)) / 2 = 10
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (udiv(2) + extractelement(12) + insertelement(6)) / 2 = 10
	; Cost of sub:			; Cost of sub:
	; (sub(2) + extractelement(3)) / 2 = 2			; (sub(2) + extractelement(6)) / 2 = 4
	; Cost of store:			; Cost of store:
	; store(4) / 2 = 2			; store(4) / 2 = 2
	;			;
	; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x			; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2			; CHECK: Found an estimated cost of 10 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2			; CHECK: Found an estimated cost of 10 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x			; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4			; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
	; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x			; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
	; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2			; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
	; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x			; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
	; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
	;			;
	define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
	[25 lines not shown]
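
The expected values in the comments above only line up if the division by two truncates: (2 + 6 + 3) / 2 is reported as 5, not 5.5. A minimal sketch of that arithmetic follows, assuming integer division and a 50% block probability as the test comments state. The names `predicated_cost` and `reciprocal_block_prob` are hypothetical and are not taken from the vectorizer source.

```python
def predicated_cost(op_cost, extract_cost=0, insert_cost=0, reciprocal_block_prob=2):
    """Estimated cost of a scalarized, predicated instruction.

    The component costs are summed and then scaled by the assumed block
    probability. A probability of 50% is modelled as a division by 2, and
    the division is assumed to truncate (integer division).
    """
    return (op_cost + extract_cost + insert_cost) // reciprocal_block_prob


# predicated_udiv, with the old and new insert/extract costs from the diff.
old_udiv = predicated_cost(op_cost=2, extract_cost=6, insert_cost=3)
new_udiv = predicated_cost(op_cost=2, extract_cost=12, insert_cost=6)

# predicated_store, with the old and new extract costs from the diff.
old_store = predicated_cost(op_cost=4, extract_cost=3)
new_store = predicated_cost(op_cost=4, extract_cost=6)

print(old_udiv, new_udiv, old_store, new_store)
```

The same helper reproduces the other figures in this file, for example the scalarized add, (2 + 3) / 2 = 2 before and (2 + 6) / 2 = 4 after, and the store with a sunk operand, 4 / 2 = 2 in both columns.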

test/Transforms/SLPVectorizer/AArch64/gather-root.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S | FileCheck %s --check-prefix=DEFAULT			; RUN: opt < %s -slp-vectorizer -S | FileCheck %s --check-prefix=DEFAULT
	; RUN: opt < %s -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-30 -slp-vectorizer -S | FileCheck %s --check-prefix=GATHER			; RUN: opt < %s -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-37 -slp-vectorizer -S | FileCheck %s --check-prefix=GATHER
	; RUN: opt < %s -slp-schedule-budget=0 -slp-threshold=-30 -slp-vectorizer -S | FileCheck %s --check-prefix=MAX-COST			; RUN: opt < %s -slp-schedule-budget=0 -slp-threshold=-30 -slp-vectorizer -S | FileCheck %s --check-prefix=MAX-COST

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64--linux-gnu"			target triple = "aarch64--linux-gnu"

	@a = common global [80 x i8] zeroinitializer, align 16			@a = common global [80 x i8] zeroinitializer, align 16

	; DEFAULT-LABEL: @PR28330(			; DEFAULT-LABEL: @PR28330(
	[204 lines not shown]

test/Transforms/SLPVectorizer/AArch64/getelementptr.ll

	; RUN: opt -S -slp-vectorizer -slp-threshold=-18 -dce -instcombine < %s | FileCheck %s			; RUN: opt -S -slp-vectorizer -slp-threshold=-23 -dce -instcombine < %s | FileCheck %s

	target datalayout = "e-m:e-i32:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i32:64-i128:128-n32:64-S128"
	target triple = "aarch64--linux-gnu"			target triple = "aarch64--linux-gnu"

	; These tests check that we remove from consideration pairs of seed			; These tests check that we remove from consideration pairs of seed
	; getelementptrs when they are known to have a constant difference. Such pairs			; getelementptrs when they are known to have a constant difference. Such pairs
	; are likely not good candidates for vectorization since one can be computed			; are likely not good candidates for vectorization since one can be computed
	; from the other. We use an unprofitable threshold to force vectorization.			; from the other. We use an unprofitable threshold to force vectorization.
	[102 lines not shown]

test/Transforms/SLPVectorizer/AArch64/horizontal.ll

	; RUN: opt -slp-vectorizer -slp-threshold=-6 -S < %s | FileCheck %s			; RUN: opt -slp-vectorizer -slp-threshold=-11 -S < %s | FileCheck %s

	; FIXME: The threshold is changed to keep this test case a bit smaller.			; FIXME: The threshold is changed to keep this test case a bit smaller.
	; The AArch64 cost model should not give such high costs to select statements.			; The AArch64 cost model should not give such high costs to select statements.

	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64--linux"			target triple = "aarch64--linux"

	; CHECK-LABEL: test_select			; CHECK-LABEL: test_select
	[261 lines not shown]

test/Transforms/SLPVectorizer/AArch64/sdiv-pow2.ll

	; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 | FileCheck %s			; RUN: opt < %s -basicaa -slp-vectorizer -slp-threshold=-5 -S -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 | FileCheck %s
	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64--linux-gnu"			target triple = "aarch64--linux-gnu"

	; CHECK-LABEL: @test1			; CHECK-LABEL: @test1
	; CHECK: load <4 x i32>			; CHECK: load <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add nsw <4 x i32>
	; CHECK: sdiv <4 x i32>			; CHECK: sdiv <4 x i32>

	[33 lines not shown]

Revision Contents

Diff 97719

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

test/Analysis/CostModel/AArch64/bswap.ll

test/Analysis/CostModel/AArch64/falkor.ll

test/Analysis/CostModel/AArch64/inserts-extracts.ll

test/Analysis/CostModel/AArch64/kryo.ll

test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll

test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

test/Transforms/LoopVectorize/AArch64/interleaved_cost.ll

test/Transforms/LoopVectorize/AArch64/predication_costs.ll

test/Transforms/SLPVectorizer/AArch64/gather-root.ll

test/Transforms/SLPVectorizer/AArch64/getelementptr.ll

test/Transforms/SLPVectorizer/AArch64/horizontal.ll

test/Transforms/SLPVectorizer/AArch64/sdiv-pow2.ll
