This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorizer] Try different vectorization factors and set max vector register size based on target
ClosedPublic

Authored by spatel on Jul 5 2015, 9:22 PM.

Details

Summary

This patch is based on discussion on the llvmdev mailing list:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-July/087405.html

and also solves:
https://llvm.org/bugs/show_bug.cgi?id=17170

As mentioned on the dev list and bug report, the new loop on the vector register size may cause an unacceptable compile-time increase, so this may need to be shielded by some more aggressive optimization specification. If not, this patch could be extended to other SLP pattern matchers that hardcode the vector register size (see FIXME comments).

The AMDGPU XFAIL test should either be fixed, or removed if it's no longer valid.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 29061.Jul 5 2015, 9:22 PM
spatel retitled this revision from to [SLPVectorizer] Try different vectorization factors and set max vector register size based on target.
spatel updated this object.
spatel added reviewers: hfinkel, nadav, rengolin, arsenm.
spatel added a subscriber: llvm-commits.
nadav edited edge metadata.Jul 6 2015, 10:17 AM

Sanjay, this patch looks okay. I think the compile-time hit should be minimal, but we need to measure just to be sure.

spatel added a comment.Jul 6 2015, 2:59 PM

Thanks, Nadav. I'm collecting some data using test-suite now. Should have something posted here tomorrow.

spatel added a comment.Jul 7 2015, 9:22 AM

I ran the benchmarking subset of test-suite on (and targeting) an AMD Jaguar system (it has AVX, so 256-bit SLP was activated).

  1. The 256-bit vector store optimization fired 51 times on 16 different tests, so looking for the wider vector does appear to be a useful thing to do based on this sample of tests.
  2. There's no measurable perf difference on any of those tests; this is somewhat expected given that Jaguar has a double-pumped AVX implementation (128-bit data paths).
  3. The sum of average CC_Time (3 trials) for the tests was 121.21 seconds for the baseline (only check for 128-bit SLP) and 121.34 seconds for the new version (looks for 256-bit SLP before 128-bit), so about 0.1% longer, but that difference is in the noise for the compile times in this data set.

Hi Sanjay,

I think this patch is good to commit as-is, though I have one question (I'm ok with just adding TODO for now).

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
4020–4022 ↗(On Diff #29061)

Shouldn't we update this threshold too? Otherwise, we won't be able to vectorize with VF=32 (and AVX2 might need <32 x i8> vectors).

However, increasing this value *would* hurt compile time, so we need to be careful here.

spatel added a comment.Jul 7 2015, 6:01 PM

> I think this patch is good to commit as-is, though I have one question (I'm ok with just adding TODO for now).

Thanks, Michael!

You're right; we need to increase that limit to vectorize more than 16 elements at a time. I'll make that a TODO and then add another cl::opt override, so we can experiment with that setting. This raises another problem: AVX has 256-bit registers, but it can't handle <32 x i8> ops, so creating those here would be useless. Using the data type rather than the register size could get us more optimizations while limiting the compile-time explosion.

Adding Tom:
I'd prefer not to have to XFAIL the AMDGPU test. Can you provide some guidance about what the expected behavior should be there? Thanks!

This revision was automatically updated to reflect the committed changes.