This is an archive of the discontinued LLVM Phabricator instance.

Differential D14185

Extend SLP Vectorizer to deal with aggregates
ClosedPublic

Authored by ArchDRobison on Oct 29 2015, 12:59 PM.

Details

Reviewers

nadav
mzolotukhin
majnemer
aschwaighofer
hfinkel

Summary

This patch extends the SLP Vectorizer to vectorize code involving structs/arrays that resemble vectors. The motivation, as explained here, is to vectorize tuples and structs in Julia that act like vectors.

Per Arnold Schwaighofer's suggestion, the improvement works backwards from a store of insert value instructions. The associated patch D14260 adds a peephole optimization to inst-combine to clean up.

I limited the SLPVectorization change to aggregates that resemble vectors of length 2, 4, 8, or 16 since other lengths seem unlikely to pay off, but I could be wrong.

In some spots, I changed the logic for "extractelement" to be "polymorphic" over "extractelement" or "extractvalue". In other places (e.g. findBuildVector), that seemed to be more trouble than it was worth, so I coded custom logic (e.g. findBuildAggregate).

Event Timeline

ArchDRobison updated this revision to Diff 38757.Oct 29 2015, 12:59 PM

ArchDRobison retitled this revision from to Extende SLP Vectorizer to deal with aggregates.

ArchDRobison updated this object.

ArchDRobison added reviewers: aschwaighofer, majnemer.

ArchDRobison added a subscriber: llvm-commits.

ArchDRobison retitled this revision from Extende SLP Vectorizer to deal with aggregates to Extend SLP Vectorizer to deal with aggregates.Oct 30 2015, 11:07 AM

loladiro added a subscriber: loladiro.Oct 30 2015, 12:02 PM

aschwaighofer added a reviewer: mzolotukhin.Nov 1 2015, 9:34 AM

aschwaighofer added inline comments.

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
875 ↗	(On Diff #38757)	I think we must also check for DataLayout::getTypeSizeInBits(VectorType) == DataLayout::getTypeSizeInBits(StructType/ArrayType) to show that it is truly isomorphic. The allocated size for elements in an aggregate might be different because of padding. For example: {<i2, i2>} != <2 x i2>
lib/Transforms/Vectorize/SLPVectorizer.cpp
284	Similar to my previous comment we must check that the type sizes are equivalent.
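
The padding concern above can be illustrated with a small self-contained model. This is only a sketch and not LLVM code: the helper names are hypothetical, only integer element types are modeled, and the allocation rule (each array element rounded up to a power-of-two number of bytes) is an assumption that resembles typical data layouts rather than a query of a real DataLayout.

```cpp
#include <cstdint>

// Hypothetical toy model; none of these helpers are LLVM APIs.

// Round a bit count up to a whole number of bytes, expressed in bits.
std::uint64_t roundUpToBytes(std::uint64_t Bits) { return (Bits + 7) / 8 * 8; }

// Assumption: an iN array element occupies its byte size rounded up to a
// power of two (e.g. i2 -> 1 byte, i24 -> 4 bytes).
std::uint64_t elementAllocSizeInBits(std::uint64_t ElemBits) {
  std::uint64_t Bytes = (ElemBits + 7) / 8;
  std::uint64_t Alloc = 1;
  while (Alloc < Bytes)
    Alloc *= 2;
  return Alloc * 8;
}

// Array elements are padded individually.
std::uint64_t arrayStoreSizeInBits(std::uint64_t ElemBits, std::uint64_t N) {
  return N * elementAllocSizeInBits(ElemBits);
}

// Vector elements are modeled as bit-packed; the total is rounded up to bytes.
std::uint64_t vectorStoreSizeInBits(std::uint64_t ElemBits, std::uint64_t N) {
  return roundUpToBytes(ElemBits * N);
}

// True if the modeled array and vector occupy the same number of bits.
bool sameStoreSize(std::uint64_t ElemBits, std::uint64_t N) {
  return arrayStoreSizeInBits(ElemBits, N) == vectorStoreSizeInBits(ElemBits, N);
}
```

Under this model, two i2 elements occupy 16 bits as an array but 8 bits as a vector, so the two layouts differ, while four i32 elements occupy 128 bits either way.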

Added DataLayout size checks per Arnold's advice.

Hi,

It looks good in general. Please separate changes for instcombine and SLP and find some remarks from me inline (mostly nitpicks).

Thanks,
Michael

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
842 ↗	(On Diff #38938)	s/underlyingvalue/underlying value/
843 ↗	(On Diff #38938)	I'd prefer to have an example in the comment as well.
850 ↗	(On Diff #38938)	Unnecessary curly braces?
lib/Transforms/Vectorize/SLPVectorizer.cpp
292	I think it should depend on available vector register size, i.e. on `MaxVectorRegSizeOption` and `MinVecRegSize`.
4287	Nitpick: missing dot in the end.
test/Transforms/SLPVectorizer/X86/insertvalue.ll
2	Could you please also add a test checking that `[2 x i2]` isn't vectorized to `<2 x i2>` here? Similar to other tests in `test/Transforms/SLPVectorizer/X86/bad_types.ll`

jevinskie added a subscriber: jevinskie.Nov 4 2015, 11:07 AM

This revision addresses Michael Zolotukhin's comments.

"instcombine" portion split out to D14260.
Adds negative test for NOT vectorizing to <4 x i2>. To check that the i2 is the stopper, and not some other oversight, I also added a positive similar test for <4 x i4>.
The test of whether an aggregate type can be mapped to a vector is changed to consider the min and max hardware vector sizes.

The last change required some restructuring to make MaxVecRegSize accessible where it was needed in canMapToVector. The restructuring is:

Member MaxVecRegSize moved from class SLPVectorizer to BoUpSLP.
CanReuseExtract and canMapToVector changed from global functions to members of BoUpSLP.
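
The register-size logic described above can be sketched as follows. This is a simplified stand-alone model, not the patch itself: the struct and function names are hypothetical, and the command-line option is modeled as a plain flag plus a value.

```cpp
#include <cstdint>

// Hypothetical model of a command-line option that may or may not be given.
struct RegSizeOption {
  bool Specified;
  unsigned Bits;
};

// Use the option when it was given explicitly, otherwise the target's width.
unsigned resolveMaxVecRegSize(const RegSizeOption &Opt, unsigned TargetBits) {
  return Opt.Specified ? Opt.Bits : TargetBits;
}

// An aggregate is only a candidate if its vector form fits between the
// minimum and maximum vector register sizes (both inclusive).
bool fitsVectorRegisters(std::uint64_t VectorBits, unsigned MinBits,
                         unsigned MaxBits) {
  return VectorBits >= MinBits && VectorBits <= MaxBits;
}
```

For example, with a minimum of 128 bits and a target width of 256 bits, a 128-bit pair of doubles is a candidate, while a 64-bit pair of floats is not.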

hfinkel added a subscriber: hfinkel.Dec 10 2015, 5:32 PM

hfinkel added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
283	Add spaces around ==.
291	You assert this here, but where is it checked?
292	Same here: spaces around ==.
1534	Adjust spaces: if (Opcode ==
1539	for -> if
4565	Don't need {} here.

Ayal added a subscriber: Ayal.Dec 21 2015, 9:50 AM

Ayal added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
287–290	maybe better to wrap return value inside ( )
290	cast should be doing the assert above it.
291	Should check similar to the way GEPs are checked to be 2 indexed. Also spaces around ==
1506	better call local isValidElementType to catch FP80, FP128 fwiw.
1509	spaces around >
1522	may be simpler to obtain Opcode from getSameOpcode(VL), instead of passing it.
2301	else ... return Gather? (but w/o explicit 'else')

mcrosier added subscribers: mssimpso, bmakam.Dec 22 2015, 6:39 AM

Updated per most of Hal and Ayal's suggestions. I'll add replies to line comments shortly.

Added commentary to line comments by others.

lib/Transforms/Vectorize/SLPVectorizer.cpp
287–290	My preference is to avoid superfluous parentheses around return values. The style of the source file varies on this point, but in general seems to avoid parentheses around return values of similar complexity.
290	The assert is checking Opcode, not E. The assert is really intended to check that the parameter Opcode is correct on entry. To make this clearer, I moved it up to the entry and make it check that Opcode is ExtractElement or ExtractValue.
291	I'll remove the assert and make the check part of the return value, i.e. return EI->getNumIndices() == 1 && *EI->idx_begin() == i;
1506	Thanks! I hadn't noticed that subtlety.
1522	I was concerned about not adding more overhead than necessary, though maybe it's not a big deal in practice.
2301	There are two ways to fix this. The "canReuseExtract" check could be an assertion, to verify that earlier checking (as it stands in the patch) cancels the scheduling if canReuseExtract returns false. But it seems better to use Gather as you suggest, in case the check is relaxed in the future.
test/Transforms/SLPVectorizer/X86/insertvalue.ll
3	Fixed by addition of @julia_load_array_of_i16 .
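
The index check discussed in these replies can be sketched with a stand-alone model. The types and names below are hypothetical stand-ins for the real instruction classes, and the sketch leaves out the additional requirement in the patch that an extractvalue source be a simple load.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an extractelement/extractvalue instruction.
struct ExtractModel {
  int SourceId;                  // Identifies the vector or aggregate operand.
  std::vector<unsigned> Indices; // Constant indices used by the extract.
};

// True if the extract uses exactly one index and that index equals Idx.
bool matchesIndex(const ExtractModel &E, unsigned Idx) {
  return E.Indices.size() == 1 && E.Indices[0] == Idx;
}

// True if position I extracts element I, all extracts share one source,
// and the bundle covers every element of that source.
bool canReuseSource(const std::vector<ExtractModel> &VL, std::size_t NumElts) {
  if (VL.empty() || VL.size() != NumElts)
    return false;
  for (std::size_t I = 0; I < VL.size(); ++I) {
    if (!matchesIndex(VL[I], static_cast<unsigned>(I)))
      return false;
    if (VL[I].SourceId != VL[0].SourceId)
      return false;
  }
  return true;
}
```

Making the single-index test part of the return value, rather than an assertion, means a nested extract such as one with indices 0, 0 is simply rejected instead of aborting.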

Hi,

The changes look good to me modulo some small remarks, but you might want to wait for other reviewers' "ok" too.
Also, when committing, please check-in clean-ups and refactoring separately from the main part.

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
282	Nitpick: `i` should be capitalized (or should we use another name at all?).
357–366	I'd suggest committing this refactoring as a separate patch to separate NFC changes from others.
423	Nitpick: `s/GetMaxVecRegSize/getMaxVecRegSize/`
1178	This clean-up also should be committed separately.
1509	Should we use `getTypeSizeInBits`, or `getTypeStoreSizeInBits`? I remember a bug when we used a wrong one.
1513–1517	This probably can be rewritten as a range loop, or even with `std::all_of`?
test/Transforms/SLPVectorizer/X86/insertvalue.ll
4–7	Please add `CHECK-LABEL` statements before each function. That'll help to avoid possible interference between `CHECK` statements from different tests.
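
As an illustration of the `std::all_of` suggestion for lines 1513–1517, a homogeneity check could look roughly like this. It is a sketch over a hypothetical model in which each struct field is represented by its type name, not over LLVM's `StructType`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model: a struct is a list of field type names.
// Returns true if there is at least one field and all fields share one type.
bool isHomogeneous(const std::vector<std::string> &FieldTypes) {
  if (FieldTypes.empty())
    return false;
  const std::string &First = FieldTypes.front();
  return std::all_of(FieldTypes.begin(), FieldTypes.end(),
                     [&First](const std::string &Ty) { return Ty == First; });
}
```

A range-based loop with an early return expresses the same check; the choice between the two is mostly a matter of style.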

This revision is now accepted and ready to land.Jan 28 2016, 11:45 AM

ArchDRobison mentioned this in D14260: Optimize store of "bitcast" from vector to aggregate..Jan 29 2016, 12:45 PM

Cosmetic changes per Michael Zolotukhin's comments.

ArchDRobison marked 4 inline comments as done.Jan 29 2016, 12:56 PM

ArchDRobison added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
282	I changed it to `Idx`, since that seemed to be used in other spots for variables representing LLVM indices.
1509	The fundamental issue is type-punning for load/store, so the "store size" seems more appropriate. My recent comment for http://reviews.llvm.org/D14260 with the contrived i24 example says more.

Is this still waiting for other reviews? I'd like to get this functionality upstream; I just ran into this situation again today.

On x86_64/Linux (at least compiling with libstdc++), compiling the following (-O3 -ffast-math):

#include <complex>
using namespace std;

complex<double> foo(complex<double> a, complex<double> b) {
  return a*b;
}

yields this:

define { double, double } @_Z3fooSt7complexIdES0_(double, double, double, double) #0 {
  %5 = fmul fast double %3, %0
  %6 = fmul fast double %2, %1
  %7 = fadd fast double %5, %6
  %8 = fmul fast double %2, %0
  %9 = fmul fast double %3, %1
  %10 = fsub fast double %8, %9
  %11 = insertvalue { double, double } undef, double %10, 0
  %12 = insertvalue { double, double } %11, double %7, 1
  ret { double, double } %12
}

and we completely miss this kind of code because the natural chain starts at the insertvalues.
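
To make the missed opportunity concrete, here is a hand-written sketch of how the two insertvalue lanes pair up. It does not claim to be what the SLP vectorizer emits; the function name is hypothetical, each `std::array<double, 2>` stands in for a `<2 x double>` value, and the mapping of the four IR arguments to real and imaginary parts of a and b (in that order) is an assumption.

```cpp
#include <array>

// Hypothetical helper: computes {real, imaginary} of a*b two lanes at a time.
std::array<double, 2> mulComplexLanes(double ARe, double AIm, double BRe,
                                      double BIm) {
  // Lane 0 mirrors %8 and %9 in the IR above; lane 1 mirrors %5 and %6.
  const std::array<double, 2> Lhs = {{BRe * ARe, BIm * ARe}};
  const std::array<double, 2> Rhs = {{BIm * AIm, BRe * AIm}};
  // Lane 0 subtracts (real part, %10); lane 1 adds (imaginary part, %7).
  const std::array<double, 2> Result = {{Lhs[0] - Rhs[0], Lhs[1] + Rhs[1]}};
  return Result;
}
```

The two multiplies per lane are isomorphic, which is what makes the chain a natural SLP candidate; only the final add/sub differs between the lanes.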

It's waiting on a review of http://reviews.llvm.org/D14260 , which is a prerequisite, and was part of the original proposed patch.

Diffusion mentioned this in rL267482: Optimize store of "bitcast" from vector to aggregate..Apr 25 2016, 3:28 PM

This update is against today's LLVM sources. The only functional change is that the patch moves MinVecRegSize to the same place as MaxVecRegSize. Note that the prerequisite patch D14260 is now committed, so it would be a good time to give this patch a final look over.

Herald added a subscriber: mzolotukhin.Apr 26 2016, 1:53 PM

LGTM too.

lib/Transforms/Vectorize/SLPVectorizer.cpp
357–366	Yes, please split this when committing. The refactoring should be separate.
1512	Comment should be a sentence that ends with a period.
1537	Comment should end with a period.

Bump. I assume this didn't get merged because @ArchDRobison does not have commit privileges? Shall I go ahead and land this and D14260?

In D14185#484556, @loladiro wrote:

Bump. I assume this didn't get merged because @ArchDRobison does not have commit privileges? Shall I go ahead and land this and D14260?

SGTM.

Never mind, both of these were applied, but without the magic string that auto-closes the review (and I looked at the wrong code base initially). This was applied in rL267899.

Revision Contents

Path	Size
lib/Transforms/Vectorize/SLPVectorizer.cpp	236 lines
test/Transforms/SLPVectorizer/X86/insertvalue.ll	189 lines

Diff 55086

lib/Transforms/Vectorize/SLPVectorizer.cpp

[272 lines hidden]	static Type* getSameType(ArrayRef<Value *> VL) {
Type *Ty = VL[0]->getType();		Type *Ty = VL[0]->getType();
for (int i = 1, e = VL.size(); i < e; i++)		for (int i = 1, e = VL.size(); i < e; i++)
if (VL[i]->getType() != Ty)		if (VL[i]->getType() != Ty)
return nullptr;		return nullptr;

return Ty;		return Ty;
}		}

/// \returns True if the ExtractElement instructions in VL can be vectorized		/// \returns True if Extract{Value,Element} instruction extracts element Idx.
/// to use the original vector.		static bool matchExtractIndex(Instruction *E, unsigned Idx, unsigned Opcode) {
static bool CanReuseExtract(ArrayRef<Value *> VL) {		assert(Opcode == Instruction::ExtractElement ||
assert(Instruction::ExtractElement == getSameOpcode(VL) && "Invalid opcode");		Opcode == Instruction::ExtractValue);
// Check if all of the extracts come from the same vector and from the		if (Opcode == Instruction::ExtractElement) {
// correct offset.
Value *VL0 = VL[0];
ExtractElementInst *E0 = cast<ExtractElementInst>(VL0);
Value *Vec = E0->getOperand(0);

// We have to extract from the same vector type.
unsigned NElts = Vec->getType()->getVectorNumElements();

if (NElts != VL.size())
return false;

// Check that all of the indices extract from the correct offset.
ConstantInt *CI = dyn_cast<ConstantInt>(E0->getOperand(1));
if (!CI || CI->getZExtValue())
return false;

for (unsigned i = 1, e = VL.size(); i < e; ++i) {
ExtractElementInst *E = cast<ExtractElementInst>(VL[i]);
ConstantInt *CI = dyn_cast<ConstantInt>(E->getOperand(1));		ConstantInt *CI = dyn_cast<ConstantInt>(E->getOperand(1));
		return CI && CI->getZExtValue() == Idx;
if (!CI || CI->getZExtValue() != i || E->getOperand(0) != Vec)		} else {
return false;		ExtractValueInst *EI = cast<ExtractValueInst>(E);
		return EI->getNumIndices() == 1 && *EI->idx_begin() == Idx;
}		}

return true;
}		}

/// \returns True if in-tree use also needs extract. This refers to		/// \returns True if in-tree use also needs extract. This refers to
/// possible scalar operand in vectorized instruction.		/// possible scalar operand in vectorized instruction.
static bool InTreeUserNeedToExtract(Value *Scalar, Instruction *UserInst,		static bool InTreeUserNeedToExtract(Value *Scalar, Instruction *UserInst,
TargetLibraryInfo *TLI) {		TargetLibraryInfo *TLI) {

unsigned Opcode = UserInst->getOpcode();		unsigned Opcode = UserInst->getOpcode();
switch (Opcode) {		switch (Opcode) {
[48 lines hidden]	public:
BoUpSLP(Function *Func, ScalarEvolution *Se, TargetTransformInfo *Tti,		BoUpSLP(Function *Func, ScalarEvolution *Se, TargetTransformInfo *Tti,
TargetLibraryInfo *TLi, AliasAnalysis *Aa, LoopInfo *Li,		TargetLibraryInfo *TLi, AliasAnalysis *Aa, LoopInfo *Li,
DominatorTree *Dt, AssumptionCache *AC, DemandedBits *DB,		DominatorTree *Dt, AssumptionCache *AC, DemandedBits *DB,
const DataLayout *DL)		const DataLayout *DL)
: NumLoadsWantToKeepOrder(0), NumLoadsWantToChangeOrder(0), F(Func),		: NumLoadsWantToKeepOrder(0), NumLoadsWantToChangeOrder(0), F(Func),
SE(Se), TTI(Tti), TLI(TLi), AA(Aa), LI(Li), DT(Dt), AC(AC), DB(DB),		SE(Se), TTI(Tti), TLI(TLi), AA(Aa), LI(Li), DT(Dt), AC(AC), DB(DB),
DL(DL), Builder(Se->getContext()) {		DL(DL), Builder(Se->getContext()) {
CodeMetrics::collectEphemeralValues(F, AC, EphValues);		CodeMetrics::collectEphemeralValues(F, AC, EphValues);
		// Use the vector register size specified by the target unless overridden
		// by a command-line option.
		// TODO: It would be better to limit the vectorization factor based on
		// data type rather than just register size. For example, x86 AVX has
		// 256-bit registers, but it does not support integer operations
		// at that width (that requires AVX2).
		if (MaxVectorRegSizeOption.getNumOccurrences())
		MaxVecRegSize = MaxVectorRegSizeOption;
		else
		MaxVecRegSize = TTI->getRegisterBitWidth(true);

		MinVecRegSize = MinVectorRegSizeOption;
}		}

/// \brief Vectorize the tree that starts with the elements in \p VL.		/// \brief Vectorize the tree that starts with the elements in \p VL.
/// Returns the vectorized root.		/// Returns the vectorized root.
Value *vectorizeTree();		Value *vectorizeTree();

/// \returns the cost incurred by unwanted spills and fills, caused by		/// \returns the cost incurred by unwanted spills and fills, caused by
/// holding live values over call sites.		/// holding live values over call sites.
[37 lines hidden]	public:
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);

/// Compute the minimum type sizes required to represent the entries in a		/// Compute the minimum type sizes required to represent the entries in a
/// vectorizable tree.		/// vectorizable tree.
void computeMinimumValueSizes();		void computeMinimumValueSizes();

		// \returns maximum vector register size as set by TTI or overridden by cl::opt.
		unsigned getMaxVecRegSize() const {
		return MaxVecRegSize;
		}

		// \returns minimum vector register size as set by cl::opt.
		unsigned getMinVecRegSize() const {
		return MinVecRegSize;
		}

		/// \brief Check if ArrayType or StructType is isomorphic to some VectorType.
		///
		/// \returns number of elements in vector if isomorphism exists, 0 otherwise.
		unsigned canMapToVector(Type *T, const DataLayout &DL) const;

private:		private:
struct TreeEntry;		struct TreeEntry;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);

		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
		bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.
Value *vectorizeTree(TreeEntry *E);		Value *vectorizeTree(TreeEntry *E);

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL);

/// \returns the pointer to the vectorized value if \p VL is already		/// \returns the pointer to the vectorized value if \p VL is already
/// vectorized, or NULL. They may happen in cycles.		/// vectorized, or NULL. They may happen in cycles.
[475 lines hidden]	#endif
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
AssumptionCache *AC;		AssumptionCache *AC;
DemandedBits *DB;		DemandedBits *DB;
const DataLayout *DL;		const DataLayout *DL;
		unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.
		unsigned MinVecRegSize; // Set by cl::opt (default: 128).
/// Instruction builder to construct the vectorized tree.		/// Instruction builder to construct the vectorized tree.
IRBuilder<> Builder;		IRBuilder<> Builder;

/// A map of scalar integer values to the smallest bit width with which they		/// A map of scalar integer values to the smallest bit width with which they
/// can legally be represented.		/// can legally be represented.
MapVector<Value *, uint64_t> MinBWs;		MapVector<Value *, uint64_t> MinBWs;
};		};

[217 lines hidden]	case Instruction::PHI: {
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
Operands.push_back(cast<PHINode>(VL[j])->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(VL[j])->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = CanReuseExtract(VL);		bool Reuse = canReuseExtract(VL, Opcode);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
} else {		} else {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
}		}
newTreeEntry(VL, Reuse);		newTreeEntry(VL, Reuse);
return;		return;
}		}
[300 lines hidden]	switch (Opcode) {
default:		default:
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
		unsigned N;
		Type *EltTy;
		auto *ST = dyn_cast<StructType>(T);
		if (ST) {
		N = ST->getNumElements();
		EltTy = *ST->element_begin();
		} else {
		N = cast<ArrayType>(T)->getNumElements();
		EltTy = cast<ArrayType>(T)->getElementType();
		}
		if (!isValidElementType(EltTy))
		return 0;
		uint64_t VTSize = DL.getTypeStoreSizeInBits(VectorType::get(EltTy, N));
		if (VTSize < MinVecRegSize || VTSize > MaxVecRegSize || VTSize != DL.getTypeStoreSizeInBits(T))
		return 0;
		if (ST) {
		// Check that struct is homogeneous
		for (const auto *Ty : ST->elements())
		if (Ty != EltTy)
		return 0;
		}
		return N;
		}

		bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const {
		assert(Opcode == Instruction::ExtractElement ||
		Opcode == Instruction::ExtractValue);
		assert(Opcode == getSameOpcode(VL) && "Invalid opcode");
		// Check if all of the extracts come from the same vector and from the
		// correct offset.
		Value *VL0 = VL[0];
		Instruction *E0 = cast<Instruction>(VL0);
		Value *Vec = E0->getOperand(0);

		// We have to extract from a vector/aggregate with the same number of elements.
		unsigned NElts;
		if (Opcode == Instruction::ExtractValue) {
		const DataLayout &DL = E0->getModule()->getDataLayout();
		NElts = canMapToVector(Vec->getType(), DL);
		if (!NElts)
		return false;
		// Check if load can be rewritten as load of vector
		LoadInst *LI = dyn_cast<LoadInst>(Vec);
		if (!LI || !LI->isSimple() || !LI->hasNUses(VL.size()))
		return false;
		} else {
		NElts = Vec->getType()->getVectorNumElements();
		}

		if (NElts != VL.size())
		return false;

		// Check that all of the indices extract from the correct offset.
		if (!matchExtractIndex(E0, 0, Opcode))
		return false;

		for (unsigned i = 1, e = VL.size(); i < e; ++i) {
		Instruction *E = cast<Instruction>(VL[i]);
		if (!matchExtractIndex(E, i, Opcode))
		return false;
		if (E->getOperand(0) != Vec)
		return false;
		}

		return true;
		}

int BoUpSLP::getEntryCost(TreeEntry *E) {		int BoUpSLP::getEntryCost(TreeEntry *E) {
ArrayRef<Value*> VL = E->Scalars;		ArrayRef<Value*> VL = E->Scalars;

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

[13 lines hidden]	int BoUpSLP::getEntryCost(TreeEntry *E) {
}		}
unsigned Opcode = getSameOpcode(VL);		unsigned Opcode = getSameOpcode(VL);
assert(Opcode && getSameType(VL) && getSameBlock(VL) && "Invalid VL");		assert(Opcode && getSameType(VL) && getSameBlock(VL) && "Invalid VL");
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
switch (Opcode) {		switch (Opcode) {
case Instruction::PHI: {		case Instruction::PHI: {
return 0;		return 0;
}		}
		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
if (CanReuseExtract(VL)) {		if (canReuseExtract(VL, Opcode)) {
int DeadCost = 0;		int DeadCost = 0;
for (unsigned i = 0, e = VL.size(); i < e; ++i) {		for (unsigned i = 0, e = VL.size(); i < e; ++i) {
ExtractElementInst *E = cast<ExtractElementInst>(VL[i]);		Instruction *E = cast<Instruction>(VL[i]);
if (E->hasOneUse())		if (E->hasOneUse())
// Take credit for instruction that will become dead.		// Take credit for instruction that will become dead.
DeadCost +=		DeadCost +=
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, i);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, i);
}		}
return -DeadCost;		return -DeadCost;
}		}
return getGatherCost(VecTy);		return getGatherCost(VecTy);
[671 lines hidden]	case Instruction::PHI: {
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return NewPhi;		return NewPhi;
}		}

case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
if (CanReuseExtract(E->Scalars)) {		if (canReuseExtract(E->Scalars, Instruction::ExtractElement)) {
Value *V = VL0->getOperand(0);		Value *V = VL0->getOperand(0);
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}
return Gather(E->Scalars, VecTy);		return Gather(E->Scalars, VecTy);
}		}
		case Instruction::ExtractValue: {
		if (canReuseExtract(E->Scalars, Instruction::ExtractValue)) {
		LoadInst *LI = cast<LoadInst>(VL0->getOperand(0));
		Builder.SetInsertPoint(LI);
		PointerType *PtrTy = PointerType::get(VecTy, LI->getPointerAddressSpace());
		Value *Ptr = Builder.CreateBitCast(LI->getOperand(0), PtrTy);
		LoadInst *V = Builder.CreateAlignedLoad(Ptr, LI->getAlignment());
		E->VectorizedValue = V;
		return propagateMetadata(V, E->Scalars);
		}
		return Gather(E->Scalars, VecTy);
		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
[1,197 lines hidden]	bool runOnFunction(Function &F) override {
GEPs.clear();		GEPs.clear();
bool Changed = false;		bool Changed = false;

// If the target claims to have no vector registers don't attempt		// If the target claims to have no vector registers don't attempt
// vectorization.		// vectorization.
if (!TTI->getNumberOfRegisters(true))		if (!TTI->getNumberOfRegisters(true))
return false;		return false;

// Use the vector register size specified by the target unless overridden
// by a command-line option.
// TODO: It would be better to limit the vectorization factor based on
// data type rather than just register size. For example, x86 AVX has
// 256-bit registers, but it does not support integer operations
// at that width (that requires AVX2).
if (MaxVectorRegSizeOption.getNumOccurrences())
MaxVecRegSize = MaxVectorRegSizeOption;
else
MaxVecRegSize = TTI->getRegisterBitWidth(true);

MinVecRegSize = MinVectorRegSizeOption;

// Don't vectorize when the attribute NoImplicitFloat is used.		// Don't vectorize when the attribute NoImplicitFloat is used.
if (F.hasFnAttribute(Attribute::NoImplicitFloat))		if (F.hasFnAttribute(Attribute::NoImplicitFloat))
return false;		return false;

DEBUG(dbgs() << "SLP: Analyzing blocks in " << F.getName() << ".\n");		DEBUG(dbgs() << "SLP: Analyzing blocks in " << F.getName() << ".\n");

// Use the bottom up slp vectorizer to construct chains that start with		// Use the bottom up slp vectorizer to construct chains that start with
// store instructions.		// store instructions.
[91 lines not shown]	private:
bool vectorizeStores(ArrayRef<StoreInst *> Stores, int costThreshold,		bool vectorizeStores(ArrayRef<StoreInst *> Stores, int costThreshold,
BoUpSLP &R);		BoUpSLP &R);

/// The store instructions in a basic block organized by base pointer.		/// The store instructions in a basic block organized by base pointer.
StoreListMap Stores;		StoreListMap Stores;

/// The getelementptr instructions in a basic block organized by base pointer.		/// The getelementptr instructions in a basic block organized by base pointer.
WeakVHListMap GEPs;		WeakVHListMap GEPs;

unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.
unsigned MinVecRegSize; // Set by cl::opt (default: 128).
};		};

/// \brief Check that the Values in the slice in VL array are still existent in		/// \brief Check that the Values in the slice in VL array are still existent in
/// the WeakVH array.		/// the WeakVH array.
/// Vectorization of part of the VL array may cause later values in the VL array		/// Vectorization of part of the VL array may cause later values in the VL array
/// to become invalid. We track when this has happened in the WeakVH array.		/// to become invalid. We track when this has happened in the WeakVH array.
static bool hasValueBeenRAUWed(ArrayRef<Value *> VL, ArrayRef<WeakVH> VH,		static bool hasValueBeenRAUWed(ArrayRef<Value *> VL, ArrayRef<WeakVH> VH,
unsigned SliceBegin, unsigned SliceSize) {		unsigned SliceBegin, unsigned SliceSize) {
[101 lines not shown]	while (Tails.count(I) || Heads.count(I)) {
break;		break;
Operands.push_back(I);		Operands.push_back(I);
// Move to the next value in the chain.		// Move to the next value in the chain.
I = ConsecutiveChain[I];		I = ConsecutiveChain[I];
}		}

// FIXME: Is division-by-2 the correct step? Should we assert that the		// FIXME: Is division-by-2 the correct step? Should we assert that the
// register size is a power-of-2?		// register size is a power-of-2?
for (unsigned Size = MaxVecRegSize; Size >= MinVecRegSize; Size /= 2) {		for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize(); Size /= 2) {
if (vectorizeStoreChain(Operands, costThreshold, R, Size)) {		if (vectorizeStoreChain(Operands, costThreshold, R, Size)) {
// Mark the vectorized stores so that we don't vectorize them again.		// Mark the vectorized stores so that we don't vectorize them again.
VectorizedStores.insert(Operands.begin(), Operands.end());		VectorizedStores.insert(Operands.begin(), Operands.end());
Changed = true;		Changed = true;
break;		break;
}		}
}		}
}		}
[56 lines not shown]	bool SLPVectorizer::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R,
if (!I0)		if (!I0)
return false;		return false;

unsigned Opcode0 = I0->getOpcode();		unsigned Opcode0 = I0->getOpcode();

// FIXME: Register size should be a parameter to this function, so we can		// FIXME: Register size should be a parameter to this function, so we can
// try different vectorization factors.		// try different vectorization factors.
unsigned Sz = R.getVectorElementSize(I0);		unsigned Sz = R.getVectorElementSize(I0);
unsigned VF = MinVecRegSize / Sz;		unsigned VF = R.getMinVecRegSize() / Sz;

for (Value *V : VL) {		for (Value *V : VL) {
Type *Ty = V->getType();		Type *Ty = V->getType();
if (!isValidElementType(Ty))		if (!isValidElementType(Ty))
return false;		return false;
Instruction *Inst = dyn_cast<Instruction>(V);		Instruction *Inst = dyn_cast<Instruction>(V);
if (!Inst || Inst->getOpcode() != Opcode0)		if (!Inst || Inst->getOpcode() != Opcode0)
return false;		return false;
[50 lines not shown]	if (Cost < -SLPCostThreshold) {
// The insert point is the last build vector instruction. The vectorized		// The insert point is the last build vector instruction. The vectorized
// root will precede it. This guarantees that we get an instruction. The		// root will precede it. This guarantees that we get an instruction. The
// vectorized tree could have been constant folded.		// vectorized tree could have been constant folded.
Instruction *InsertAfter = cast<Instruction>(BuildVectorSlice.back());		Instruction *InsertAfter = cast<Instruction>(BuildVectorSlice.back());
unsigned VecIdx = 0;		unsigned VecIdx = 0;
for (auto &V : BuildVectorSlice) {		for (auto &V : BuildVectorSlice) {
IRBuilder<NoFolder> Builder(InsertAfter->getParent(),		IRBuilder<NoFolder> Builder(InsertAfter->getParent(),
++BasicBlock::iterator(InsertAfter));		++BasicBlock::iterator(InsertAfter));
InsertElementInst *IE = cast<InsertElementInst>(V);		Instruction *I = cast<Instruction>(V);
		assert(isa<InsertElementInst>(I) || isa<InsertValueInst>(I));
Instruction *Extract = cast<Instruction>(Builder.CreateExtractElement(		Instruction *Extract = cast<Instruction>(Builder.CreateExtractElement(
VectorizedRoot, Builder.getInt32(VecIdx++)));		VectorizedRoot, Builder.getInt32(VecIdx++)));
IE->setOperand(1, Extract);		I->setOperand(1, Extract);
IE->removeFromParent();		I->removeFromParent();
IE->insertAfter(Extract);		I->insertAfter(Extract);
InsertAfter = IE;		InsertAfter = I;
}		}
}		}
// Move to the next bundle.		// Move to the next bundle.
i += VF - 1;		i += VF - 1;
Changed = true;		Changed = true;
}		}
}		}

[380 lines not shown]	if (!IE->hasOneUse())
return false;		return false;

IE = NextUse;		IE = NextUse;
}		}

return false;		return false;
}		}

		/// \brief Like findBuildVector, but looks backwards for construction of aggregate.
		///
		/// \return true if it matches.
		mzolotukhin: Nitpick: missing dot in the end.
		static bool findBuildAggregate(InsertValueInst *IV,
		SmallVectorImpl<Value *> &BuildVector,
		SmallVectorImpl<Value *> &BuildVectorOpds) {
		if (!IV->hasOneUse())
		return false;
		Value *V = IV->getAggregateOperand();
		if (!isa<UndefValue>(V)) {
		InsertValueInst *I = dyn_cast<InsertValueInst>(V);
		if (!I || !findBuildAggregate(I, BuildVector, BuildVectorOpds))
		return false;
		}
		BuildVector.push_back(IV);
		BuildVectorOpds.push_back(IV->getInsertedValueOperand());
		return true;
		}
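
The backward walk in findBuildAggregate can be illustrated outside LLVM. What follows is a minimal sketch under stated assumptions, not the patch's code: `ToyInsert` and `findBuildAggregateSketch` are hypothetical names, a null `Aggregate` pointer stands in for an `undef` aggregate operand, and the case where the aggregate operand is some other kind of value (which the patch rejects) is not modeled.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an insertvalue instruction; not an LLVM type.
struct ToyInsert {
  const ToyInsert *Aggregate; // nullptr models an `undef` aggregate operand
  std::string Inserted;       // name of the scalar being inserted
  unsigned NumUses;           // number of users of this insert's result
};

// Start at the last insert in the chain and recurse toward the first one.
// Each level records itself only after the recursion succeeds, so both
// output vectors end up in program order.
bool findBuildAggregateSketch(const ToyInsert *IV,
                              std::vector<const ToyInsert *> &BuildVector,
                              std::vector<std::string> &BuildVectorOpds) {
  // Mirrors the hasOneUse() check: every link must have exactly one user.
  if (IV->NumUses != 1)
    return false;
  // A non-undef aggregate operand must itself be a matching insert chain.
  if (IV->Aggregate &&
      !findBuildAggregateSketch(IV->Aggregate, BuildVector, BuildVectorOpds))
    return false;
  BuildVector.push_back(IV);
  BuildVectorOpds.push_back(IV->Inserted);
  return true;
}
```

Because every failure check runs before the recursive call and every push runs after it, a failed match leaves both output vectors empty, and a successful one lists the first insert first even though the walk begins at the insert feeding the store.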

static bool PhiTypeSorterFunc(Value *V, Value *V2) {		static bool PhiTypeSorterFunc(Value *V, Value *V2) {
return V->getType() < V2->getType();		return V->getType() < V2->getType();
}		}

/// \brief Try and get a reduction value from a phi node.		/// \brief Try and get a reduction value from a phi node.
///		///
/// Given a phi node \p P in a block \p ParentBB, consider possible reductions		/// Given a phi node \p P in a block \p ParentBB, consider possible reductions
/// if they come from either \p ParentBB or a containing loop latch.		/// if they come from either \p ParentBB or a containing loop latch.
[140 lines not shown]	if (PHINode *P = dyn_cast<PHINode>(it)) {
Value *Rdx = getReductionValue(DT, P, BB, LI);		Value *Rdx = getReductionValue(DT, P, BB, LI);

// Check if this is a Binary Operator.		// Check if this is a Binary Operator.
BinaryOperator *BI = dyn_cast_or_null<BinaryOperator>(Rdx);		BinaryOperator *BI = dyn_cast_or_null<BinaryOperator>(Rdx);
if (!BI)		if (!BI)
continue;		continue;

// Try to match and vectorize a horizontal reduction.		// Try to match and vectorize a horizontal reduction.
if (canMatchHorizontalReduction(P, BI, R, TTI, MinVecRegSize)) {		if (canMatchHorizontalReduction(P, BI, R, TTI, R.getMinVecRegSize())) {
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
continue;		continue;
}		}

Value *Inst = BI->getOperand(0);		Value *Inst = BI->getOperand(0);
if (Inst == P)		if (Inst == P)
[11 lines not shown]
continue;		continue;
}		}

if (ShouldStartVectorizeHorAtStore)		if (ShouldStartVectorizeHorAtStore)
if (StoreInst *SI = dyn_cast<StoreInst>(it))		if (StoreInst *SI = dyn_cast<StoreInst>(it))
if (BinaryOperator *BinOp =		if (BinaryOperator *BinOp =
dyn_cast<BinaryOperator>(SI->getValueOperand())) {		dyn_cast<BinaryOperator>(SI->getValueOperand())) {
if (canMatchHorizontalReduction(nullptr, BinOp, R, TTI,		if (canMatchHorizontalReduction(nullptr, BinOp, R, TTI,
MinVecRegSize) ||		R.getMinVecRegSize()) ||
tryToVectorize(BinOp, R)) {		tryToVectorize(BinOp, R)) {
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
continue;		continue;
}		}
}		}

[51 lines not shown]	if (InsertElementInst *FirstInsertElem = dyn_cast<InsertElementInst>(it)) {
if (tryToVectorizeList(BuildVectorOpds, R, BuildVector)) {		if (tryToVectorizeList(BuildVectorOpds, R, BuildVector)) {
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
}		}

continue;		continue;
}		}

		// Try to vectorize trees that start at insertvalue instructions feeding into
		// a store.
		if (StoreInst *SI = dyn_cast<StoreInst>(it)) {
		if (InsertValueInst *LastInsertValue = dyn_cast<InsertValueInst>(SI->getValueOperand())) {
		const DataLayout &DL = BB->getModule()->getDataLayout();
		if (R.canMapToVector(SI->getValueOperand()->getType(), DL)) {
		SmallVector<Value *, 16> BuildVector;
		SmallVector<Value *, 16> BuildVectorOpds;
		if (!findBuildAggregate(LastInsertValue, BuildVector, BuildVectorOpds))
		hfinkel: Don't need {} here.
		continue;

		DEBUG(dbgs() << "SLP: store of array mappable to vector: " << *SI << "\n");
		if (tryToVectorizeList(BuildVectorOpds, R, BuildVector, false)) {
		Changed = true;
		it = BB->begin();
		e = BB->end();
		}
		continue;
		}
		}
		}
}		}

return Changed;		return Changed;
}		}

bool SLPVectorizer::vectorizeGEPIndices(BasicBlock *BB, BoUpSLP &R) {		bool SLPVectorizer::vectorizeGEPIndices(BasicBlock *BB, BoUpSLP &R) {
auto Changed = false;		auto Changed = false;
for (auto &Entry : GEPs) {		for (auto &Entry : GEPs) {
[120 lines not shown]

test/Transforms/SLPVectorizer/X86/insertvalue.ll

				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7-avx | FileCheck %s

				mzolotukhin: Could you please also add a test checking that `[2 x i2]` isn't vectorized to `<2 x i2>` here? Similar to other tests in `test/Transforms/SLPVectorizer/X86/bad_types.ll`.
				; CHECK-LABEL: julia_2xdouble
				ArchDRobison (author): Fixed by addition of @julia_load_array_of_i16.
				; CHECK: load <2 x double>
				; CHECK: load <2 x double>
				; CHECK: fmul <2 x double>
				; CHECK: fadd <2 x double>
				mzolotukhin: Please add `CHECK-LABEL` statements before each function. That'll help to avoid possible interference between `CHECK` statements from different tests.
				define void @julia_2xdouble([2 x double]* sret, [2 x double], [2 x double], [2 x double]*) {
				top:
				%px0 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 0
				%x0 = load double, double* %px0, align 4
				%py0 = getelementptr inbounds [2 x double], [2 x double]* %3, i64 0, i64 0
				%y0 = load double, double* %py0, align 4
				%m0 = fmul double %x0, %y0
				%px1 = getelementptr inbounds [2 x double], [2 x double]* %2, i64 0, i64 1
				%x1 = load double, double* %px1, align 4
				%py1 = getelementptr inbounds [2 x double], [2 x double]* %3, i64 0, i64 1
				%y1 = load double, double* %py1, align 4
				%m1 = fmul double %x1, %y1
				%pz0 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
				%z0 = load double, double* %pz0, align 4
				%a0 = fadd double %m0, %z0
				%i0 = insertvalue [2 x double] undef, double %a0, 0
				%pz1 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
				%z1 = load double, double* %pz1, align 4
				%a1 = fadd double %m1, %z1
				%i1 = insertvalue [2 x double] %i0, double %a1, 1
				store [2 x double] %i1, [2 x double]* %0, align 4
				ret void
				}

				; CHECK-LABEL: julia_4xfloat
				; CHECK: load <4 x float>
				; CHECK: load <4 x float>
				; CHECK: fmul <4 x float>
				; CHECK: fadd <4 x float>
				define void @julia_4xfloat([4 x float]* sret, [4 x float], [4 x float], [4 x float]*) {
				top:
				%px0 = getelementptr inbounds [4 x float], [4 x float]* %2, i64 0, i64 0
				%x0 = load float, float* %px0, align 4
				%py0 = getelementptr inbounds [4 x float], [4 x float]* %3, i64 0, i64 0
				%y0 = load float, float* %py0, align 4
				%m0 = fmul float %x0, %y0
				%px1 = getelementptr inbounds [4 x float], [4 x float]* %2, i64 0, i64 1
				%x1 = load float, float* %px1, align 4
				%py1 = getelementptr inbounds [4 x float], [4 x float]* %3, i64 0, i64 1
				%y1 = load float, float* %py1, align 4
				%m1 = fmul float %x1, %y1
				%px2 = getelementptr inbounds [4 x float], [4 x float]* %2, i64 0, i64 2
				%x2 = load float, float* %px2, align 4
				%py2 = getelementptr inbounds [4 x float], [4 x float]* %3, i64 0, i64 2
				%y2 = load float, float* %py2, align 4
				%m2 = fmul float %x2, %y2
				%px3 = getelementptr inbounds [4 x float], [4 x float]* %2, i64 0, i64 3
				%x3 = load float, float* %px3, align 4
				%py3 = getelementptr inbounds [4 x float], [4 x float]* %3, i64 0, i64 3
				%y3 = load float, float* %py3, align 4
				%m3 = fmul float %x3, %y3
				%pz0 = getelementptr inbounds [4 x float], [4 x float]* %1, i64 0, i64 0
				%z0 = load float, float* %pz0, align 4
				%a0 = fadd float %m0, %z0
				%i0 = insertvalue [4 x float] undef, float %a0, 0
				%pz1 = getelementptr inbounds [4 x float], [4 x float]* %1, i64 0, i64 1
				%z1 = load float, float* %pz1, align 4
				%a1 = fadd float %m1, %z1
				%i1 = insertvalue [4 x float] %i0, float %a1, 1
				%pz2 = getelementptr inbounds [4 x float], [4 x float]* %1, i64 0, i64 2
				%z2 = load float, float* %pz2, align 4
				%a2 = fadd float %m2, %z2
				%i2 = insertvalue [4 x float] %i1, float %a2, 2
				%pz3 = getelementptr inbounds [4 x float], [4 x float]* %1, i64 0, i64 3
				%z3 = load float, float* %pz3, align 4
				%a3 = fadd float %m3, %z3
				%i3 = insertvalue [4 x float] %i2, float %a3, 3
				store [4 x float] %i3, [4 x float]* %0, align 4
				ret void
				}

				; CHECK-LABEL: julia_load_array_of_float
				; CHECK: fsub <4 x float>
				define void @julia_load_array_of_float([4 x float]* %a, [4 x float]* %b, [4 x float]* %c) {
				top:
				%a_arr = load [4 x float], [4 x float]* %a, align 4
				%a0 = extractvalue [4 x float] %a_arr, 0
				%a2 = extractvalue [4 x float] %a_arr, 2
				%a1 = extractvalue [4 x float] %a_arr, 1
				%b_arr = load [4 x float], [4 x float]* %b, align 4
				%b0 = extractvalue [4 x float] %b_arr, 0
				%b2 = extractvalue [4 x float] %b_arr, 2
				%b1 = extractvalue [4 x float] %b_arr, 1
				%a3 = extractvalue [4 x float] %a_arr, 3
				%c1 = fsub float %a1, %b1
				%b3 = extractvalue [4 x float] %b_arr, 3
				%c0 = fsub float %a0, %b0
				%c2 = fsub float %a2, %b2
				%c_arr0 = insertvalue [4 x float] undef, float %c0, 0
				%c_arr1 = insertvalue [4 x float] %c_arr0, float %c1, 1
				%c3 = fsub float %a3, %b3
				%c_arr2 = insertvalue [4 x float] %c_arr1, float %c2, 2
				%c_arr3 = insertvalue [4 x float] %c_arr2, float %c3, 3
				store [4 x float] %c_arr3, [4 x float]* %c, align 4
				ret void
				}

				; CHECK-LABEL: julia_load_array_of_i32
				; CHECK: load <4 x i32>
				; CHECK: load <4 x i32>
				; CHECK: sub <4 x i32>
				define void @julia_load_array_of_i32([4 x i32]* %a, [4 x i32]* %b, [4 x i32]* %c) {
				top:
				%a_arr = load [4 x i32], [4 x i32]* %a, align 4
				%a0 = extractvalue [4 x i32] %a_arr, 0
				%a2 = extractvalue [4 x i32] %a_arr, 2
				%a1 = extractvalue [4 x i32] %a_arr, 1
				%b_arr = load [4 x i32], [4 x i32]* %b, align 4
				%b0 = extractvalue [4 x i32] %b_arr, 0
				%b2 = extractvalue [4 x i32] %b_arr, 2
				%b1 = extractvalue [4 x i32] %b_arr, 1
				%a3 = extractvalue [4 x i32] %a_arr, 3
				%c1 = sub i32 %a1, %b1
				%b3 = extractvalue [4 x i32] %b_arr, 3
				%c0 = sub i32 %a0, %b0
				%c2 = sub i32 %a2, %b2
				%c_arr0 = insertvalue [4 x i32] undef, i32 %c0, 0
				%c_arr1 = insertvalue [4 x i32] %c_arr0, i32 %c1, 1
				%c3 = sub i32 %a3, %b3
				%c_arr2 = insertvalue [4 x i32] %c_arr1, i32 %c2, 2
				%c_arr3 = insertvalue [4 x i32] %c_arr2, i32 %c3, 3
				store [4 x i32] %c_arr3, [4 x i32]* %c, align 4
				ret void
				}

				; Almost identical to previous test, but for type that should NOT be vectorized.
				;
				; CHECK-LABEL: julia_load_array_of_i16
				; CHECK-NOT: i2>
				define void @julia_load_array_of_i16([4 x i16]* %a, [4 x i16]* %b, [4 x i16]* %c) {
				top:
				%a_arr = load [4 x i16], [4 x i16]* %a, align 4
				%a0 = extractvalue [4 x i16] %a_arr, 0
				%a2 = extractvalue [4 x i16] %a_arr, 2
				%a1 = extractvalue [4 x i16] %a_arr, 1
				%b_arr = load [4 x i16], [4 x i16]* %b, align 4
				%b0 = extractvalue [4 x i16] %b_arr, 0
				%b2 = extractvalue [4 x i16] %b_arr, 2
				%b1 = extractvalue [4 x i16] %b_arr, 1
				%a3 = extractvalue [4 x i16] %a_arr, 3
				%c1 = sub i16 %a1, %b1
				%b3 = extractvalue [4 x i16] %b_arr, 3
				%c0 = sub i16 %a0, %b0
				%c2 = sub i16 %a2, %b2
				%c_arr0 = insertvalue [4 x i16] undef, i16 %c0, 0
				%c_arr1 = insertvalue [4 x i16] %c_arr0, i16 %c1, 1
				%c3 = sub i16 %a3, %b3
				%c_arr2 = insertvalue [4 x i16] %c_arr1, i16 %c2, 2
				%c_arr3 = insertvalue [4 x i16] %c_arr2, i16 %c3, 3
				store [4 x i16] %c_arr3, [4 x i16]* %c, align 4
				ret void
				}

				%pseudovec = type { float, float, float, float }

				; CHECK-LABEL: julia_load_struct_of_float
				; CHECK: load <4 x float>
				; CHECK: load <4 x float>
				; CHECK: fsub <4 x float>
				define void @julia_load_struct_of_float(%pseudovec* %a, %pseudovec* %b, %pseudovec* %c) {
				top:
				%a_struct = load %pseudovec, %pseudovec* %a, align 4
				%a0 = extractvalue %pseudovec %a_struct, 0
				%a1 = extractvalue %pseudovec %a_struct, 1
				%b_struct = load %pseudovec, %pseudovec* %b, align 4
				%a2 = extractvalue %pseudovec %a_struct, 2
				%b0 = extractvalue %pseudovec %b_struct, 0
				%a3 = extractvalue %pseudovec %a_struct, 3
				%c0 = fsub float %a0, %b0
				%b1 = extractvalue %pseudovec %b_struct, 1
				%b2 = extractvalue %pseudovec %b_struct, 2
				%c1 = fsub float %a1, %b1
				%c_struct0 = insertvalue %pseudovec undef, float %c0, 0
				%b3 = extractvalue %pseudovec %b_struct, 3
				%c3 = fsub float %a3, %b3
				%c_struct1 = insertvalue %pseudovec %c_struct0, float %c1, 1
				%c2 = fsub float %a2, %b2
				%c_struct2 = insertvalue %pseudovec %c_struct1, float %c2, 2
				%c_struct3 = insertvalue %pseudovec %c_struct2, float %c3, 3
				store %pseudovec %c_struct3, %pseudovec* %c, align 4
				ret void
				}