This is an archive of the discontinued LLVM Phabricator instance.

Differential D43776

[SLP] Fix PR36481: vectorize reassociated instructions.
Closed, Public

Authored by ABataev on Feb 26 2018, 12:20 PM.

Details

Reviewers

RKSimon
spatel
hfinkel
mkuper
Ayal
• ashahid

Commits

rG428e9d9d8784: [SLP] Fix PR36481: vectorize reassociated instructions.
rG3decaf4275be: [SLP] Fix PR36481: vectorize reassociated instructions.
rL329085: [SLP] Fix PR36481: vectorize reassociated instructions.
rL328980: [SLP] Fix PR36481: vectorize reassociated instructions.

Summary

If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. This patch
allows such instructions to be reordered.

Diff Detail

Repository: rL LLVM

Event Timeline

ABataev created this revision. Feb 26 2018, 12:20 PM

ABataev mentioned this in D36130: [SLP] Vectorize jumbled memory loads. Feb 27 2018, 6:59 AM

Updated to the latest version

Harbormaster completed remote builds in B15527: Diff 136325. Feb 28 2018, 10:03 AM

Fixed generation of mask for shuffling of reordered instructions.

Harbormaster completed remote builds in B15533: Diff 136345. Feb 28 2018, 11:14 AM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references. Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess(): Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once. Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO. Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.
lib/Transforms/Vectorize/SLPVectorizer.cpp
454 ↗	(On Diff #136345)	Update above documentation accordingly. Instead of returning the index when it's not Idx, may as well have `getExtractIndex()` return it always, and have the caller compare it to Idx? While we're at it, may as well pass only `E` and have the callee get its Opcode.
608 ↗	(On Diff #136345)	The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute.
667 ↗	(On Diff #136345)	"and returns the mask for reordering operations, if it allows" should specify more accurately, something like: "...and sets \p BestOrder to the identity permutation; otherwise returns False, setting \p BestOrder to either an empty vector or a non-identity permutation that allows..."
1224 ↗	(On Diff #136345)	Update above documentation.
1228 ↗	(On Diff #136345)	Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do ++NumOpsWantsToKeepOrder[CurrentOrder]; See, e.g., `UniquifierDenseMapInfo` in LSR.
1229 ↗	(On Diff #136345)	Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., `NumOpsWantToKeepOriginalOrder`. Can add that `NumOpsWantToKeepOriginalOrder` holds the count for the identity permutation, instead of holding this count inside `NumOpsWantsToKeepOrder` along with all other permutations.
1581 ↗	(On Diff #136345)	BestOrder >> CurrentOrder?
1587 ↗	(On Diff #136345)	Better early exit by returning here.
1593 ↗	(On Diff #136345)	This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like `++NumOpsWantToKeepOrder[BestOrder]`.
1631 ↗	(On Diff #136345)	Sink the emplace_back to after the handling of non-simple loads?
1640 ↗	(On Diff #136345)	"BestOrder" >> "CurrentOrder", or "VLOrder"?
1644 ↗	(On Diff #136345)	Reuse PointerOps and have Value P0,1 = PointerOps.front,back() instead of LoadInst L0,1 just to get their PointerOperand (and SCEV) later?
1659 ↗	(On Diff #136345)	Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.
2022 ↗	(On Diff #136345)	Comment that BestOrder is initialized to invalid values. Perhaps set `E = VL.size()` here and assign `E + 1`, to match the later checks for initialized/unset values.
2029 ↗	(On Diff #136345)	Can simplify by checking if BestOrder is the identity permutation at the end, as done at the end of sortPtrAccesses(); using `getExtractIndex(Inst)` which returns Idx even if it's equal to I. Better rename BestOrder here too.
3077 ↗	(On Diff #136345)	Capture this "inversePermutation" in a method, to be called again below?
3086 ↗	(On Diff #136345)	InsertPoint may be set twice to VL0?
3298 ↗	(On Diff #136345)	Should we first inverse the permutation and then take its front()? Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.
4915 ↗	(On Diff #136345)	If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0]. Provide the generalization below to any permutation when the assert can be dropped, i.e., in a separate patch which handles this TODO?
4917 ↗	(On Diff #136345)	Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences).
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24 ↗	(On Diff #136345)	This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.
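
The pigeonhole-sort idea suggested above for `isPermutedConsecutiveAccess()` can be sketched roughly as follows. This is only an illustration on plain integer offsets, not code from the patch: the function signature, the use of `std::vector`, and the assumption that each access has already been reduced to an element offset from a common base are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LLVM). Offsets[I] is the element offset of
// the I-th access, in program order, from some common base.
// On success, SortedIndices[K] is the program-order index of the access that
// lands in slot K of the vector. Fails on gaps and on replicated offsets.
bool isPermutedConsecutiveAccess(const std::vector<int64_t> &Offsets,
                                 std::vector<unsigned> &SortedIndices) {
  SortedIndices.clear();
  if (Offsets.empty())
    return false;
  const unsigned N = static_cast<unsigned>(Offsets.size());
  const int64_t Min = *std::min_element(Offsets.begin(), Offsets.end());

  // The value N marks a slot that has not been assigned yet.
  std::vector<unsigned> Slots(N, N);
  for (unsigned I = 0; I < N; ++I) {
    // Unsigned subtraction avoids signed overflow; Offsets[I] >= Min holds.
    const uint64_t Diff =
        static_cast<uint64_t>(Offsets[I]) - static_cast<uint64_t>(Min);
    // Diff >= N means a gap; an already assigned slot means a duplicate.
    if (Diff >= N || Slots[Diff] != N)
      return false;
    Slots[Diff] = I;
  }
  // N distinct in-range offsets fill all N slots (pigeonhole principle).
  assert(std::find(Slots.begin(), Slots.end(), N) == Slots.end() &&
         "Every slot must be assigned.");
  SortedIndices.swap(Slots);
  return true;
}
```

For the offsets 2, 0, 1, 3 (the a[i+2], a[i+0], a[i+1], a[i+3] case) this yields the indices 1, 2, 0, 3, while 4, 0, 1, 7 is rejected because of the gaps.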

In D43776#1031044, @Ayal wrote:
This patch addresses the following TODO, plus handles extracts:
// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	The documentation above does not say anything about consecutive access. It just states that the pointers are sorted, and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores. Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	Tried, it does not work in many cases.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1229 ↗	(On Diff #136345)	I don't want to add the new entry for operations, that do not require reordering. I'd better the code in another way.
1581 ↗	(On Diff #136345)	I think it does not matter because the current order is the best order.
1593 ↗	(On Diff #136345)	I need to use an iterator, will rework the code.
1659 ↗	(On Diff #136345)	Just like I said, I did it for future support of masked loads|stores.
2029 ↗	(On Diff #136345)	Checking the boolean flag is faster than performing N comparisons. That's why I'd prefer to leave it as is.
3086 ↗	(On Diff #136345)	Missed it, thanks.
3298 ↗	(On Diff #136345)	There are several tests already that test that this is correct code. SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll etc.
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24 ↗	(On Diff #136345)	Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this).
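
The split described in these replies, a general sort in the analysis plus a separate consecutiveness check in the caller, might look like the sketch below. It works on precomputed element offsets instead of SCEV differences, and both function names are made up for the illustration. Rejecting repeated offsets mirrors the committed version of `sortPtrAccesses()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <set>
#include <vector>

// Hypothetical stand-in for sortPtrAccesses(), working on element offsets.
// Returns false for replicated offsets. Leaves SortedIndices empty when the
// accesses are already in sorted order.
bool sortOffsets(const std::vector<int64_t> &Offsets,
                 std::vector<unsigned> &SortedIndices) {
  SortedIndices.clear();
  std::set<int64_t> Seen;
  for (int64_t Off : Offsets)
    if (!Seen.insert(Off).second)
      return false;

  SortedIndices.resize(Offsets.size());
  std::iota(SortedIndices.begin(), SortedIndices.end(), 0u);
  std::stable_sort(SortedIndices.begin(), SortedIndices.end(),
                   [&Offsets](unsigned Left, unsigned Right) {
                     return Offsets[Left] < Offsets[Right];
                   });

  // An identity permutation means that no reordering is required.
  bool IsIdentity = true;
  for (unsigned I = 0; I < SortedIndices.size(); ++I)
    IsIdentity = IsIdentity && SortedIndices[I] == I;
  if (IsIdentity)
    SortedIndices.clear();
  return true;
}

// Hypothetical caller-side check: distinct offsets are consecutive once
// sorted exactly when max - min equals the number of accesses minus one.
bool isConsecutiveOnceSorted(const std::vector<int64_t> &Offsets) {
  assert(!Offsets.empty() && "Expected at least one access.");
  auto MinMax = std::minmax_element(Offsets.begin(), Offsets.end());
  const uint64_t Span = static_cast<uint64_t>(*MinMax.second) -
                        static_cast<uint64_t>(*MinMax.first);
  return Span == Offsets.size() - 1;
}
```

With the offsets 4, 0, 1, 7 the sort succeeds and produces 1, 2, 0, 3, but the caller-side check fails, which is the case that would need masked loads.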

Updates after review

Harbormaster completed remote builds in B15891: Diff 137784. Mar 9 2018, 10:02 AM

Ayal added inline comments. Mar 9 2018, 2:43 PM

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses. Masked loads and stores (will) also require that all pointers be within range of a single vector, right?
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	Then that should be fixed, worth opening a PR?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1229 ↗	(On Diff #136345)	Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside `NumOpsWantsToKeepOrder` map, but we're instead keeping it outside using `NumOpsWantToKeepOriginalOrder` counter.
1581 ↗	(On Diff #136345)	The current order may not be the best order, the first time the tree is built, right?
2029 ↗	(On Diff #136345)	Right; it could simplify the code though, and N is usually very small.
1621 ↗	(On Diff #137784)	Very good, thanks. Can `CurrentOrder` be used instead of `I->getFirst()`? Can `++NumOpsWantToKeepOrder[CurrentOrder]` work? After all the trouble...
3027 ↗	(On Diff #137784)	May as well use `E` when resizing `Mask`.
3358 ↗	(On Diff #137784)	Another inversePermutation?
5693 ↗	(On Diff #137784)	Fold by feeding the constructor with the size.
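
A helper along the lines of the "inversePermutation" requested in these comments could be sketched as below. The name comes from the review comment; the signature and the use of `std::vector` are assumptions made for the sketch.

```cpp
#include <cassert>
#include <vector>

// Sketch only. Indices[I] names the original lane that ends up in position I.
// The returned Mask maps each original lane back to its new position, so
// Mask[Indices[I]] == I for every I.
std::vector<unsigned> inversePermutation(const std::vector<unsigned> &Indices) {
  const unsigned E = static_cast<unsigned>(Indices.size());
  std::vector<unsigned> Mask(E);
  for (unsigned I = 0; I < E; ++I) {
    assert(Indices[I] < E && "Index out of range for a permutation.");
    Mask[Indices[I]] = I;
  }
  return Mask;
}
```

Applying it to 1, 2, 0, 3 gives 2, 0, 1, 3, and applying it twice returns the input.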

ABataev added inline comments. Mar 12 2018, 9:35 AM

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	I see. Right
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the `GetUnderlyingObject()` and kind of a `GEP %ptr, 0, n`, where `n` is constant. But this function does not work if the first load element is from `GEP %ptr, 0, m`, not from `%ptr`, and others are from `GEP %ptr, 0, m+1`, `GEP %ptr, 0, m+2` etc.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1229 ↗	(On Diff #136345)	Added a comment to the declaration of `NumOpsWantToKeepOrder`
1581 ↗	(On Diff #136345)	Yes, but this is the best order for this particular bundle. Anyway, I renamed it.
1621 ↗	(On Diff #137784)	No, we cannot use `CurrentOrder`. I tried to reduce the memory usage and instead of copying the `CurrentOrder` in the `newTreeEntry()` function I keep the ArrayRef for this order. But `CurrentOrder` is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map)
3027 ↗	(On Diff #137784)	Oh, yes, missed it, thanks.
3358 ↗	(On Diff #137784)	Yup, thanks.
5693 ↗	(On Diff #137784)	Reworked it, thanks.
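
The lifetime issue described for line 1621 can be illustrated with a small sketch. It assumes a container whose keys keep a stable address; `std::map` is used here for that reason and is an assumption of the sketch, not the container used in the patch. `OrderRef` is a hypothetical non-owning view standing in for `ArrayRef<unsigned>`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using OrdersType = std::vector<unsigned>;

// Hypothetical non-owning view, playing the role of ArrayRef<unsigned>.
struct OrderRef {
  const unsigned *Data;
  std::size_t Size;
};

// Stand-in for the order counters kept by the vectorizer.
struct OrderCounter {
  std::map<OrdersType, unsigned> NumOpsWantToKeepOrder;

  // Counts one more bundle that wants Order and returns a view of the key
  // stored inside the map. The view stays valid after the caller's local
  // copy of Order is destroyed, as long as the map entry is not erased.
  OrderRef record(const OrdersType &Order) {
    auto I = NumOpsWantToKeepOrder.emplace(Order, 0u).first;
    assert(I->first == Order && "Stored key must match the requested order.");
    ++I->second;
    return OrderRef{I->first.data(), I->first.size()};
  }
};
```

A tree entry would then store the returned view instead of a view of the temporary order.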

Update after review

Harbormaster completed remote builds in B15973: Diff 138035. Mar 12 2018, 9:36 AM

mssimpso added a subscriber: mssimpso. Mar 13 2018, 11:48 AM

Ping

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	Check for zero gaps, i.e., replicated references? These are currently not supported, when checking `canReuseExtract()`.
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	ok, right; `computeConstantDifference()` is lightweight. But it would be good to refactor a more time consuming variant which checks if `getMinusSCEV` returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1659 ↗	(On Diff #136345)	OK, so `sortPtrAccesses()` can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an `isPermutedConsecutiveAccess()` method which would call `sortPtrAccesses()`.
1621 ↗	(On Diff #137784)	OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that `newTreeEntry()` should be given the object stored inside `NumOpsWantToKeepOrder` rather than the equivalent temporary `CurrentOrder`. So `++NumOpsWantToKeepOrder[CurrentOrder]` can work, but then `newTreeEntry()` will need to be given `NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()`?
458 ↗	(On Diff #138035)	Add a message to the assert.
1250 ↗	(On Diff #138035)	"\a" >> "\p"
1254 ↗	(On Diff #138035)	"DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?
1617 ↗	(On Diff #138035)	May be helpful to also dump CurrentOrder into dbgs().
1651 ↗	(On Diff #138035)	Fold the size into the constructor.
2041 ↗	(On Diff #138035)	`// Assign initial value to of all items to E + 1 so we can check if the` >> `// Assign to all items the initial value E + 1 so we can check if the`
2044 ↗	(On Diff #138035)	(if at least one element of ... ). ", by checking that no element of `CurrentOrder` still has value `E + 1`." Note that there is no such check later.
2074 ↗	(On Diff #138035)	It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (`I < E`) clear CurrentOrder and return false.
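
The best-order selection that the requested tie-break tests are meant to cover can be sketched as follows. It follows the `bestOrder()` logic in the diff for this revision, but uses `std::map` and a bool return with an out-parameter instead of the LLVM containers and `Optional`; those substitutions are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using OrdersType = std::vector<unsigned>;

// Returns true and sets Best only when some reordering is requested strictly
// more often than the original order. On a tie, or when the original order
// is requested more often, returns false and leaves Best empty.
bool bestOrder(const std::map<OrdersType, unsigned> &NumOpsWantToKeepOrder,
               unsigned NumOpsWantToKeepOriginalOrder, OrdersType &Best) {
  Best.clear();
  auto I = std::max_element(
      NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
      [](const std::pair<const OrdersType, unsigned> &D1,
         const std::pair<const OrdersType, unsigned> &D2) {
        return D1.second < D2.second;
      });
  if (I == NumOpsWantToKeepOrder.end() ||
      I->second <= NumOpsWantToKeepOriginalOrder)
    return false;
  assert(!I->first.empty() && "Stored orders are expected to be non-empty.");
  Best = I->first;
  return true;
}
```

The three cases map directly onto the comparison with `NumOpsWantToKeepOriginalOrder`: a smaller count selects the reordering, an equal or larger count keeps the original order.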

In D43776#1047322, @Ayal wrote:

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

We already have a couple of tests for extractvalue: ARM/sroa.ll and X86/insertvalue.ll. They already have tests with a different order of the extractvalue instructions.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗	(On Diff #136345)	Agree, will add checks
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	Added checks for addressspace, comparison for underlying objects already were there.
lib/Transforms/Vectorize/SLPVectorizer.cpp
458 ↗	(On Diff #138035)	Ok
2044 ↗	(On Diff #138035)	This check is not required; it happens automatically. We first check that the number of extract elements is the same as the vector length. If we later try to write an index to an element that does not equal `E+1`, it means that at least one element would still have `E+1` as its value and that at least 2 elements have the same index.

Updated after review

Looks good to me, thanks for addressing the issues, have only a few last minor suggestions.

In D43776#1032937, @ABataev wrote:
In D43776#1031044, @Ayal wrote:
This patch addresses the following TODO, plus handles extracts:
// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?
Yes, but this must be in a different patch, not this one.

Sure, please add a TODO. This patch makes such rebuilds more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
673 ↗	(On Diff #140474)	Sounds better to specify "Returns 'true' if ..., otherwise returns 'false'" ?
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #136345)	Right, the comment was about refactoring it altogether into a separate, more time consuming variant of `computeConstantDifference()` that checks everything; can alternatively leave a TODO for now?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1252 ↗	(On Diff #140474)	`/// DirectOrderNum.` >> `/// NumOpsWantToKeepOriginalOrder.`
1629 ↗	(On Diff #140474)	Would the following work and be easier to read?
    ++NumOpsWantToKeepOrder[CurrentOrder];
    auto &StoredCurrentOrder = NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst();
    newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies, StoredCurrentOrder);
or alternatively rename `I` to something more meaningful like `StoredCurrentOrderAndNum`.
2054 ↗	(On Diff #140474)	May as well do `CurrentOrder.assign(E, E+1);`
2074 ↗	(On Diff #138035)	The suggestion for easier reading was for something like:
    for (unsigned I = 0; I < E; ++I) {
      auto *Inst = cast<Instruction>(VL[I]);
      if (Inst->getOperand(0) != Vec)
        break;
      Optional<unsigned> Idx = getExtractIndex(Inst);
      if (!Idx)
        break;
      const unsigned ExtIdx = *Idx;
      if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
        break;
      CurrentOrder[ExtIdx] = I;
      if (ExtIdx != I)
        ShouldKeepOrder = false;
    }
    if (I < E) {
      CurrentOrder.clear();
      return false;
    }
    return ShouldKeepOrder;
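
A self-contained variant of the suggested loop, for experimenting outside of LLVM, might look like this. The extract indices are passed in as plain integers, with -1 standing in for a `getExtractIndex()` result of None; the function name and signature are hypothetical. The loop counter is declared before the loop so that the `I < E` test after it is well-formed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch only. ExtractIdx[I] is the constant lane read by the I-th extract
// of the bundle, or -1 when that lane is unknown.
// Returns true when the extracts are already in lane order. Otherwise it
// returns false with CurrentOrder either empty (the source vector cannot be
// reused) or holding a non-identity permutation.
bool computeExtractOrder(const std::vector<int> &ExtractIdx,
                         std::vector<unsigned> &CurrentOrder) {
  const unsigned E = static_cast<unsigned>(ExtractIdx.size());
  // E + 1 marks an entry that has not been assigned yet.
  CurrentOrder.assign(E, E + 1);
  bool ShouldKeepOrder = true;
  unsigned I = 0;
  for (; I < E; ++I) {
    if (ExtractIdx[I] < 0)
      break;
    const unsigned ExtIdx = static_cast<unsigned>(ExtractIdx[I]);
    // Bail out on an out-of-range lane or on a lane that was seen before.
    if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
      break;
    CurrentOrder[ExtIdx] = I;
    if (ExtIdx != I)
      ShouldKeepOrder = false;
  }
  if (I < E) {
    CurrentOrder.clear();
    return false;
  }
  // E distinct lanes below E fill every entry, so no E + 1 value remains.
  assert(std::find(CurrentOrder.begin(), CurrentOrder.end(), E + 1) ==
             CurrentOrder.end() &&
         "Every entry must be assigned.");
  return ShouldKeepOrder;
}
```

The final assertion spells out the argument made earlier in the thread: once E extracts with distinct in-range lanes have been accepted, no separate scan for leftover `E + 1` values is needed.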

This revision is now accepted and ready to land. Apr 1 2018, 12:29 AM

Closed by commit rL328980: [SLP] Fix PR36481: vectorize reassociated instructions. (authored by ABataev). Apr 2 2018, 7:54 AM

This revision was automatically updated to reflect the committed changes.

ABataev marked 5 inline comments as done.

Revision Contents

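Path                                                                                    Size
llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h                                   14 lines
llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp                                          61 lines
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp                                   315 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll              13 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/extract.ll                                 11 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll                   25 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll          46 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll                27 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll                            51 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/reassociated-loads.ll                      107 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll                           25 lines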

Diff 140625

llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h

[661 lines not shown]
 /// If necessary this method will version the stride of the pointer according
 /// to \p PtrToStride and therefore add further predicates to \p PSE.
 /// The \p Assume parameter indicates if we are allowed to make additional
 /// run-time assumptions.
 int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
                      const ValueToValueMap &StridesMap = ValueToValueMap(),
                      bool Assume = false, bool ShouldCheckWrap = true);

+/// \brief Attempt to sort the pointers in \p VL and return the sorted indices
+/// in \p SortedIndices, if reordering is required.
+///
+/// Returns 'true' if sorting is legal, otherwise returns 'false'.
+///
+/// For example, for a given \p VL of memory accesses in program order, a[i+4],
+/// a[i+0], a[i+1] and a[i+7], this function will sort the \p VL and save the
+/// sorted indices in \p SortedIndices as a[i+0], a[i+1], a[i+4], a[i+7] and
+/// saves the mask for actual memory accesses in program order in
+/// \p SortedIndices as <1,2,0,3>
+bool sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                     ScalarEvolution &SE,
+                     SmallVectorImpl<unsigned> &SortedIndices);
+
 /// \brief Returns true if the memory operations \p A and \p B are consecutive.
 /// This is a simple API that does not depend on the analysis pass.
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                          ScalarEvolution &SE, bool CheckType = true);

 /// \brief This analysis provides dependence information for the memory accesses
 /// of a loop.
 ///
[72 lines not shown]

llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp

[1,081 lines not shown; enclosing context: if (Assume) {]
       PSE.setNoOverflow(Ptr, SCEVWrapPredicate::IncrementNUSW);
     } else
       return 0;
   }

   return Stride;
 }

+bool llvm::sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                           ScalarEvolution &SE,
+                           SmallVectorImpl<unsigned> &SortedIndices) {
+  assert(llvm::all_of(
+             VL, [](const Value *V) { return V->getType()->isPointerTy(); }) &&
+         "Expected list of pointer operands.");
+  SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
+  OffValPairs.reserve(VL.size());
+
+  // Walk over the pointers, and map each of them to an offset relative to
+  // first pointer in the array.
+  Value *Ptr0 = VL[0];
+  const SCEV *Scev0 = SE.getSCEV(Ptr0);
+  Value *Obj0 = GetUnderlyingObject(Ptr0, DL);
+
+  llvm::SmallSet<int64_t, 4> Offsets;
+  for (auto *Ptr : VL) {
+    // TODO: Outline this code as a special, more time consuming, version of
+    // computeConstantDifference() function.
+    if (Ptr->getType()->getPointerAddressSpace() !=
+        Ptr0->getType()->getPointerAddressSpace())
+      return false;
+    // If a pointer refers to a different underlying object, bail - the
+    // pointers are by definition incomparable.
+    Value *CurrObj = GetUnderlyingObject(Ptr, DL);
+    if (CurrObj != Obj0)
+      return false;
+
+    const SCEV *Scev = SE.getSCEV(Ptr);
+    const auto *Diff = dyn_cast<SCEVConstant>(SE.getMinusSCEV(Scev, Scev0));
+    // The pointers may not have a constant offset from each other, or SCEV
+    // may just not be smart enough to figure out they do. Regardless,
+    // there's nothing we can do.
+    if (!Diff)
+      return false;
+
+    // Check if the pointer with the same offset is found.
+    int64_t Offset = Diff->getAPInt().getSExtValue();
+    if (!Offsets.insert(Offset).second)
+      return false;
+    OffValPairs.emplace_back(Offset, Ptr);
+  }
+  SortedIndices.clear();
+  SortedIndices.resize(VL.size());
+  std::iota(SortedIndices.begin(), SortedIndices.end(), 0);
+
+  // Sort the memory accesses and keep the order of their uses in UseOrder.
+  std::stable_sort(SortedIndices.begin(), SortedIndices.end(),
+                   [&OffValPairs](unsigned Left, unsigned Right) {
+                     return OffValPairs[Left].first < OffValPairs[Right].first;
+                   });
+
+  // Check if the order is consecutive already.
+  if (llvm::all_of(SortedIndices, [&SortedIndices](const unsigned I) {
+        return I == SortedIndices[I];
+      }))
+    SortedIndices.clear();
+
+  return true;
+}
+
 /// Take the address space operand from the Load/Store instruction.
 /// Returns -1 if this is not a valid Load/Store instruction.
 static unsigned getAddressSpaceOperand(Value *I) {
   if (LoadInst *L = dyn_cast<LoadInst>(I))
     return L->getPointerAddressSpace();
   if (StoreInst *S = dyn_cast<StoreInst>(I))
     return S->getPointerAddressSpace();
   return -1;
[1,197 lines not shown]

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

[446 lines not shown; enclosing context: static bool allSameType(ArrayRef<Value *> VL) {]
   for (int i = 1, e = VL.size(); i < e; i++)
     if (VL[i]->getType() != Ty)
       return false;

   return true;
 }

 /// \returns True if Extract{Value,Element} instruction extracts element Idx.
-static bool matchExtractIndex(Instruction *E, unsigned Idx, unsigned Opcode) {
-  assert(Opcode == Instruction::ExtractElement ||
-         Opcode == Instruction::ExtractValue);
+static Optional<unsigned> getExtractIndex(Instruction *E) {
+  unsigned Opcode = E->getOpcode();
+  assert((Opcode == Instruction::ExtractElement ||
+          Opcode == Instruction::ExtractValue) &&
+         "Expected extractelement or extractvalue instruction.");
   if (Opcode == Instruction::ExtractElement) {
-    ConstantInt *CI = dyn_cast<ConstantInt>(E->getOperand(1));
-    return CI && CI->getZExtValue() == Idx;
-  } else {
-    ExtractValueInst *EI = cast<ExtractValueInst>(E);
-    return EI->getNumIndices() == 1 && *EI->idx_begin() == Idx;
+    auto *CI = dyn_cast<ConstantInt>(E->getOperand(1));
+    if (!CI)
+      return None;
+    return CI->getZExtValue();
   }
+  ExtractValueInst *EI = cast<ExtractValueInst>(E);
+  if (EI->getNumIndices() != 1)
+    return None;
+  return *EI->idx_begin();
 }

 /// \returns True if in-tree use also needs extract. This refers to
 /// possible scalar operand in vectorized instruction.
 static bool InTreeUserNeedToExtract(Value *Scalar, Instruction *UserInst,
                                     TargetLibraryInfo *TLI) {
   unsigned Opcode = UserInst->getOpcode();
   switch (Opcode) {
[108 lines not shown; enclosing context: public:]

   /// Clear the internal data structures that are created by 'buildTree'.
   void deleteTree() {
     VectorizableTree.clear();
     ScalarToTreeEntry.clear();
     MustGather.clear();
     ExternalUses.clear();
     NumOpsWantToKeepOrder.clear();
+    NumOpsWantToKeepOriginalOrder = 0;
     for (auto &Iter : BlocksSchedules) {
       BlockScheduling *BS = Iter.second.get();
       BS->clear();
     }
     MinBWs.clear();
   }

   unsigned getTreeSize() const { return VectorizableTree.size(); }

   /// \brief Perform LICM and CSE on the newly generated gather sequences.
   void optimizeGatherSequence();

-  /// \returns true if it is beneficial to reverse the vector order.
-  bool shouldReorder() const {
-    return std::accumulate(
-               NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(), 0,
-               [](int Val1,
-                  const decltype(NumOpsWantToKeepOrder)::value_type &Val2) {
-                 return Val1 + (Val2.second < 0 ? 1 : -1);
-               }) > 0;
+  /// \returns The best order of instructions for vectorization.
+  Optional<ArrayRef<unsigned>> bestOrder() const {
+    auto I = std::max_element(
+        NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
+        [](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
+           const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
+          return D1.second < D2.second;
+        });
+    if (I == NumOpsWantToKeepOrder.end() ||
+        I->getSecond() <= NumOpsWantToKeepOriginalOrder)
+      return None;
+
+    return makeArrayRef(I->getFirst());
   }

   /// \return The vector element size in bits to use when vectorizing the
   /// expression tree ending at \p V. If V is a store, the size is the width of
   /// the stored value. Otherwise, the size is the width of the largest loaded
   /// value reaching V. This method is used by the vectorizer to calculate
   /// vectorization factors.
   unsigned getVectorElementSize(Value *V);
[30 lines not shown; enclosing context: private:]
   bool areAllUsersVectorized(Instruction *I) const;

   /// \returns the cost of the vectorizable entry.
   int getEntryCost(TreeEntry *E);

   /// This is the recursive part of buildTree.
   void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);

-  /// \returns True if the ExtractElement/ExtractValue instructions in VL can
-  /// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
-  bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;
+  /// \returns true if the ExtractElement/ExtractValue instructions in \p VL can
+  /// be vectorized to use the original vector (or aggregate "bitcast" to a
+  /// vector) and sets \p CurrentOrder to the identity permutation; otherwise
+  /// returns false, setting \p CurrentOrder to either an empty vector or a
+  /// non-identity permutation that allows to reuse extract instructions.
+  bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
+                       SmallVectorImpl<unsigned> &CurrentOrder) const;

   /// Vectorize a single entry in the tree.
   Value *vectorizeTree(TreeEntry *E);

   /// Vectorize a single entry in the tree, starting in \p VL.
   Value *vectorizeTree(ArrayRef<Value *> VL);

   /// \returns the scalarization cost for this type. Scalarization in this
[47 lines not shown; enclosing context: struct TreeEntry {]
     Value *VectorizedValue = nullptr;

     /// Do we need to gather this sequence ?
     bool NeedToGather = false;

     /// Does this sequence require some shuffling?
     SmallVector<unsigned, 4> ReuseShuffleIndices;

+    /// Does this entry require reordering?
+    ArrayRef<unsigned> ReorderIndices;
+
     /// Points back to the VectorizableTree.
     ///
     /// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
     /// to be a pointer and needs to be able to initialize the child iterator.
     /// Thus we need a reference back to the container to translate the indices
     /// to entries.
     std::vector<TreeEntry> &Container;

     /// The TreeEntry index containing the user of this entry. We can actually
     /// have multiple users so the data structure is not truly a tree.
     SmallVector<int, 1> UserTreeIndices;
   };

   /// Create a new VectorizableTree entry.
   void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,
-                    ArrayRef<unsigned> ReuseShuffleIndices = None) {
+                    ArrayRef<unsigned> ReuseShuffleIndices = None,
+                    ArrayRef<unsigned> ReorderIndices = None) {
     VectorizableTree.emplace_back(VectorizableTree);
     int idx = VectorizableTree.size() - 1;
     TreeEntry *Last = &VectorizableTree[idx];
     Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
     Last->NeedToGather = !Vectorized;
     Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
                                      ReuseShuffleIndices.end());
+    Last->ReorderIndices = ReorderIndices;
     if (Vectorized) {
       for (int i = 0, e = VL.size(); i != e; ++i) {
         assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
         ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}
[445 lines hidden]	#endif

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

/// Number of operation bundles that contain consecutive operations - number		using OrdersType = SmallVector<unsigned, 4>;
/// of operation bundles that contain consecutive operations in reversed		/// A DenseMapInfo implementation for holding DenseMaps and DenseSets of
/// order.		/// sorted SmallVectors of unsigned.
DenseMap<unsigned, int> NumOpsWantToKeepOrder;		struct OrdersTypeDenseMapInfo {
		static OrdersType getEmptyKey() {
		OrdersType V;
		V.push_back(~1U);
		return V;
		}

		static OrdersType getTombstoneKey() {
		OrdersType V;
		V.push_back(~2U);
		return V;
		}

		static unsigned getHashValue(const OrdersType &V) {
		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
		}

		static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {
		return LHS == RHS;
		}
		};

		/// Contains orders of operations along with the number of bundles that have
		/// operations in this order. It stores only those orders that require
		/// reordering, if reordering is not required it is counted using \a
		/// NumOpsWantToKeepOriginalOrder.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
		/// Number of bundles that do not require reordering.
		unsigned NumOpsWantToKeepOriginalOrder = 0;

// Analysis and block reference.		// Analysis and block reference.
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
LoopInfo *LI;		LoopInfo *LI;
[335 lines hidden]	case Instruction::PHI: {
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, VL0);		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");
++NumOpsWantToKeepOrder[S.Opcode];		++NumOpsWantToKeepOriginalOrder;
} else {		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
SmallVector<Value *, 4> ReverseVL(VL.rbegin(), VL.rend());		ReuseShuffleIndicies);
if (canReuseExtract(ReverseVL, VL0))		return;
--NumOpsWantToKeepOrder[S.Opcode];
BS.cancelScheduling(VL, VL0);
}		}
newTreeEntry(VL, Reuse, UserTreeIdx, ReuseShuffleIndicies);		if (!CurrentOrder.empty()) {
		#ifndef NDEBUG
		dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "
		"with order";
		for (unsigned Idx : CurrentOrder)
		dbgs() << " " << Idx;
		dbgs() << "\n";
		#endif // NDEBUG
		// Insert new order with initial value 0, if it does not exist,
		// otherwise return the iterator to the existing one.
		auto StoredCurrentOrderAndNum =
		NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
		++StoredCurrentOrderAndNum->getSecond();
		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
		StoredCurrentOrderAndNum->getFirst());
		return;
		}
		DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
		newTreeEntry(VL, /*Vectorized=*/false, UserTreeIdx, ReuseShuffleIndicies);
		BS.cancelScheduling(VL, VL0);
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load. For example, we don't want to vectorize loads that are smaller		// load. For example, we don't want to vectorize loads that are smaller
// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
// treats loading/storing it as an i8 struct. If we vectorize loads/stores		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
// from such a struct, we read/write packed bits disagreeing with the		// from such a struct, we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		SmallVector<Value *, 4> PointerOps(VL.size());
LoadInst *L = cast<LoadInst>(VL[i]);		auto POIter = PointerOps.begin();
		for (Value *V : VL) {
		auto *L = cast<LoadInst>(V);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
		*POIter = L->getPointerOperand();
		++POIter;
}		}

// Check if the loads are consecutive, reversed, or neither.		OrdersType CurrentOrder;
// TODO: What we really want is to sort the loads, but for now, check		// Check the order of pointer operands.
// the two likely directions.		if (llvm::sortPtrAccesses(PointerOps, DL, SE, CurrentOrder)) {
bool Consecutive = true;		Value *Ptr0;
bool ReverseConsecutive = true;		Value *PtrN;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		if (CurrentOrder.empty()) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		Ptr0 = PointerOps.front();
Consecutive = false;		PtrN = PointerOps.back();
break;
} else {		} else {
ReverseConsecutive = false;		Ptr0 = PointerOps[CurrentOrder.front()];
}		PtrN = PointerOps[CurrentOrder.back()];
}		}
		const SCEV *Scev0 = SE->getSCEV(Ptr0);
if (Consecutive) {		const SCEV *ScevN = SE->getSCEV(PtrN);
++NumOpsWantToKeepOrder[S.Opcode];		const auto *Diff =
newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);		dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));
		uint64_t Size = DL->getTypeAllocSize(ScalarTy);
		// Check that the sorted loads are consecutive.
		if (Diff && Diff->getAPInt().getZExtValue() == (VL.size() - 1) * Size) {
		if (CurrentOrder.empty()) {
		// Original loads are consecutive and do not require reordering.
		++NumOpsWantToKeepOriginalOrder;
		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
		ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		} else {
}		// Need to reorder.
		auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
// If none of the load pairs were consecutive when checked in order,		++I->getSecond();
// check the reverse order.		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
if (ReverseConsecutive)		ReuseShuffleIndicies, I->getFirst());
for (unsigned i = VL.size() - 1; i > 0; --i)		DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;
break;
}		}

if (ReverseConsecutive) {
--NumOpsWantToKeepOrder[S.Opcode];
newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: added a vector of reversed loads.\n");
return;		return;
}		}
		}

DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
[293 lines hidden]	if (ST) {
// Check that struct is homogeneous.		// Check that struct is homogeneous.
for (const auto *Ty : ST->elements())		for (const auto *Ty : ST->elements())
if (Ty != EltTy)		if (Ty != EltTy)
return 0;		return 0;
}		}
return N;		return N;
}		}

bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const {		bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
		SmallVectorImpl<unsigned> &CurrentOrder) const {
Instruction *E0 = cast<Instruction>(OpValue);		Instruction *E0 = cast<Instruction>(OpValue);
assert(E0->getOpcode() == Instruction::ExtractElement ||		assert(E0->getOpcode() == Instruction::ExtractElement ||
E0->getOpcode() == Instruction::ExtractValue);		E0->getOpcode() == Instruction::ExtractValue);
assert(E0->getOpcode() == getSameOpcode(VL).Opcode && "Invalid opcode");		assert(E0->getOpcode() == getSameOpcode(VL).Opcode && "Invalid opcode");
// Check if all of the extracts come from the same vector and from the		// Check if all of the extracts come from the same vector and from the
// correct offset.		// correct offset.
Value *Vec = E0->getOperand(0);		Value *Vec = E0->getOperand(0);

		CurrentOrder.clear();

// We have to extract from a vector/aggregate with the same number of elements.		// We have to extract from a vector/aggregate with the same number of elements.
unsigned NElts;		unsigned NElts;
if (E0->getOpcode() == Instruction::ExtractValue) {		if (E0->getOpcode() == Instruction::ExtractValue) {
const DataLayout &DL = E0->getModule()->getDataLayout();		const DataLayout &DL = E0->getModule()->getDataLayout();
NElts = canMapToVector(Vec->getType(), DL);		NElts = canMapToVector(Vec->getType(), DL);
if (!NElts)		if (!NElts)
return false;		return false;
// Check if load can be rewritten as load of vector.		// Check if load can be rewritten as load of vector.
LoadInst *LI = dyn_cast<LoadInst>(Vec);		LoadInst *LI = dyn_cast<LoadInst>(Vec);
if (!LI || !LI->isSimple() || !LI->hasNUses(VL.size()))		if (!LI || !LI->isSimple() || !LI->hasNUses(VL.size()))
return false;		return false;
} else {		} else {
NElts = Vec->getType()->getVectorNumElements();		NElts = Vec->getType()->getVectorNumElements();
}		}

if (NElts != VL.size())		if (NElts != VL.size())
return false;		return false;

// Check that all of the indices extract from the correct offset.		// Check that all of the indices extract from the correct offset.
for (unsigned I = 0, E = VL.size(); I < E; ++I) {		bool ShouldKeepOrder = true;
Instruction *Inst = cast<Instruction>(VL[I]);		unsigned E = VL.size();
if (!matchExtractIndex(Inst, I, Inst->getOpcode()))		// Assign to all items the initial value E + 1 so we can check if the extract
return false;		// instruction index was used already.
		// Also, later we can check that all the indices are used and we have a
		// consecutive access in the extract instructions, by checking that no
		// element of CurrentOrder still has value E + 1.
		CurrentOrder.assign(E, E + 1);
		unsigned I = 0;
		for (; I < E; ++I) {
		auto *Inst = cast<Instruction>(VL[I]);
if (Inst->getOperand(0) != Vec)		if (Inst->getOperand(0) != Vec)
		break;
		Optional<unsigned> Idx = getExtractIndex(Inst);
		if (!Idx)
		break;
		const unsigned ExtIdx = *Idx;
		if (ExtIdx != I) {
		if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
		break;
		ShouldKeepOrder = false;
		CurrentOrder[ExtIdx] = I;
		} else {
		if (CurrentOrder[I] != E + 1)
		break;
		CurrentOrder[I] = I;
		}
		}
		if (I < E) {
		CurrentOrder.clear();
return false;		return false;
}		}

return true;		return ShouldKeepOrder;
}		}

bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {		bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {
return I->hasOneUse() ||		return I->hasOneUse() ||
std::all_of(I->user_begin(), I->user_end(), [this](User *U) {		std::all_of(I->user_begin(), I->user_end(), [this](User *U) {
return ScalarToTreeEntry.count(U) > 0;		return ScalarToTreeEntry.count(U) > 0;
});		});
}		}
[85 lines hidden]	case Instruction::ExtractElement:
Idx = IO->getZExtValue();		Idx = IO->getZExtValue();
} else {		} else {
--Idx;		--Idx;
}		}
ReuseShuffleCost +=		ReuseShuffleCost +=
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, Idx);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, Idx);
}		}
}		}
if (canReuseExtract(VL, S.OpValue)) {		if (!E->NeedToGather) {
int DeadCost = ReuseShuffleCost;		int DeadCost = ReuseShuffleCost;
		if (!E->ReorderIndices.empty()) {
		// TODO: Merge this shuffle with the ReuseShuffleCost.
		DeadCost += TTI->getShuffleCost(
		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
		}
for (unsigned i = 0, e = VL.size(); i < e; ++i) {		for (unsigned i = 0, e = VL.size(); i < e; ++i) {
Instruction *E = cast<Instruction>(VL[i]);		Instruction *E = cast<Instruction>(VL[i]);
// If all users are going to be vectorized, instruction can be		// If all users are going to be vectorized, instruction can be
// considered as dead.		// considered as dead.
// The same, if have only one user, it will be vectorized for sure.		// The same, if have only one user, it will be vectorized for sure.
if (areAllUsersVectorized(E))		if (areAllUsersVectorized(E))
// Take credit for instruction that will become dead.		// Take credit for instruction that will become dead.
DeadCost -=		DeadCost -=
[146 lines hidden]	case Instruction::Load: {
ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *		ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy,		TTI->getMemoryOpCost(Instruction::Load, ScalarTy,
alignment, 0, VL0);		alignment, 0, VL0);
}		}
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0, VL0);		VecTy, alignment, 0, VL0);
if (!isConsecutiveAccess(VL[0], VL[1], DL, SE)) {		if (!E->ReorderIndices.empty()) {
		// TODO: Merge this shuffle with the ReuseShuffleCost.
VecLdCost += TTI->getShuffleCost(		VecLdCost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
return ReuseShuffleCost + VecLdCost - ScalarLdCost;		return ReuseShuffleCost + VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
[681 lines hidden]	if (!ReuseShuffleIndicies.empty()) {
if (auto *I = dyn_cast<Instruction>(V)) {		if (auto *I = dyn_cast<Instruction>(V)) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
return V;		return V;
}		}

		static void inversePermutation(ArrayRef<unsigned> Indices,
		SmallVectorImpl<unsigned> &Mask) {
		Mask.clear();
		const unsigned E = Indices.size();
		Mask.resize(E);
		for (unsigned I = 0; I < E; ++I)
		Mask[Indices[I]] = I;
		}

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

if (E->VectorizedValue) {		if (E->VectorizedValue) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

[60 lines hidden]	case Instruction::PHI: {
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return V;		return V;
}		}

case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
if (canReuseExtract(E->Scalars, VL0)) {		if (!E->NeedToGather) {
Value *V = VL0->getOperand(0);		Value *V = VL0->getOperand(0);
		if (!E->ReorderIndices.empty()) {
		OrdersType Mask;
		inversePermutation(E->ReorderIndices, Mask);
		Builder.SetInsertPoint(VL0);
		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy), Mask,
		"reorder_shuffle");
		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
		if (!E->ReorderIndices.empty())
Builder.SetInsertPoint(VL0);		Builder.SetInsertPoint(VL0);
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);
auto *V = Gather(E->Scalars, VecTy);		auto *V = Gather(E->Scalars, VecTy);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
if (auto *I = dyn_cast<Instruction>(V)) {		if (auto *I = dyn_cast<Instruction>(V)) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}
case Instruction::ExtractValue: {		case Instruction::ExtractValue: {
if (canReuseExtract(E->Scalars, VL0)) {		if (!E->NeedToGather) {
LoadInst *LI = cast<LoadInst>(VL0->getOperand(0));		LoadInst *LI = cast<LoadInst>(VL0->getOperand(0));
Builder.SetInsertPoint(LI);		Builder.SetInsertPoint(LI);
PointerType *PtrTy = PointerType::get(VecTy, LI->getPointerAddressSpace());		PointerType *PtrTy = PointerType::get(VecTy, LI->getPointerAddressSpace());
Value *Ptr = Builder.CreateBitCast(LI->getOperand(0), PtrTy);		Value *Ptr = Builder.CreateBitCast(LI->getOperand(0), PtrTy);
LoadInst *V = Builder.CreateAlignedLoad(Ptr, LI->getAlignment());		LoadInst *V = Builder.CreateAlignedLoad(Ptr, LI->getAlignment());
Value *NewV = propagateMetadata(V, E->Scalars);		Value *NewV = propagateMetadata(V, E->Scalars);
		if (!E->ReorderIndices.empty()) {
		OrdersType Mask;
		inversePermutation(E->ReorderIndices, Mask);
		NewV = Builder.CreateShuffleVector(NewV, UndefValue::get(VecTy), Mask,
		"reorder_shuffle");
		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
NewV = Builder.CreateShuffleVector(		NewV = Builder.CreateShuffleVector(
NewV, UndefValue::get(VecTy), E->ReuseShuffleIndices, "shuffle");		NewV, UndefValue::get(VecTy), E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = NewV;		E->VectorizedValue = NewV;
return NewV;		return NewV;
}		}
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);
auto *V = Gather(E->Scalars, VecTy);		auto *V = Gather(E->Scalars, VecTy);
[157 lines hidden]	case Instruction::Xor: {
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
bool IsReversed =		bool IsReorder = !E->ReorderIndices.empty();
!isConsecutiveAccess(E->Scalars[0], E->Scalars[1], DL, SE);		if (IsReorder)
if (IsReversed)		VL0 = cast<Instruction>(E->Scalars[E->ReorderIndices.front()]);
VL0 = cast<Instruction>(E->Scalars.back());
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
Type *ScalarLoadTy = LI->getType();		Type *ScalarLoadTy = LI->getType();
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
Value *V = propagateMetadata(LI, E->Scalars);		Value *V = propagateMetadata(LI, E->Scalars);
if (IsReversed) {		if (IsReorder) {
SmallVector<uint32_t, 4> Mask(E->Scalars.size());		OrdersType Mask;
std::iota(Mask.rbegin(), Mask.rend(), 0);		inversePermutation(E->ReorderIndices, Mask);
V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);		V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),
		Mask, "reorder_shuffle");
}		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::Store: {		case Instruction::Store: {
[1,562 lines hidden]	for (unsigned I = NextInst; I < MaxInst; ++I) {
if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))		if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))
continue;		continue;

DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");
ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);		ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);

R.buildTree(Ops);		R.buildTree(Ops);
		Optional<ArrayRef<unsigned>> Order = R.bestOrder();
// TODO: check if we can allow reordering for more cases.		// TODO: check if we can allow reordering for more cases.
if (AllowReorder && R.shouldReorder()) {		if (AllowReorder && Order) {
		// TODO: reorder tree nodes without tree rebuilding.
// Conceptually, there is nothing actually preventing us from trying to		// Conceptually, there is nothing actually preventing us from trying to
// reorder a larger list. In fact, we do exactly this when vectorizing		// reorder a larger list. In fact, we do exactly this when vectorizing
// reductions. However, at this point, we only expect to get here when		// reductions. However, at this point, we only expect to get here when
// there are exactly two operations.		// there are exactly two operations.
assert(Ops.size() == 2);		assert(Ops.size() == 2);
Value *ReorderedOps[] = {Ops[1], Ops[0]};		Value *ReorderedOps[] = {Ops[1], Ops[0]};
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, None);
}		}
[729 lines hidden]	bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
for (auto &Pair : ExtraArgs)		for (auto &Pair : ExtraArgs)
ExternallyUsedValues[Pair.second].push_back(Pair.first);		ExternallyUsedValues[Pair.second].push_back(Pair.first);
SmallVector<Value *, 16> IgnoreList;		SmallVector<Value *, 16> IgnoreList;
for (auto &V : ReductionOps)		for (auto &V : ReductionOps)
IgnoreList.append(V.begin(), V.end());		IgnoreList.append(V.begin(), V.end());
while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {		while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);		auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
V.buildTree(VL, ExternallyUsedValues, IgnoreList);		V.buildTree(VL, ExternallyUsedValues, IgnoreList);
if (V.shouldReorder()) {		Optional<ArrayRef<unsigned>> Order = V.bestOrder();
SmallVector<Value *, 8> Reversed(VL.rbegin(), VL.rend());		if (Order) {
V.buildTree(Reversed, ExternallyUsedValues, IgnoreList);		// TODO: reorder tree nodes without tree rebuilding.
		SmallVector<Value *, 4> ReorderedOps(VL.size());
		llvm::transform(*Order, ReorderedOps.begin(),
		[VL](const unsigned Idx) { return VL[Idx]; });
		V.buildTree(ReorderedOps, ExternallyUsedValues, IgnoreList);
}		}
if (V.isTreeTinyAndNotFullyVectorizable())		if (V.isTreeTinyAndNotFullyVectorizable())
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int Cost =		int Cost =
[710 lines hidden]
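
Note on the `reorder_shuffle` masks: in `vectorizeTree` they are obtained by inverting `ReorderIndices`. The following is only a rough standalone illustration, not part of the patch. It assumes `std::vector` as a stand-in for `SmallVectorImpl`, the name `inversePermutationSketch` is hypothetical, and the bounds assertion is an addition.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone analogue of the patch's inversePermutation helper.
// Given Indices, where Indices[I] names the slot that element I maps to,
// build Mask such that Mask[Indices[I]] == I.
std::vector<unsigned>
inversePermutationSketch(const std::vector<unsigned> &Indices) {
  const unsigned E = static_cast<unsigned>(Indices.size());
  std::vector<unsigned> Mask(E);
  for (unsigned I = 0; I < E; ++I) {
    // Added for the sketch: every index must address a valid slot.
    assert(Indices[I] < E && "index out of range for a permutation");
    Mask[Indices[I]] = I;
  }
  return Mask;
}
```

For the order {3, 0, 1, 2} this produces {1, 2, 3, 0}, which matches the `<i32 1, i32 2, i32 3, i32 0>` mask checked in external_user_jumbled_load.ll.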

llvm/trunk/test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s

	@array = external global [20 x [13 x i32]]			@array = external global [20 x [13 x i32]]

	define void @hoge(i64 %idx, <4 x i32>* %sink) {			define void @hoge(i64 %idx, <4 x i32>* %sink) {
	; CHECK-LABEL: @hoge(			; CHECK-LABEL: @hoge(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX:%.*]], i64 5			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX:%.*]], i64 5
	; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 6			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 6
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 7			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 7
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 8			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 8
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP1]] to <2 x i32>*			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*
	; CHECK-NEXT: [[TMP5:%.*]] = load <2 x i32>, <2 x i32>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP5]], i32 0			; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP5]], i32 1			; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 1
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
	; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[TMP3]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 2
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP10]], i32 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP10]], i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[TMP0]], align 4			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 3
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP12]], i32 3			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP12]], i32 3
	; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[SINK:%.*]]			; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[SINK:%.*]]
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%0 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 5			%0 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 5
	%1 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 6			%1 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 6
	%2 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 7			%2 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 7

llvm/trunk/test/Transforms/SLPVectorizer/X86/extract.ll

Show All 24 Lines	entry:
store double %A1, double* %P1, align 4		store double %A1, double* %P1, align 4
ret void		ret void
}		}

define void @fextr1(double* %ptr) {		define void @fextr1(double* %ptr) {
; CHECK-LABEL: @fextr1(		; CHECK-LABEL: @fextr1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef		; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef
; CHECK-NEXT: [[V0:%.*]] = extractelement <2 x double> [[LD]], i32 0		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <2 x double> [[LD]], <2 x double> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[V1:%.*]] = extractelement <2 x double> [[LD]], i32 1
; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0		; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0
; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[V1]], i32 0		; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> <double 3.400000e+00, double 1.200000e+00>, [[REORDER_SHUFFLE]]
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V0]], i32 1		; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: [[TMP2:%.*]] = fadd <2 x double> <double 3.400000e+00, double 1.200000e+00>, [[TMP1]]		; CHECK-NEXT: store <2 x double> [[TMP0]], <2 x double>* [[TMP1]], align 4
; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP2]], <2 x double>* [[TMP3]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%LD = load <2 x double>, <2 x double>* undef		%LD = load <2 x double>, <2 x double>* undef
%V0 = extractelement <2 x double> %LD, i32 0		%V0 = extractelement <2 x double> %LD, i32 0
%V1 = extractelement <2 x double> %LD, i32 1		%V1 = extractelement <2 x double> %LD, i32 1
%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order		%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order
%P1 = getelementptr inbounds double, double* %ptr, i64 0		%P1 = getelementptr inbounds double, double* %ptr, i64 0

llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* bitcast (i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1) to <2 x i32>*), align 4			; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[REORDER_SHUFFLE]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP2]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP6]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt <4 x i32> [[TMP8]], zeroinitializer			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 8, i32 3
	; CHECK-NEXT: [[TMP13:%.*]] = select <4 x i1> [[TMP9]], <4 x i32> [[TMP12]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4

llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll

Show All 15 Lines	;void jumble (int * restrict A, int * restrict B) {

; Function Attrs: norecurse nounwind uwtable		; Function Attrs: norecurse nounwind uwtable
define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {		define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
; CHECK-LABEL: @jumble1(		; CHECK-LABEL: @jumble1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[A]] to <2 x i32>*
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX6]], align 4
; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13		; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2		; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1		; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP1]], [[TMP4]]
; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP2]], i32 2
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP11]]
; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[B]] to <4 x i32>*		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%arrayidx = getelementptr inbounds i32, i32* %A, i64 10		%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%1 = load i32, i32* %A, align 4		%1 = load i32, i32* %A, align 4
%mul = mul nsw i32 %0, %1		%mul = mul nsw i32 %0, %1
%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11		%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
Show All 24 Lines
;Reversing the operand of MUL		;Reversing the operand of MUL
; Function Attrs: norecurse nounwind uwtable		; Function Attrs: norecurse nounwind uwtable
define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {		define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
; CHECK-LABEL: @jumble2(		; CHECK-LABEL: @jumble2(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[A]] to <2 x i32>*
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX6]], align 4
; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13		; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2		; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1		; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP1]]
; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP2]], i32 2
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = mul nsw <4 x i32> [[TMP11]], [[TMP4]]
; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[B]] to <4 x i32>*		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%arrayidx = getelementptr inbounds i32, i32* %A, i64 10		%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%1 = load i32, i32* %A, align 4		%1 = load i32, i32* %A, align 4
%mul = mul nsw i32 %1, %0		%mul = mul nsw i32 %1, %0
%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11		%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
Show All 24 Lines

llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll

	; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50			; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50
	; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75			; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
	; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2			; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
	; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3			; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP34:%.*]], <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP27:%.*]], <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
	; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP34]], [[FOR_INC]] ]			; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP27]], [[FOR_INC]] ]
	; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]			; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
	; CHECK: if.then:			; CHECK: if.then:
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]			; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]			; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3			; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
	Show All 34 Lines
	; CHECK: if.else43:			; CHECK: if.else43:
	; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4			; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4
	; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0			; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0
	; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]			; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]
	; CHECK: if.then46:			; CHECK: if.then46:
	; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]			; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]
	; CHECK-NEXT: [[TMP22:%.*]] = bitcast i32* [[ARRAYIDX49]] to <2 x i32>*			; CHECK-NEXT: [[TMP22:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
	; CHECK-NEXT: [[TMP23:%.*]] = load <2 x i32>, <2 x i32>* [[TMP22]], align 4			; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP22]]
	; CHECK-NEXT: [[TMP24:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3			; CHECK-NEXT: [[TMP23:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP24]]			; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP23]]
	; CHECK-NEXT: [[TMP25:%.*]] = load i32, i32* [[ARRAYIDX55]], align 4			; CHECK-NEXT: [[TMP24:%.*]] = bitcast i32* [[ARRAYIDX49]] to <4 x i32>*
	; CHECK-NEXT: [[TMP26:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP25:%.*]] = load <4 x i32>, <4 x i32>* [[TMP24]], align 4
	; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP26]]			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <4 x i32> [[TMP25]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
	; CHECK-NEXT: [[TMP27:%.*]] = load i32, i32* [[ARRAYIDX58]], align 4
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <2 x i32> [[TMP23]], i32 0
	; CHECK-NEXT: [[TMP29:%.*]] = insertelement <4 x i32> undef, i32 [[TMP28]], i32 0
	; CHECK-NEXT: [[TMP30:%.*]] = extractelement <2 x i32> [[TMP23]], i32 1
	; CHECK-NEXT: [[TMP31:%.*]] = insertelement <4 x i32> [[TMP29]], i32 [[TMP30]], i32 1
	; CHECK-NEXT: [[TMP32:%.*]] = insertelement <4 x i32> [[TMP31]], i32 [[TMP25]], i32 2
	; CHECK-NEXT: [[TMP33:%.*]] = insertelement <4 x i32> [[TMP32]], i32 [[TMP27]], i32 3
	; CHECK-NEXT: br label [[FOR_INC]]			; CHECK-NEXT: br label [[FOR_INC]]
	; CHECK: for.inc:			; CHECK: for.inc:
	; CHECK-NEXT: [[TMP34]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP33]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]			; CHECK-NEXT: [[TMP27]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP26]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	;			;
	entry:			entry:
	%0 = load i32, i32* %A, align 4			%0 = load i32, i32* %A, align 4
	%cmp1 = icmp eq i32 %0, 0			%cmp1 = icmp eq i32 %0, 0
	%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25			%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25

llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s		; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {		define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load(		; CHECK-LABEL: @jumbled-load(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0		; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]		; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]		; CHECK-NEXT: [[REORDER_SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]		; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[REORDER_SHUFFLE1]]
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4
Show All 22 Lines	;

ret i32 undef		ret i32 undef
}		}


define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {		define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load-multiuses(		; CHECK-LABEL: @jumbled-load-multiuses(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_4]]		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_2]]		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_1]]		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 2
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_3]]		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 1
		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1
		; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 3
		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2
		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
		; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3
		; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[TMP10]]
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4		; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP11]], <4 x i32>* [[TMP12]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4

llvm/trunk/test/Transforms/SLPVectorizer/X86/reassociated-loads.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -reassociate -slp-vectorizer -slp-vectorize-hor -slp-vectorize-hor-store -S < %s -mtriple=x86_64-apple-macosx -mcpu=corei7-avx -mattr=+avx2 | FileCheck %s			; RUN: opt -reassociate -slp-vectorizer -slp-vectorize-hor -slp-vectorize-hor-store -S < %s -mtriple=x86_64-apple-macosx -mcpu=corei7-avx -mattr=+avx2 | FileCheck %s

	define signext i8 @Foo(<32 x i8>* %__v) {			define signext i8 @Foo(<32 x i8>* %__v) {
	; CHECK-LABEL: @Foo(			; CHECK-LABEL: @Foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <32 x i8>, <32 x i8>* [[__V:%.*]], align 32			; CHECK-NEXT: [[TMP0:%.*]] = load <32 x i8>, <32 x i8>* [[__V:%.*]], align 32
	; CHECK-NEXT: [[VECEXT_I_I_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 0			; CHECK-NEXT: [[ADD_I_1_I:%.*]] = add i8 undef, undef
	; CHECK-NEXT: [[VECEXT_I_I_1_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 1			; CHECK-NEXT: [[ADD_I_2_I:%.*]] = add i8 [[ADD_I_1_I]], undef
	; CHECK-NEXT: [[ADD_I_1_I:%.*]] = add i8 [[VECEXT_I_I_1_I]], [[VECEXT_I_I_I]]			; CHECK-NEXT: [[ADD_I_3_I:%.*]] = add i8 [[ADD_I_2_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_2_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 2			; CHECK-NEXT: [[ADD_I_4_I:%.*]] = add i8 [[ADD_I_3_I]], undef
	; CHECK-NEXT: [[ADD_I_2_I:%.*]] = add i8 [[ADD_I_1_I]], [[VECEXT_I_I_2_I]]			; CHECK-NEXT: [[ADD_I_5_I:%.*]] = add i8 [[ADD_I_4_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_3_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 3			; CHECK-NEXT: [[ADD_I_6_I:%.*]] = add i8 [[ADD_I_5_I]], undef
	; CHECK-NEXT: [[ADD_I_3_I:%.*]] = add i8 [[ADD_I_2_I]], [[VECEXT_I_I_3_I]]			; CHECK-NEXT: [[ADD_I_7_I:%.*]] = add i8 [[ADD_I_6_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_4_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 4			; CHECK-NEXT: [[ADD_I_8_I:%.*]] = add i8 [[ADD_I_7_I]], undef
	; CHECK-NEXT: [[ADD_I_4_I:%.*]] = add i8 [[ADD_I_3_I]], [[VECEXT_I_I_4_I]]			; CHECK-NEXT: [[ADD_I_9_I:%.*]] = add i8 [[ADD_I_8_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_5_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 5			; CHECK-NEXT: [[ADD_I_10_I:%.*]] = add i8 [[ADD_I_9_I]], undef
	; CHECK-NEXT: [[ADD_I_5_I:%.*]] = add i8 [[ADD_I_4_I]], [[VECEXT_I_I_5_I]]			; CHECK-NEXT: [[ADD_I_11_I:%.*]] = add i8 [[ADD_I_10_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_6_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 6			; CHECK-NEXT: [[ADD_I_12_I:%.*]] = add i8 [[ADD_I_11_I]], undef
	; CHECK-NEXT: [[ADD_I_6_I:%.*]] = add i8 [[ADD_I_5_I]], [[VECEXT_I_I_6_I]]			; CHECK-NEXT: [[ADD_I_13_I:%.*]] = add i8 [[ADD_I_12_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_7_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 7			; CHECK-NEXT: [[ADD_I_14_I:%.*]] = add i8 [[ADD_I_13_I]], undef
	; CHECK-NEXT: [[ADD_I_7_I:%.*]] = add i8 [[ADD_I_6_I]], [[VECEXT_I_I_7_I]]			; CHECK-NEXT: [[ADD_I_15_I:%.*]] = add i8 [[ADD_I_14_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_8_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 8			; CHECK-NEXT: [[ADD_I_16_I:%.*]] = add i8 [[ADD_I_15_I]], undef
	; CHECK-NEXT: [[ADD_I_8_I:%.*]] = add i8 [[ADD_I_7_I]], [[VECEXT_I_I_8_I]]			; CHECK-NEXT: [[ADD_I_17_I:%.*]] = add i8 [[ADD_I_16_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_9_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 9			; CHECK-NEXT: [[ADD_I_18_I:%.*]] = add i8 [[ADD_I_17_I]], undef
	; CHECK-NEXT: [[ADD_I_9_I:%.*]] = add i8 [[ADD_I_8_I]], [[VECEXT_I_I_9_I]]			; CHECK-NEXT: [[ADD_I_19_I:%.*]] = add i8 [[ADD_I_18_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_10_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 10			; CHECK-NEXT: [[ADD_I_20_I:%.*]] = add i8 [[ADD_I_19_I]], undef
	; CHECK-NEXT: [[ADD_I_10_I:%.*]] = add i8 [[ADD_I_9_I]], [[VECEXT_I_I_10_I]]			; CHECK-NEXT: [[ADD_I_21_I:%.*]] = add i8 [[ADD_I_20_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_11_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 11			; CHECK-NEXT: [[ADD_I_22_I:%.*]] = add i8 [[ADD_I_21_I]], undef
	; CHECK-NEXT: [[ADD_I_11_I:%.*]] = add i8 [[ADD_I_10_I]], [[VECEXT_I_I_11_I]]			; CHECK-NEXT: [[ADD_I_23_I:%.*]] = add i8 [[ADD_I_22_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_12_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 12			; CHECK-NEXT: [[ADD_I_24_I:%.*]] = add i8 [[ADD_I_23_I]], undef
	; CHECK-NEXT: [[ADD_I_12_I:%.*]] = add i8 [[ADD_I_11_I]], [[VECEXT_I_I_12_I]]			; CHECK-NEXT: [[ADD_I_25_I:%.*]] = add i8 [[ADD_I_24_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_13_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 13			; CHECK-NEXT: [[ADD_I_26_I:%.*]] = add i8 [[ADD_I_25_I]], undef
	; CHECK-NEXT: [[ADD_I_13_I:%.*]] = add i8 [[ADD_I_12_I]], [[VECEXT_I_I_13_I]]			; CHECK-NEXT: [[ADD_I_27_I:%.*]] = add i8 [[ADD_I_26_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_14_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 14			; CHECK-NEXT: [[ADD_I_28_I:%.*]] = add i8 [[ADD_I_27_I]], undef
	; CHECK-NEXT: [[ADD_I_14_I:%.*]] = add i8 [[ADD_I_13_I]], [[VECEXT_I_I_14_I]]			; CHECK-NEXT: [[ADD_I_29_I:%.*]] = add i8 [[ADD_I_28_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_15_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 15			; CHECK-NEXT: [[ADD_I_30_I:%.*]] = add i8 [[ADD_I_29_I]], undef
	; CHECK-NEXT: [[ADD_I_15_I:%.*]] = add i8 [[ADD_I_14_I]], [[VECEXT_I_I_15_I]]			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <32 x i8> [[TMP0]], <32 x i8> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_16_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 16			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <32 x i8> [[TMP0]], [[RDX_SHUF]]
	; CHECK-NEXT: [[ADD_I_16_I:%.*]] = add i8 [[ADD_I_15_I]], [[VECEXT_I_I_16_I]]			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <32 x i8> [[BIN_RDX]], <32 x i8> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_17_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 17			; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <32 x i8> [[BIN_RDX]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[ADD_I_17_I:%.*]] = add i8 [[ADD_I_16_I]], [[VECEXT_I_I_17_I]]			; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <32 x i8> [[BIN_RDX2]], <32 x i8> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_18_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 18			; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <32 x i8> [[BIN_RDX2]], [[RDX_SHUF3]]
	; CHECK-NEXT: [[ADD_I_18_I:%.*]] = add i8 [[ADD_I_17_I]], [[VECEXT_I_I_18_I]]			; CHECK-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <32 x i8> [[BIN_RDX4]], <32 x i8> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_19_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 19			; CHECK-NEXT: [[BIN_RDX6:%.*]] = add <32 x i8> [[BIN_RDX4]], [[RDX_SHUF5]]
	; CHECK-NEXT: [[ADD_I_19_I:%.*]] = add i8 [[ADD_I_18_I]], [[VECEXT_I_I_19_I]]			; CHECK-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <32 x i8> [[BIN_RDX6]], <32 x i8> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_20_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 20			; CHECK-NEXT: [[BIN_RDX8:%.*]] = add <32 x i8> [[BIN_RDX6]], [[RDX_SHUF7]]
	; CHECK-NEXT: [[ADD_I_20_I:%.*]] = add i8 [[ADD_I_19_I]], [[VECEXT_I_I_20_I]]			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <32 x i8> [[BIN_RDX8]], i32 0
	; CHECK-NEXT: [[VECEXT_I_I_21_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 21			; CHECK-NEXT: [[ADD_I_31_I:%.*]] = add i8 [[ADD_I_30_I]], undef
	; CHECK-NEXT: [[ADD_I_21_I:%.*]] = add i8 [[ADD_I_20_I]], [[VECEXT_I_I_21_I]]			; CHECK-NEXT: ret i8 [[TMP1]]
	; CHECK-NEXT: [[VECEXT_I_I_22_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 22
	; CHECK-NEXT: [[ADD_I_22_I:%.*]] = add i8 [[ADD_I_21_I]], [[VECEXT_I_I_22_I]]
	; CHECK-NEXT: [[VECEXT_I_I_23_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 23
	; CHECK-NEXT: [[ADD_I_23_I:%.*]] = add i8 [[ADD_I_22_I]], [[VECEXT_I_I_23_I]]
	; CHECK-NEXT: [[VECEXT_I_I_24_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 24
	; CHECK-NEXT: [[ADD_I_24_I:%.*]] = add i8 [[ADD_I_23_I]], [[VECEXT_I_I_24_I]]
	; CHECK-NEXT: [[VECEXT_I_I_25_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 25
	; CHECK-NEXT: [[ADD_I_25_I:%.*]] = add i8 [[ADD_I_24_I]], [[VECEXT_I_I_25_I]]
	; CHECK-NEXT: [[VECEXT_I_I_26_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 26
	; CHECK-NEXT: [[ADD_I_26_I:%.*]] = add i8 [[ADD_I_25_I]], [[VECEXT_I_I_26_I]]
	; CHECK-NEXT: [[VECEXT_I_I_27_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 27
	; CHECK-NEXT: [[ADD_I_27_I:%.*]] = add i8 [[ADD_I_26_I]], [[VECEXT_I_I_27_I]]
	; CHECK-NEXT: [[VECEXT_I_I_28_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 28
	; CHECK-NEXT: [[ADD_I_28_I:%.*]] = add i8 [[ADD_I_27_I]], [[VECEXT_I_I_28_I]]
	; CHECK-NEXT: [[VECEXT_I_I_29_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 29
	; CHECK-NEXT: [[ADD_I_29_I:%.*]] = add i8 [[ADD_I_28_I]], [[VECEXT_I_I_29_I]]
	; CHECK-NEXT: [[VECEXT_I_I_30_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 30
	; CHECK-NEXT: [[ADD_I_30_I:%.*]] = add i8 [[ADD_I_29_I]], [[VECEXT_I_I_30_I]]
	; CHECK-NEXT: [[VECEXT_I_I_31_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 31
	; CHECK-NEXT: [[ADD_I_31_I:%.*]] = add i8 [[ADD_I_30_I]], [[VECEXT_I_I_31_I]]
	; CHECK-NEXT: ret i8 [[ADD_I_31_I]]
	;			;
	entry:			entry:
	%0 = load <32 x i8>, <32 x i8>* %__v, align 32			%0 = load <32 x i8>, <32 x i8>* %__v, align 32
	%vecext.i.i.i = extractelement <32 x i8> %0, i64 0			%vecext.i.i.i = extractelement <32 x i8> %0, i64 0
	%vecext.i.i.1.i = extractelement <32 x i8> %0, i64 1			%vecext.i.i.1.i = extractelement <32 x i8> %0, i64 1
	%add.i.1.i = add i8 %vecext.i.i.1.i, %vecext.i.i.i			%add.i.1.i = add i8 %vecext.i.i.1.i, %vecext.i.i.i
	%vecext.i.i.2.i = extractelement <32 x i8> %0, i64 2			%vecext.i.i.2.i = extractelement <32 x i8> %0, i64 2
	%add.i.2.i = add i8 %vecext.i.i.2.i, %add.i.1.i			%add.i.2.i = add i8 %vecext.i.i.2.i, %add.i.1.i

llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[REORDER_SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[REORDER_SHUFFLE1]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4