This is an archive of the discontinued LLVM Phabricator instance.

Differential D43776

[SLP] Fix PR36481: vectorize reassociated instructions.
ClosedPublic

Authored by ABataev on Feb 26 2018, 12:20 PM.

Details

Reviewers

RKSimon
spatel
hfinkel
mkuper
Ayal
• ashahid

Commits

rG428e9d9d8784: [SLP] Fix PR36481: vectorize reassociated instructions.
rG3decaf4275be: [SLP] Fix PR36481: vectorize reassociated instructions.
rL329085: [SLP] Fix PR36481: vectorize reassociated instructions.
rL328980: [SLP] Fix PR36481: vectorize reassociated instructions.

Summary

If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 15891
Build 15891: arc lint + arc unit

Event Timeline

ABataev created this revision. Feb 26 2018, 12:20 PM

ABataev mentioned this in D36130: [SLP] Vectorize jumbled memory loads. Feb 27 2018, 6:59 AM

Updated to the latest version

Harbormaster completed remote builds in B15527: Diff 136325. Feb 28 2018, 10:03 AM

Fixed generation of mask for shuffling of reordered instructions.

Harbormaster completed remote builds in B15533: Diff 136345. Feb 28 2018, 11:14 AM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

include/llvm/Analysis/LoopAccessAnalysis.h
679	The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references. Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess(): Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once. Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO. Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.
lib/Analysis/LoopAccessAnalysis.cpp
1113	Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.
lib/Transforms/Vectorize/SLPVectorizer.cpp
454	Update above documentation accordingly. Instead of returning the index when it's not Idx, may as well have `getExtractIndex()` return it always, and have the caller compare it to Idx? While we're at it, may as well pass only `E` and have the callee get its Opcode.
607	The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute.
667	"and returns the mask for reordering operations, if it allows" should specify more accurately, something like: "...and sets \p BestOrder to the identity permutation; otherwise returns False, setting \p BestOrder to either an empty vector or a non-identity permutation that allows..."
1222–1223	Update above documentation.
1226	Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do ++NumOpsWantsToKeepOrder[CurrentOrder]; See, e.g., `UniquifierDenseMapInfo` in LSR.
1227	Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., `NumOpsWantToKeepOriginalOrder` Can add that `NumOpsWantToKeepOriginalOrder` holds the count for the identity permutation, instead of holding this count inside `NumOpsWantsToKeepOrder` along with all other permutations.
1604–1605	BestOrder >> CurrentOrder?
1609	Better early exit by returning here.
1612	This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like `++NumOpsWantToKeepOrder[BestOrder]`.
1651	Sink the emplace_back to after the handling of non-simple loads?
1670–1671	"BestOrder" >> "CurrentOrder", or "VLOrder"?
1674	Reuse PointerOps and have Value P0,1 = PointerOps.front,back() instead of LoadInst L0,1 just to get their PointerOperand (and SCEV) later?
1683	Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.
2038	Comment that BestOrder is initialized to invalid values. Perhaps set `E = VL.size()` here and assign `E + 1`, to match the later checks for initialized/unset values.
2057	Can simplify by checking if BestOrder is the identity permutation at the end, as done at the end of sortPtrAccesses(); using `getExtractIndex(Inst)` which returns Idx even if it's equal to I. Better rename BestOrder here too.
3113	Capture this "inversePermutation" in a method, to be called again below?
3113–3120	InsertPoint may be set twice to VL0?
3331	Should we first inverse the permutation and then take its front()? Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.
4949	If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0]. Provide the generalization below to any permutation when the assert can be dropped, i.e., in a separate patch which handles this TODO?
4951	Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences).
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24	This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.
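
The pigeonhole check suggested for LoopAccessAnalysis.h line 679 can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `permutedConsecutiveSlots` is hypothetical, and plain `int64_t` element offsets stand in for the SCEV-based pointer differences the real code would compute.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not part of LLVM. Offsets[I] is the element offset of
// the I-th access from some common base (an assumption replacing SCEV diffs).
// Returns Slot, where Slot[I] is the position of access I inside one
// consecutive run, or std::nullopt if the accesses have gaps or duplicates.
std::optional<std::vector<unsigned>>
permutedConsecutiveSlots(const std::vector<int64_t> &Offsets) {
  if (Offsets.empty())
    return std::nullopt;
  const std::size_t N = Offsets.size();
  // First scan: find the access with the minimal address.
  const int64_t Min = *std::min_element(Offsets.begin(), Offsets.end());
  std::vector<bool> Taken(N, false);
  std::vector<unsigned> Slot(N);
  // Second scan: drop each access into its pigeonhole.
  for (std::size_t I = 0; I < N; ++I) {
    const int64_t Diff = Offsets[I] - Min; // never negative
    if (Diff >= static_cast<int64_t>(N))
      return std::nullopt; // a gap: the run would not fit in N slots
    if (Taken[Diff])
      return std::nullopt; // a zero gap: replicated reference
    Taken[Diff] = true;
    Slot[I] = static_cast<unsigned>(Diff);
  }
  return Slot;
}
```

For the accesses a[i+2], a[i+0], a[i+1], a[i+3] this yields the slots 2, 0, 1, 3, while any input with a gap or a repeated offset is rejected in a single linear pass.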

In D43776#1031044, @Ayal wrote:
This patch addresses the following TODO, plus handles extracts:
// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

include/llvm/Analysis/LoopAccessAnalysis.h
679	The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads\|stores. Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.
lib/Analysis/LoopAccessAnalysis.cpp
1113	Tried, it does not work in many cases.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1227	I don't want to add a new entry for operations that do not require reordering. I'd better rework the code in another way.
1604–1605	I think it does not matter because the current order is the best order.
1612	I need to use an iterator, will rework the code.
1683	Just like I said, I did it for future support of masked loads\|stores.
2057	Checking the boolean flag is faster than performing N comparisons. That's why I'd prefer to leave it as is.
3113–3120	Missed it, thanks.
3331	There are several tests already that test that this is correct code. SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll etc.
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24	Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this).
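
The split described in these replies (a general index sort, plus a separate consecutiveness check on the caller side) might look roughly like the sketch below. It is not the patch's code: `sortedIndices` and `spansExactlyOneVector` are hypothetical names, and integer offsets relative to the first pointer are assumed in place of SCEV differences.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the sorting step: returns the indices of Offsets
// in ascending offset order. Gaps and ties are allowed; ties keep input order.
std::vector<unsigned> sortedIndices(const std::vector<int64_t> &Offsets) {
  std::vector<unsigned> Idx(Offsets.size());
  std::iota(Idx.begin(), Idx.end(), 0u);
  std::stable_sort(Idx.begin(), Idx.end(), [&Offsets](unsigned L, unsigned R) {
    return Offsets[L] < Offsets[R];
  });
  return Idx;
}

// Hypothetical caller-side check: the distance between the largest and the
// smallest offset must be exactly N - 1 elements.
bool spansExactlyOneVector(const std::vector<int64_t> &Offsets) {
  if (Offsets.empty())
    return false;
  auto MinMax = std::minmax_element(Offsets.begin(), Offsets.end());
  return *MinMax.second - *MinMax.first ==
         static_cast<int64_t>(Offsets.size()) - 1;
}
```

With offsets {0, -2, -1, 1} (a[i+2], a[i+0], a[i+1], a[i+3] measured from the first pointer) the sort gives the indices 1, 2, 0, 3 and the span check passes. Note that the span check alone also accepts {0, 0, 3, 3}, so it does not rule out replicated references.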

Updates after review

Harbormaster completed remote builds in B15891: Diff 137784. Mar 9 2018, 10:02 AM

Ayal added inline comments. Mar 9 2018, 2:43 PM

include/llvm/Analysis/LoopAccessAnalysis.h
679	By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses. Masked loads and stores (will) also require that all pointers be within range of a single vector, right?
lib/Analysis/LoopAccessAnalysis.cpp
1113	Then that should be fixed, worth opening a PR?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1227	Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside `NumOpsWantsToKeepOrder` map, but we're instead keeping it outside using `NumOpsWantToKeepOriginalOrder` counter.
1604–1605	The current order may not be the best order, the first time the tree is built, right?
1621	Very good, thanks. Can `CurrentOrder` be used instead of `I->getFirst()`? Can `++NumOpsWantToKeepOrder[CurrentOrder]` work? After all the trouble...
2057	Right; it could simplify the code though, and N is usually very small.
3027	May as well use `E` when resizing `Mask`.
3358	Another inversePermutation?
5693	Fold by feeding the constructor with the size.
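
As an illustration of the "inversePermutation" helper requested above, a minimal sketch follows. The function name comes from the review comment, but the body and the use of `std::vector` are assumptions, not the code that landed.

```cpp
#include <cassert>
#include <vector>

// Sketch only: builds Mask such that Mask[Order[I]] == I for every I.
// Order is assumed to be a permutation of 0..E-1.
std::vector<unsigned> inversePermutation(const std::vector<unsigned> &Order) {
  const unsigned E = static_cast<unsigned>(Order.size());
  std::vector<unsigned> Mask(E);
  for (unsigned I = 0; I < E; ++I) {
    assert(Order[I] < E && "Order must be a permutation of 0..E-1");
    Mask[Order[I]] = I;
  }
  return Mask;
}
```

For example, the order 1, 2, 0, 3 inverts to 2, 0, 1, 3, and inverting twice returns the original order, which is why factoring the loop into one helper is safe to reuse at both call sites.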

ABataev added inline comments. Mar 12 2018, 9:35 AM

include/llvm/Analysis/LoopAccessAnalysis.h
679	I see. Right
lib/Analysis/LoopAccessAnalysis.cpp
1113	I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the `GetUnderlyingObject()` and kind of a `GEP %ptr, 0, n`, where `n` is constant. But this function does not work if the first load element is from `GEP %ptr, 0, m`, not from `%ptr`, and others are from `GEP %ptr, 0, m+1`, `GEP %ptr, 0, m+2` etc.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1227	Added a comment to the declaration of `NumOpsWantToKeepOrder`
1604–1605	Yes, but this is the best order for this particular bundle. Anyway, I renamed it.
1621	No, we cannot use `CurrentOrder`. I tried to reduce the memory usage and instead of copying the `CurrentOrder` in the `newTreeEntry()` function I keep the ArrayRef for this order. But `CurrentOrder` is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map)
3027	Oh, yes, missed it, thanks.
3358	Yup, thanks.
5693	Reworked it, thanks.
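
The lifetime concern in the reply on line 1621 can be illustrated with a small sketch. Everything here is an assumption made for illustration: `TreeEntrySketch` and `OrderCounter` are hypothetical, and a node-based `std::map` (whose keys stay at a fixed address once inserted) stands in for whatever map type the patch uses.

```cpp
#include <cassert>
#include <map>
#include <vector>

using OrdersType = std::vector<unsigned>;

// Hypothetical tree entry that keeps only a non-owning view of its order.
struct TreeEntrySketch {
  const OrdersType *Order = nullptr;
};

// Hypothetical counter: the map key owns the order, entries point at the key.
class OrderCounter {
  std::map<OrdersType, unsigned> NumOpsWantToKeepOrder;

public:
  // Bumps the count for CurrentOrder and returns a reference to the stored
  // key, which outlives the caller's local CurrentOrder.
  const OrdersType &record(const OrdersType &CurrentOrder) {
    auto It = NumOpsWantToKeepOrder.try_emplace(CurrentOrder, 0u).first;
    ++It->second;
    return It->first;
  }

  unsigned count(const OrdersType &Order) const {
    auto It = NumOpsWantToKeepOrder.find(Order);
    return It == NumOpsWantToKeepOrder.end() ? 0u : It->second;
  }
};
```

A caller would write something like `Entry.Order = &Counter.record(Local);` inside the scope of a temporary `Local`. The entry stays valid after `Local` is destroyed, and two bundles that want the same order end up sharing one stored key.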

Update after review

Harbormaster completed remote builds in B15973: Diff 138035. Mar 12 2018, 9:36 AM

mssimpso added a subscriber: mssimpso. Mar 13 2018, 11:48 AM

Ping

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.
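
The three cases listed above (tie, less frequent, more frequent) can be made concrete with a sketch of the selection rule. This mirrors the `bestOrder()` logic shown in the diff, but the free function `pickBestOrder`, the `std::map`, and `std::optional` are assumptions used to keep the example self-contained.

```cpp
#include <algorithm>
#include <map>
#include <optional>
#include <vector>

using OrdersType = std::vector<unsigned>;

// Sketch: picks the most requested non-identity order, but only if it is
// requested strictly more often than the original (identity) order.
std::optional<OrdersType>
pickBestOrder(const std::map<OrdersType, unsigned> &NumOpsWantToKeepOrder,
              unsigned NumOpsWantToKeepOriginalOrder) {
  auto I = std::max_element(
      NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
      [](const auto &D1, const auto &D2) { return D1.second < D2.second; });
  // A tie goes to the original order, so no reordering is reported.
  if (I == NumOpsWantToKeepOrder.end() ||
      I->second <= NumOpsWantToKeepOriginalOrder)
    return std::nullopt;
  return I->first;
}
```

With a single candidate order requested twice, the function reports that order when the original order is wanted once, and reports nothing when the original order is wanted two or three times.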

include/llvm/Analysis/LoopAccessAnalysis.h
679	Check for zero gaps, i.e., replicated references? These are currently not supported, when checking `canReuseExtract()`.
lib/Analysis/LoopAccessAnalysis.cpp
1113	ok, right; `computeConstantDifference()` is lightweight. But it would be good to refactor a more time consuming variant which checks if `getMinusSCEV` returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize.
lib/Transforms/Vectorize/SLPVectorizer.cpp
458	Add a message to the assert.
1250	"\a" >> "\p"
1254	"DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?
1615	May be helpful to also dump CurrentOrder into dbgs().
1621	OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that `newTreeEntry()` should be given the object stored inside `NumOpsWantToKeepOrder` rather than the equivalent temporary `CurrentOrder`. So `++NumOpsWantToKeepOrder[CurrentOrder]` can work, but then `newTreeEntry()` will need to be given `NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()`?
1649	Fold the size into the constructor.
1683	OK, so `sortPtrAccesses()` can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an `isPermutedConsecutiveAccess()` method which would call `sortPtrAccesses()`.
2039	"// Assign initial value to of all items to E + 1 so we can check if the" >> "// Assign to all items the initial value `E + 1` so we can check if the"
2042	"(if at least one element of ... )." >> ", by checking that no element of `CurrentOrder` still has value `E + 1`." Note that there is no such check later.
2072	It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (`I < E`) clear CurrentOrder and return false.

In D43776#1047322, @Ayal wrote:

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

We already have a couple of tests for extractvalue: ARM/sroa.ll and X86/insertvalue.ll. They already have tests with a different order of the extractvalue instructions.

include/llvm/Analysis/LoopAccessAnalysis.h
679	Agree, will add checks
lib/Analysis/LoopAccessAnalysis.cpp
1113	Added checks for the address space; the comparison of underlying objects was already there.
lib/Transforms/Vectorize/SLPVectorizer.cpp
458	Ok
2042	This check is not required, it is checked automatically. We first check that the number of extract elements is the same as the vector length. If we later try to write the index to an element that does not equal `E+1`, it means that at least one element will still have `E+1` as its value and we have at least 2 elements with the same index.
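
The counting argument in the reply on line 2042 can be checked with a small sketch. It is an illustration only: `collectExtractOrder` is a hypothetical name, and a vector of `std::optional<unsigned>` stands in for the indices that `getExtractIndex()` would return for each extract.

```cpp
#include <optional>
#include <vector>

// Sketch: ExtIdx[I] is the index extracted by the I-th instruction, or
// std::nullopt if it is not a constant. Returns true only for the identity
// order. On a non-identity permutation it returns false and leaves the
// permutation in CurrentOrder; on any failure it returns false and clears it.
bool collectExtractOrder(const std::vector<std::optional<unsigned>> &ExtIdx,
                         std::vector<unsigned> &CurrentOrder) {
  const unsigned E = static_cast<unsigned>(ExtIdx.size());
  CurrentOrder.assign(E, E + 1); // E + 1 marks a slot as not yet written
  bool ShouldKeepOrder = true;
  for (unsigned I = 0; I < E; ++I) {
    // Reject unknown indices, out-of-range indices, and slots written twice.
    if (!ExtIdx[I] || *ExtIdx[I] >= E || CurrentOrder[*ExtIdx[I]] != E + 1) {
      CurrentOrder.clear();
      return false;
    }
    CurrentOrder[*ExtIdx[I]] = I;
    if (*ExtIdx[I] != I)
      ShouldKeepOrder = false;
  }
  // E writes into E slots with no slot written twice: every slot is filled,
  // so no separate scan for leftover E + 1 values is needed.
  return ShouldKeepOrder;
}
```

Because exactly `E` indices are written into `E` slots and a repeated slot aborts immediately, reaching the end of the loop already implies that no slot still holds `E + 1`.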

Updated after review

Looks good to me, thanks for addressing the issues, have only a few last minor suggestions.

In D43776#1032937, @ABataev wrote:
In D43776#1031044, @Ayal wrote:
This patch addresses the following TODO, plus handles extracts:
// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?
Yes, but this must be in a different patch, not this one.

Sure, please add a TODO. This patch makes such rebuilds more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
673	Sounds better to specify "Returns 'true' if ..., otherwise returns 'false'" ?
lib/Analysis/LoopAccessAnalysis.cpp
1113	Right, the comment was about refactoring it altogether into a separate, more time consuming variant of `computeConstantDifference()` that checks everything; can alternatively leave a TODO for now?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1251	`/// DirectOrderNum.` >> `/// NumOpsWantToKeepOriginalOrder.`
1626	Would the following work and be easier to read?
	++NumOpsWantToKeepOrder[CurrentOrder];
	auto &StoredCurrentOrder = NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst();
	newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies, StoredCurrentOrder);
or alternatively rename `I` to something more meaningful like `StoredCurrentOrderAndNum`.
2063	May as well do `CurrentOrder.assign(E, E+1);`
2072	The suggestion for easier reading was for something like:
	for (unsigned I = 0; I < E; ++I) {
	  auto Inst = cast<Instruction>(VL[I]);
	  if (Inst->getOperand(0) != Vec)
	    break;
	  Optional<unsigned> Idx = getExtractIndex(Inst);
	  if (!Idx)
	    break;
	  const unsigned ExtIdx = Idx;
	  if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
	    break;
	  CurrentOrder[ExtIdx] = I;
	  if (ExtIdx != I)
	    ShouldKeepOrder = false;
	}
	if (I < E) {
	  CurrentOrder.clear();
	  return false;
	}
	return ShouldKeepOrder;

This revision is now accepted and ready to land. Apr 1 2018, 12:29 AM

Closed by commit rL328980: [SLP] Fix PR36481: vectorize reassociated instructions. (authored by ABataev). Apr 2 2018, 7:54 AM

This revision was automatically updated to reflect the committed changes.

ABataev marked 5 inline comments as done.

Revision Contents

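Path	Size
include/llvm/Analysis/LoopAccessAnalysis.h	14 lines
lib/Analysis/LoopAccessAnalysis.cpp	51 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp	303 lines
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll	13 lines
test/Transforms/SLPVectorizer/X86/extract.ll	11 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	25 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll	46 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll	27 lines
test/Transforms/SLPVectorizer/X86/jumbled-load.ll	51 lines
test/Transforms/SLPVectorizer/X86/reassociated-loads.ll	107 lines
test/Transforms/SLPVectorizer/X86/store-jumbled.ll	25 lines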

Diff 137784

include/llvm/Analysis/LoopAccessAnalysis.h

...
/// If necessary this method will version the stride of the pointer according
/// to \p PtrToStride and therefore add further predicates to \p PSE.
/// The \p Assume parameter indicates if we are allowed to make additional
/// run-time assumptions.
int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
                     const ValueToValueMap &StridesMap = ValueToValueMap(),
                     bool Assume = false, bool ShouldCheckWrap = true);

				/// \brief Attempt to sort the pointers in \p VL and return the sorted indices
				/// in \p SortedIndices, if reordering is required.
				///
				/// Returns 'false' if sorting is not legal, otherwise returns 'true'.
				///
				/// For example, for a given \p VL of memory accesses in program order, a[i+2],
				/// a[i+0], a[i+1] and a[i+3], this function will sort the \p VL and save the
				/// sorted indices in \p SortedIndices as a[i+0], a[i+1], a[i+2], a[i+3] and
				/// saves the mask for actual memory accesses in program order in
				/// \p SortedIndices as <2,0,1,3>
				bool sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE,
				SmallVectorImpl<unsigned> &SortedIndices);

/// \brief Returns true if the memory operations \p A and \p B are consecutive.
/// This is a simple API that does not depend on the analysis pass.
bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                         ScalarEvolution &SE, bool CheckType = true);

/// \brief This analysis provides dependence information for the memory accesses
/// of a loop.
///
...

lib/Analysis/LoopAccessAnalysis.cpp

...
    if (Assume) {
      PSE.setNoOverflow(Ptr, SCEVWrapPredicate::IncrementNUSW);
    } else
      return 0;
  }

  return Stride;
}

		bool llvm::sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
		ScalarEvolution &SE,
		SmallVectorImpl<unsigned> &SortedIndices) {
		assert(llvm::all_of(
		VL, [](const Value *V) { return V->getType()->isPointerTy(); }) &&
		"Expected list of pointer operands.");
		SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
		OffValPairs.reserve(VL.size());

		// Walk over the pointers, and map each of them to an offset relative to
		// first pointer in the array.
		Value *Ptr0 = VL[0];
		const SCEV *Scev0 = SE.getSCEV(Ptr0);
		Value *Obj0 = GetUnderlyingObject(Ptr0, DL);

		for (auto *Ptr : VL) {
		// If a pointer refers to a different underlying object, bail - the
		// pointers are by definition incomparable.
		Value *CurrObj = GetUnderlyingObject(Ptr, DL);
		if (CurrObj != Obj0)
		return false;

		const SCEV *Scev = SE.getSCEV(Ptr);
		const auto *Diff = dyn_cast<SCEVConstant>(SE.getMinusSCEV(Scev, Scev0));
		// The pointers may not have a constant offset from each other, or SCEV
		// may just not be smart enough to figure out they do. Regardless,
		// there's nothing we can do.
		if (!Diff)
		return false;

		OffValPairs.emplace_back(Diff->getAPInt().getSExtValue(), Ptr);
		}
		SortedIndices.clear();
		SortedIndices.resize(VL.size());
		std::iota(SortedIndices.begin(), SortedIndices.end(), 0);

		// Sort the memory accesses and keep the order of their uses in UseOrder.
		std::stable_sort(SortedIndices.begin(), SortedIndices.end(),
		[&OffValPairs](unsigned Left, unsigned Right) {
		return OffValPairs[Left].first < OffValPairs[Right].first;
		});

		// Check if the order is consecutive already.
		if (llvm::all_of(SortedIndices, [&SortedIndices](const unsigned I) {
		return I == SortedIndices[I];
		}))
		SortedIndices.clear();

		return true;
		}

/// Take the pointer operand from the Load/Store instruction.
/// Returns NULL if this is not a valid Load/Store instruction.
static Value *getPointerOperand(Value *I) {
  if (auto *LI = dyn_cast<LoadInst>(I))
    return LI->getPointerOperand();
  if (auto *SI = dyn_cast<StoreInst>(I))
    return SI->getPointerOperand();
  return nullptr;
...

lib/Transforms/Vectorize/SLPVectorizer.cpp

...
static bool allSameType(ArrayRef<Value *> VL) {
  Type *Ty = VL[0]->getType();
  for (int i = 1, e = VL.size(); i < e; i++)
    if (VL[i]->getType() != Ty)
      return false;

  return true;
}

 /// \returns True if Extract{Value,Element} instruction extracts element Idx.
-static bool matchExtractIndex(Instruction *E, unsigned Idx, unsigned Opcode) {
+static Optional<unsigned> getExtractIndex(Instruction *E) {
+  unsigned Opcode = E->getOpcode();
   assert(Opcode == Instruction::ExtractElement ||
          Opcode == Instruction::ExtractValue);
   if (Opcode == Instruction::ExtractElement) {
-    ConstantInt *CI = dyn_cast<ConstantInt>(E->getOperand(1));
-    return CI && CI->getZExtValue() == Idx;
-  } else {
-    ExtractValueInst *EI = cast<ExtractValueInst>(E);
-    return EI->getNumIndices() == 1 && *EI->idx_begin() == Idx;
+    auto *CI = dyn_cast<ConstantInt>(E->getOperand(1));
+    if (!CI)
+      return None;
+    return CI->getZExtValue();
   }
+  ExtractValueInst *EI = cast<ExtractValueInst>(E);
+  if (EI->getNumIndices() != 1)
+    return None;
+  return *EI->idx_begin();
 }

/// \returns True if in-tree use also needs extract. This refers to
/// possible scalar operand in vectorized instruction.
static bool InTreeUserNeedToExtract(Value *Scalar, Instruction *UserInst,
                                    TargetLibraryInfo *TLI) {
  unsigned Opcode = UserInst->getOpcode();
  switch (Opcode) {
...
public:

  /// Clear the internal data structures that are created by 'buildTree'.
  void deleteTree() {
    VectorizableTree.clear();
    ScalarToTreeEntry.clear();
    MustGather.clear();
    ExternalUses.clear();
    NumOpsWantToKeepOrder.clear();
		DirectOrderNum = 0;
    for (auto &Iter : BlocksSchedules) {
      BlockScheduling *BS = Iter.second.get();
      BS->clear();
    }
    MinBWs.clear();
  }

  unsigned getTreeSize() const { return VectorizableTree.size(); }

  /// \brief Perform LICM and CSE on the newly generated gather sequences.
  void optimizeGatherSequence();

/// \returns true if it is beneficial to reverse the vector order.		/// \returns The best order of instructions for vectorization.
bool shouldReorder() const {		Optional<ArrayRef<unsigned>> bestOrder() const {
		AyalUnsubmitted Not Done Reply Inline Actions The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute. Ayal: The order which bestOrder() provides is then used to form a vector of instructions. Suggest to…
return std::accumulate(		auto I = std::max_element(
NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(), 0,		NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
[](int Val1,		[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
const decltype(NumOpsWantToKeepOrder)::value_type &Val2) {		const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
return Val1 + (Val2.second < 0 ? 1 : -1);		return D1.second < D2.second;
}) > 0;		});
		if (I == NumOpsWantToKeepOrder.end() || I->getSecond() <= DirectOrderNum)
		return None;

		return makeArrayRef(I->getFirst());
}		}
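The selection rule in bestOrder() can be illustrated outside LLVM. The following is a minimal sketch, assuming std::map and std::vector as stand-ins for DenseMap and OrdersType; the names pickBestOrder, OrderCounts and IdentityCount are hypothetical and do not appear in the patch. An empty result plays the role of None.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Order = std::vector<unsigned>;
using OrderCount = std::pair<const Order, unsigned>;

// Hypothetical stand-in for bestOrder(): OrderCounts mirrors
// NumOpsWantToKeepOrder and IdentityCount mirrors DirectOrderNum.
// Returns an empty Order when the original order should be kept.
Order pickBestOrder(const std::map<Order, unsigned> &OrderCounts,
                    unsigned IdentityCount) {
  auto It = std::max_element(
      OrderCounts.begin(), OrderCounts.end(),
      [](const OrderCount &A, const OrderCount &B) {
        return A.second < B.second;
      });
  // Keep the original order if no permutation was recorded, or if the
  // identity order is wanted at least as often as the most popular one.
  if (It == OrderCounts.end() || It->second <= IdentityCount)
    return Order();
  return It->first;
}
```

With counts of 3 for {1, 0, 3, 2} and 1 for {3, 2, 1, 0}, an identity count of 2 selects {1, 0, 3, 2}, while an identity count of 3 keeps the original order.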

/// \return The vector element size in bits to use when vectorizing the		/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of		/// expression tree ending at \p V. If V is a store, the size is the width of
/// the stored value. Otherwise, the size is the width of the largest loaded		/// the stored value. Otherwise, the size is the width of the largest loaded
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);
[... 30 lines not shown ...]	private:
bool areAllUsersVectorized(Instruction *I) const;		bool areAllUsersVectorized(Instruction *I) const;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns true if the ExtractElement/ExtractValue instructions in \p VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a
bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;		/// vector) and sets \p CurrentOrder to the identity permutation; otherwise
		/// returns false, setting \p CurrentOrder to either an empty vector or a
		Ayal (Done): "and returns the mask for reordering operations, if it allows" should specify more accurately, something like: "...and sets \p BestOrder to the identity permutation; otherwise returns False, setting \p BestOrder to either an empty vector or a non-identity permutation that allows..."
		/// non-identity permutation that allows to reuse extract instructions.
		bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
		SmallVectorImpl<unsigned> &CurrentOrder) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.
Value *vectorizeTree(TreeEntry *E);		Value *vectorizeTree(TreeEntry *E);

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL);

/// \returns the scalarization cost for this type. Scalarization in this		/// \returns the scalarization cost for this type. Scalarization in this
[... 47 lines not shown ...]	struct TreeEntry {
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather = false;		bool NeedToGather = false;

/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
SmallVector<unsigned, 4> ReuseShuffleIndices;		SmallVector<unsigned, 4> ReuseShuffleIndices;

		/// Does this entry require reordering?
		ArrayRef<unsigned> ReorderIndices;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
std::vector<TreeEntry> &Container;		std::vector<TreeEntry> &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<int, 1> UserTreeIndices;		SmallVector<int, 1> UserTreeIndices;
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,		void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None) {		ArrayRef<unsigned> ReuseShuffleIndices = None,
		ArrayRef<unsigned> ReorderIndices = None) {
VectorizableTree.emplace_back(VectorizableTree);		VectorizableTree.emplace_back(VectorizableTree);
int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		TreeEntry *Last = &VectorizableTree[idx];
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),		Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());		ReuseShuffleIndices.end());
		Last->ReorderIndices = ReorderIndices;
if (Vectorized) {		if (Vectorized) {
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");		assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}
[... 444 lines not shown ...]	#endif
MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;		MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

/// Number of operation bundles that contain consecutive operations - number		using OrdersType = SmallVector<unsigned, 4>;
		Ayal (Not Done): Update above documentation.
/// of operation bundles that contain consecutive operations in reversed		/// A DenseMapInfo implementation for holding DenseMaps and DenseSets of
/// order.		/// sorted SmallVectors of unsigned.
DenseMap<unsigned, int> NumOpsWantToKeepOrder;		struct OrdersTypeDenseMapInfo {
		Ayal (Done): Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do ++NumOpsWantsToKeepOrder[CurrentOrder]; See, e.g., `UniquifierDenseMapInfo` in LSR.
		static OrdersType getEmptyKey() {
		Ayal (Not Done): Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., `NumOpsWantToKeepOriginalOrder`. Can add that `NumOpsWantToKeepOriginalOrder` holds the count for the identity permutation, instead of holding this count inside `NumOpsWantsToKeepOrder` along with all other permutations.
		ABataev (Author, Not Done): I don't want to add the new entry for operations that do not require reordering. I'd better the code in another way.
		Ayal (Not Done): Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside the `NumOpsWantsToKeepOrder` map, but we're instead keeping it outside using the `NumOpsWantToKeepOriginalOrder` counter.
		ABataev (Author, Not Done): Added a comment to the declaration of `NumOpsWantToKeepOrder`.
		OrdersType V;
		V.push_back(~1U);
		return V;
		}

		static OrdersType getTombstoneKey() {
		OrdersType V;
		V.push_back(~2U);
		return V;
		}

		static unsigned getHashValue(const OrdersType &V) {
		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
		}

		static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {
		return LHS == RHS;
		}
		};

		/// Contains orders of operations along with the number of bundles that have
		/// operations in this order.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
		Ayal (Done): "\a" >> "\p"
		/// Number of bundles that do not require reordering.
		Ayal (Done): `/// DirectOrderNum.` >> `/// NumOpsWantToKeepOriginalOrder.`
		unsigned DirectOrderNum = 0;

// Analysis and block reference.		// Analysis and block reference.
		Ayal (Done): "DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
AssumptionCache *AC;		AssumptionCache *AC;
[... 333 lines not shown ...]	case Instruction::PHI: {
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, VL0);		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
		Ayal (Done): BestOrder >> CurrentOrder?
		ABataev (Author, Not Done): I think it does not matter because the current order is the best order.
		Ayal (Not Done): The current order may not be the best order, the first time the tree is built, right?
		ABataev (Author, Not Done): Yes, but this is the best order for this particular bundle. Anyway, I renamed it.
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");
++NumOpsWantToKeepOrder[S.Opcode];		++DirectOrderNum;
} else {		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
		Ayal (Done): Better early exit by returning here.
SmallVector<Value *, 4> ReverseVL(VL.rbegin(), VL.rend());		ReuseShuffleIndicies);
if (canReuseExtract(ReverseVL, VL0))		return;
--NumOpsWantToKeepOrder[S.Opcode];
BS.cancelScheduling(VL, VL0);
}		}
		Ayal (Not Done): This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like `++NumOpsWantToKeepOrder[BestOrder]`.
		ABataev (Author, Not Done): I need to use an iterator, will rework the code.
newTreeEntry(VL, Reuse, UserTreeIdx, ReuseShuffleIndicies);		if (!CurrentOrder.empty()) {
		DEBUG(dbgs()
		<< "SLP: Reusing or shuffling of reordered extract sequence.\n");
		Ayal (Done): May be helpful to also dump CurrentOrder into dbgs().
		// Insert new order with initial value 0, if it does not exist,
		// otherwise return the iterator to the existing one.
		auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
		++I->getSecond();
		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
		I->getFirst());
		Ayal (Not Done): Very good, thanks. Can `CurrentOrder` be used instead of `I->getFirst()`? Can `++NumOpsWantToKeepOrder[CurrentOrder]` work? After all the trouble...
		ABataev (Author, Not Done): No, we cannot use `CurrentOrder`. I tried to reduce the memory usage: instead of copying `CurrentOrder` in the `newTreeEntry()` function, I keep an ArrayRef to this order. But `CurrentOrder` is a local variable and it will be destroyed when we leave its declaration scope, so we would keep a reference to invalid memory. Instead, I need to use a reference that will exist until the end of the vectorization process (the key of the map).
		Ayal (Not Done): OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that `newTreeEntry()` should be given the object stored inside `NumOpsWantToKeepOrder` rather than the equivalent temporary `CurrentOrder`. So `++NumOpsWantToKeepOrder[CurrentOrder]` can work, but then `newTreeEntry()` will need to be given `NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()`?
		return;
		}
		DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
		newTreeEntry(VL, /*Vectorized=*/false, UserTreeIdx, ReuseShuffleIndicies);
		BS.cancelScheduling(VL, VL0);
		Ayal (Not Done): Would the following work and be easier to read?
		  ++NumOpsWantToKeepOrder[CurrentOrder];
		  auto &StoredCurrentOrder = NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst();
		  newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies, StoredCurrentOrder);
		or alternatively rename `I` to something more meaningful like `StoredCurrentOrderAndNum`.
return;		return;
}		}
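The lifetime concern raised in the discussion above (a tree entry must not keep a view of the local CurrentOrder) can be sketched in isolation. This is a minimal illustration, assuming a std::map whose nodes keep their address across insertions; it makes no claim about the reference stability of the DenseMap used in the patch. The names Entry and recordOrder are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <vector>

using Order = std::vector<unsigned>;

// Hypothetical tree entry that only views an order owned elsewhere.
struct Entry {
  const Order *Reorder = nullptr; // Non-owning pointer to a key in the map.
};

// Counts one more bundle wanting this order and returns a pointer to the
// map-owned copy, which outlives the caller's local Current.
const Order *recordOrder(std::map<Order, unsigned> &Counts,
                         const Order &Current) {
  // emplace inserts {Current, 0} if absent, otherwise finds the existing key.
  auto It = Counts.emplace(Current, 0u).first;
  ++It->second;
  return &It->first;
}
```

The entry stores the returned pointer instead of the address of the local order, so the view stays valid after the local goes out of scope, and equal orders share a single stored key.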
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load. For example, we don't want to vectorize loads that are smaller		// load. For example, we don't want to vectorize loads that are smaller
// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
// treats loading/storing it as an i8 struct. If we vectorize loads/stores		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
// from such a struct, we read/write packed bits disagreeing with the		// from such a struct, we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		SmallVector<Value *, 4> PointerOps;
LoadInst *L = cast<LoadInst>(VL[i]);		PointerOps.reserve(VL.size());
		Ayal (Done): Fold the size into the constructor.
		for (Value *V : VL) {
		auto *L = cast<LoadInst>(V);
		Ayal (Done): Sink the emplace_back to after the handling of non-simple loads?
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
		PointerOps.emplace_back(L->getPointerOperand());
}		}

// Check if the loads are consecutive, reversed, or neither.		OrdersType CurrentOrder;
// TODO: What we really want is to sort the loads, but for now, check		// Check the order of pointer operands.
// the two likely directions.		if (llvm::sortPtrAccesses(PointerOps, DL, SE, CurrentOrder)) {
bool Consecutive = true;		Value *Ptr0;
bool ReverseConsecutive = true;		Value *PtrN;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		if (CurrentOrder.empty()) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		Ptr0 = PointerOps.front();
Consecutive = false;		PtrN = PointerOps.back();
break;
} else {		} else {
ReverseConsecutive = false;		Ptr0 = PointerOps[CurrentOrder.front()];
		PtrN = PointerOps[CurrentOrder.back()];
		Ayal (Done): "BestOrder" >> "CurrentOrder", or "VLOrder"?
}		}
}		const SCEV *Scev0 = SE->getSCEV(Ptr0);
		const SCEV *ScevN = SE->getSCEV(PtrN);
		Ayal (Done): Reuse PointerOps and have Value P0,1 = PointerOps.front,back() instead of LoadInst L0,1 just to get their PointerOperand (and SCEV) later?
if (Consecutive) {		const auto *Diff =
++NumOpsWantToKeepOrder[S.Opcode];		dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));
newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);		uint64_t Size = DL->getTypeAllocSize(ScalarTy);
		// Check that the sorted loads are consecutive.
		if (Diff && Diff->getAPInt().getZExtValue() == (VL.size() - 1) * Size) {
		if (CurrentOrder.empty()) {
		// Original loads are consecutive and does not require reordering.
		++DirectOrderNum;
		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
		Ayal (Not Done): Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.
		ABataev (Author, Not Done): Just like I said, I did it for future support of masked loads|stores.
		Ayal (Not Done): OK, so `sortPtrAccesses()` can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an `isPermutedConsecutiveAccess()` method which would call `sortPtrAccesses()`.
		ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		} else {
}		// Need to reorder.
		auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
// If none of the load pairs were consecutive when checked in order,		++I->getSecond();
// check the reverse order.		newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
if (ReverseConsecutive)		ReuseShuffleIndicies, I->getFirst());
for (unsigned i = VL.size() - 1; i > 0; --i)		DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;
break;
}		}

if (ReverseConsecutive) {
--NumOpsWantToKeepOrder[S.Opcode];
newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: added a vector of reversed loads.\n");
return;		return;
}		}
		}

DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
return;		return;
}		}
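The load handling above sorts the pointer operands and then checks that the sorted accesses are consecutive. A rough standalone sketch of that idea follows, assuming plain byte offsets from a common base in place of SCEV differences. The function name borrows the reviewer's suggested `isPermutedConsecutiveAccess()`, but the body is hypothetical. Note one deliberate difference: the sketch checks every adjacent pair, whereas the patch compares only the last and first sorted pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns true if the accesses are consecutive once sorted by address.
// On success, Order is empty if the accesses are already ascending;
// otherwise Order[K] is the index in Offsets of the K-th lowest address.
// On failure, Order is left empty.
bool isPermutedConsecutiveAccess(const std::vector<int64_t> &Offsets,
                                 uint64_t ElemSize,
                                 std::vector<unsigned> &Order) {
  Order.clear();
  if (Offsets.empty())
    return false;
  Order.resize(Offsets.size());
  std::iota(Order.begin(), Order.end(), 0u);
  std::stable_sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
    return Offsets[A] < Offsets[B];
  });
  // Every sorted neighbour must be exactly one element apart.
  for (unsigned K = 1; K < Order.size(); ++K) {
    if (Offsets[Order[K]] - Offsets[Order[K - 1]] !=
        static_cast<int64_t>(ElemSize)) {
      Order.clear();
      return false;
    }
  }
  // An identity order is reported as empty, as CurrentOrder is in the patch.
  if (std::is_sorted(Offsets.begin(), Offsets.end()))
    Order.clear();
  return true;
}
```

For 4-byte elements, offsets {4, 0, 12, 8} yield the order {1, 0, 3, 2}, offsets {0, 4, 8, 12} yield an empty order, and offsets {0, 4, 16, 8} are rejected.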
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
[... 293 lines not shown ...]	if (ST) {
// Check that struct is homogeneous.		// Check that struct is homogeneous.
for (const auto *Ty : ST->elements())		for (const auto *Ty : ST->elements())
if (Ty != EltTy)		if (Ty != EltTy)
return 0;		return 0;
}		}
return N;		return N;
}		}

bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const {		bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
		SmallVectorImpl<unsigned> &CurrentOrder) const {
Instruction *E0 = cast<Instruction>(OpValue);		Instruction *E0 = cast<Instruction>(OpValue);
assert(E0->getOpcode() == Instruction::ExtractElement ||		assert(E0->getOpcode() == Instruction::ExtractElement ||
E0->getOpcode() == Instruction::ExtractValue);		E0->getOpcode() == Instruction::ExtractValue);
assert(E0->getOpcode() == getSameOpcode(VL).Opcode && "Invalid opcode");		assert(E0->getOpcode() == getSameOpcode(VL).Opcode && "Invalid opcode");
// Check if all of the extracts come from the same vector and from the		// Check if all of the extracts come from the same vector and from the
// correct offset.		// correct offset.
Value *Vec = E0->getOperand(0);		Value *Vec = E0->getOperand(0);

		CurrentOrder.clear();

// We have to extract from a vector/aggregate with the same number of elements.		// We have to extract from a vector/aggregate with the same number of elements.
unsigned NElts;		unsigned NElts;
if (E0->getOpcode() == Instruction::ExtractValue) {		if (E0->getOpcode() == Instruction::ExtractValue) {
const DataLayout &DL = E0->getModule()->getDataLayout();		const DataLayout &DL = E0->getModule()->getDataLayout();
NElts = canMapToVector(Vec->getType(), DL);		NElts = canMapToVector(Vec->getType(), DL);
if (!NElts)		if (!NElts)
return false;		return false;
// Check if load can be rewritten as load of vector.		// Check if load can be rewritten as load of vector.
LoadInst *LI = dyn_cast<LoadInst>(Vec);		LoadInst *LI = dyn_cast<LoadInst>(Vec);
if (!LI || !LI->isSimple() || !LI->hasNUses(VL.size()))		if (!LI || !LI->isSimple() || !LI->hasNUses(VL.size()))
return false;		return false;
} else {		} else {
NElts = Vec->getType()->getVectorNumElements();		NElts = Vec->getType()->getVectorNumElements();
}		}

if (NElts != VL.size())		if (NElts != VL.size())
return false;		return false;

// Check that all of the indices extract from the correct offset.		// Check that all of the indices extract from the correct offset.
for (unsigned I = 0, E = VL.size(); I < E; ++I) {		bool ShouldKeepOrder = true;
Instruction *Inst = cast<Instruction>(VL[I]);		unsigned E = VL.size();
		Ayal (Done): Comment that BestOrder is initialized to invalid values. Perhaps set `E = VL.size()` here and assign `E + 1`, to match the later checks for initialized/unset values.
if (!matchExtractIndex(Inst, I, Inst->getOpcode()))		// Assign initial value to of all items to E + 1 so we can check if the
		Ayal (Done): "// Assign initial value to of all items to E + 1 so we can check if the" >> "// Assign to all items the initial value `E + 1` so we can check if the"
		// extract instruction index was used already.
		// Also, later we can check that all the indices are used and we have a
		// consecutive access in the extract instructions (if at least one element of
		Ayal (Done): Quoting "(if at least one element of ... ).": suggest ", by checking that no element of `CurrentOrder` still has value `E + 1`." Note that there is no such check later.
		ABataev (Author, Not Done): This check is not required, it is checked automatically. We first check that the number of extract elements is the same as the vector length. If we later try to write an index to an element that does not equal `E+1`, it means that at least one element will still have `E+1` as its value and we have at least 2 elements with the same index.
		// the BestOrder is still E + 1, we don't have consecutive extract
		// instructions).
		CurrentOrder.assign(VL.size(), E + 1);
		for (unsigned I = 0; I < E; ++I) {
		auto *Inst = cast<Instruction>(VL[I]);
		if (Inst->getOperand(0) != Vec) {
		CurrentOrder.clear();
		return false;
		}
		Optional<unsigned> Idx = getExtractIndex(Inst);
		if (!Idx) {
		CurrentOrder.clear();
		return false;
		}
		const unsigned ExtIdx = *Idx;
		Ayal (Not Done): Can simplify by checking if BestOrder is the identity permutation at the end, as done at the end of sortPtrAccesses(); using `getExtractIndex(Inst)` which returns Idx even if it's equal to I. Better rename BestOrder here too.
		ABataev (Author, Not Done): Checking the boolean flag is faster than performing N comparisons. That's why I'd prefer to leave it as is.
		Ayal (Not Done): Right; it could simplify the code though, and N is usually very small.
		if (ExtIdx != I) {
		if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1) {
		CurrentOrder.clear();
return false;		return false;
if (Inst->getOperand(0) != Vec)		}
		ShouldKeepOrder = false;
		Ayal (Done): May as well do `CurrentOrder.assign(E, E+1);`
		CurrentOrder[ExtIdx] = I;
		} else {
		if (CurrentOrder[I] != E + 1) {
		CurrentOrder.clear();
return false;		return false;
}		}
		CurrentOrder[I] = I;
		}
		}
		Ayal (Done): It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (`I < E`) clear CurrentOrder and return false.
		Ayal (Done): The suggestion for easier reading was for something like:
		  for (unsigned I = 0; I < E; ++I) {
		    auto *Inst = cast<Instruction>(VL[I]);
		    if (Inst->getOperand(0) != Vec)
		      break;
		    Optional<unsigned> Idx = getExtractIndex(Inst);
		    if (!Idx)
		      break;
		    const unsigned ExtIdx = *Idx;
		    if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
		      break;
		    CurrentOrder[ExtIdx] = I;
		    if (ExtIdx != I)
		      ShouldKeepOrder = false;
		  }
		  if (I < E) {
		    CurrentOrder.clear();
		    return false;
		  }
		  return ShouldKeepOrder;

return true;		return ShouldKeepOrder;
}		}
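The sentinel-based check discussed in this function can be tried on plain indices. The sketch below is an illustration only: it assumes the extract indices have already been read into a vector, uses the hypothetical name classifyExtractOrder, and follows the loop shape from the review suggestion rather than the committed code.

```cpp
#include <cassert>
#include <vector>

// ExtIdx[I] is the lane extracted by the I-th instruction of the bundle.
// Returns true if ExtIdx is the identity. Otherwise returns false and
// leaves Order empty (not a permutation) or holding a non-identity order.
bool classifyExtractOrder(const std::vector<unsigned> &ExtIdx,
                          std::vector<unsigned> &Order) {
  const unsigned E = ExtIdx.size();
  // E + 1 is an invalid index that marks a slot as not yet written.
  Order.assign(E, E + 1);
  bool ShouldKeepOrder = true;
  unsigned I = 0;
  for (; I < E; ++I) {
    const unsigned Idx = ExtIdx[I];
    // Reject out-of-range lanes and lanes that were already used.
    if (Idx >= E || Order[Idx] != E + 1)
      break;
    Order[Idx] = I;
    if (Idx != I)
      ShouldKeepOrder = false;
  }
  if (I < E) {
    Order.clear();
    return false;
  }
  return ShouldKeepOrder;
}
```

Because exactly E indices are written into E slots and duplicates are rejected, a loop that completes has filled every slot, which is why no separate scan for leftover `E + 1` values is needed.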

bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {		bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {
return I->hasOneUse() ||		return I->hasOneUse() ||
std::all_of(I->user_begin(), I->user_end(), [this](User *U) {		std::all_of(I->user_begin(), I->user_end(), [this](User *U) {
return ScalarToTreeEntry.count(U) > 0;		return ScalarToTreeEntry.count(U) > 0;
});		});
}		}
[... 85 lines not shown ...]	case Instruction::ExtractElement:
Idx = IO->getZExtValue();		Idx = IO->getZExtValue();
} else {		} else {
--Idx;		--Idx;
}		}
ReuseShuffleCost +=		ReuseShuffleCost +=
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, Idx);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, Idx);
}		}
}		}
if (canReuseExtract(VL, S.OpValue)) {		if (!E->NeedToGather) {
int DeadCost = ReuseShuffleCost;		int DeadCost = ReuseShuffleCost;
		if (!E->ReorderIndices.empty()) {
		// TODO: Merge this shuffle with the ReuseShuffleCost.
		DeadCost += TTI->getShuffleCost(
		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
		}
for (unsigned i = 0, e = VL.size(); i < e; ++i) {		for (unsigned i = 0, e = VL.size(); i < e; ++i) {
Instruction *E = cast<Instruction>(VL[i]);		Instruction *E = cast<Instruction>(VL[i]);
// If all users are going to be vectorized, instruction can be		// If all users are going to be vectorized, instruction can be
// considered as dead.		// considered as dead.
// The same, if have only one user, it will be vectorized for sure.		// The same, if have only one user, it will be vectorized for sure.
if (areAllUsersVectorized(E))		if (areAllUsersVectorized(E))
// Take credit for instruction that will become dead.		// Take credit for instruction that will become dead.
DeadCost -=		DeadCost -=
[... 146 lines not shown ...]	case Instruction::Load: {
ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *		ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy,		TTI->getMemoryOpCost(Instruction::Load, ScalarTy,
alignment, 0, VL0);		alignment, 0, VL0);
}		}
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0, VL0);		VecTy, alignment, 0, VL0);
if (!isConsecutiveAccess(VL[0], VL[1], DL, SE)) {		if (!E->ReorderIndices.empty()) {
		// TODO: Merge this shuffle with the ReuseShuffleCost.
VecLdCost += TTI->getShuffleCost(		VecLdCost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
return ReuseShuffleCost + VecLdCost - ScalarLdCost;		return ReuseShuffleCost + VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
▲ Show 20 Lines • Show All 660 Lines • ▼ Show 20 Lines	if (!ReuseShuffleIndicies.empty()) {
if (auto *I = dyn_cast<Instruction>(V)) {		if (auto *I = dyn_cast<Instruction>(V)) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
return V;		return V;
}		}

		static void inversePermutation(ArrayRef<unsigned> Indices,
		SmallVectorImpl<unsigned> &Mask) {
		Mask.clear();
		const unsigned E = Indices.size();
		Mask.resize(Indices.size());
Ayal: May as well use `E` when resizing `Mask`.
ABataev: Oh, yes, missed it, thanks.
		for (unsigned I = 0; I < E; ++I)
		Mask[Indices[I]] = I;
		}
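To make the helper above easier to check by hand, here is a minimal standalone sketch of the same inversion. It swaps LLVM's `ArrayRef`/`SmallVectorImpl` for `std::vector`, and the name `inversePermutationSketch` is hypothetical rather than part of the patch. It assumes `Indices` is a permutation of `0..N-1`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Standalone analogue of the patch's inversePermutation(), written against
// std::vector. Assumes Indices is a permutation of 0..N-1; the result
// satisfies Mask[Indices[I]] == I for every I.
std::vector<unsigned>
inversePermutationSketch(const std::vector<unsigned> &Indices) {
  const std::size_t E = Indices.size();
  std::vector<unsigned> Mask(E);
  for (std::size_t I = 0; I < E; ++I) {
    assert(Indices[I] < E && "index out of range for a permutation");
    Mask[Indices[I]] = static_cast<unsigned>(I);
  }
  return Mask;
}
```

For example, inverting `{3, 0, 1, 2}` gives `{1, 2, 3, 0}`, and inverting that result returns the original order.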

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

if (E->VectorizedValue) {		if (E->VectorizedValue) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

▲ Show 20 Lines • Show All 60 Lines • ▼ Show 20 Lines	case Instruction::PHI: {
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return V;		return V;
}		}

case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
if (canReuseExtract(E->Scalars, VL0)) {		if (!E->NeedToGather) {
Value *V = VL0->getOperand(0);		Value *V = VL0->getOperand(0);
		if (!E->ReorderIndices.empty()) {
		OrdersType Mask;
		inversePermutation(E->ReorderIndices, Mask);
		Builder.SetInsertPoint(VL0);
Ayal: Capture this "inversePermutation" in a method, to be called again below?
		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy), Mask,
		"reorder_shuffle");
		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
		if (!E->ReorderIndices.empty())
Builder.SetInsertPoint(VL0);		Builder.SetInsertPoint(VL0);
Ayal: InsertPoint may be set twice to VL0?
ABataev: Missed it, thanks.
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);
auto *V = Gather(E->Scalars, VecTy);		auto *V = Gather(E->Scalars, VecTy);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
if (auto *I = dyn_cast<Instruction>(V)) {		if (auto *I = dyn_cast<Instruction>(V)) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}
case Instruction::ExtractValue: {		case Instruction::ExtractValue: {
if (canReuseExtract(E->Scalars, VL0)) {		if (!E->NeedToGather) {
LoadInst *LI = cast<LoadInst>(VL0->getOperand(0));		LoadInst *LI = cast<LoadInst>(VL0->getOperand(0));
Builder.SetInsertPoint(LI);		Builder.SetInsertPoint(LI);
PointerType *PtrTy = PointerType::get(VecTy, LI->getPointerAddressSpace());		PointerType *PtrTy = PointerType::get(VecTy, LI->getPointerAddressSpace());
Value *Ptr = Builder.CreateBitCast(LI->getOperand(0), PtrTy);		Value *Ptr = Builder.CreateBitCast(LI->getOperand(0), PtrTy);
LoadInst *V = Builder.CreateAlignedLoad(Ptr, LI->getAlignment());		LoadInst *V = Builder.CreateAlignedLoad(Ptr, LI->getAlignment());
Value *NewV = propagateMetadata(V, E->Scalars);		Value *NewV = propagateMetadata(V, E->Scalars);
		if (!E->ReorderIndices.empty()) {
		OrdersType Mask;
		inversePermutation(E->ReorderIndices, Mask);
		NewV = Builder.CreateShuffleVector(NewV, UndefValue::get(VecTy), Mask,
		"reorder_shuffle");
		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
NewV = Builder.CreateShuffleVector(		NewV = Builder.CreateShuffleVector(
NewV, UndefValue::get(VecTy), E->ReuseShuffleIndices, "shuffle");		NewV, UndefValue::get(VecTy), E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = NewV;		E->VectorizedValue = NewV;
return NewV;		return NewV;
}		}
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);
auto *V = Gather(E->Scalars, VecTy);		auto *V = Gather(E->Scalars, VecTy);
▲ Show 20 Lines • Show All 157 Lines • ▼ Show 20 Lines	case Instruction::Xor: {
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
bool IsReversed =		bool IsReorder = !E->ReorderIndices.empty();
!isConsecutiveAccess(E->Scalars[0], E->Scalars[1], DL, SE);		if (IsReorder)
if (IsReversed)		VL0 = cast<Instruction>(E->Scalars[E->ReorderIndices.front()]);
Ayal: Should we first invert the permutation and then take its front()? Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.
ABataev: There are several tests already that test that this is correct code. SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll etc.
VL0 = cast<Instruction>(E->Scalars.back());
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
Type *ScalarLoadTy = LI->getType();		Type *ScalarLoadTy = LI->getType();
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
Value *V = propagateMetadata(LI, E->Scalars);		Value *V = propagateMetadata(LI, E->Scalars);
if (IsReversed) {		if (IsReorder) {
SmallVector<uint32_t, 4> Mask(E->Scalars.size());		SmallVector<unsigned, 4> Mask(E->Scalars.size());
std::iota(Mask.rbegin(), Mask.rend(), 0);		for (unsigned I = 0, End = E->Scalars.size(); I < End; ++I)
V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);		Mask[E->ReorderIndices[I]] = I;
Ayal: Another inversePermutation?
ABataev: Yup, thanks.
		V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),
		Mask, "reorder_shuffle");
}		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
		// TODO: Merge this shuffle with the ReorderShuffleMask.
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
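As a worked illustration of the Load case above, and of the question about whether `front()` should be taken before or after inverting, the sketch below assumes that `ReorderIndices` lists bundle positions in increasing address order. That is an assumption, not something this diff view states, but it is consistent with `VL0` being taken from `E->Scalars[E->ReorderIndices.front()]` and with the `<1, 2, 3, 0>` mask in external_user_jumbled_load.ll, whose pre-patch CHECK lines place array elements 6, 7, 8, 5 in lanes 0 to 3. `LoadOrderSketch` and `computeLoadOrder` are hypothetical names, and plain integer offsets stand in for the pointer analysis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical container for the three quantities discussed above.
struct LoadOrderSketch {
  std::vector<unsigned> ReorderIndices; // bundle positions sorted by address (assumed meaning)
  unsigned BaseLane = 0;                // bundle position holding the lowest address
  std::vector<unsigned> Mask;           // shuffle mask that restores bundle order
};

// Offsets[I] is the element offset read by the I-th scalar load in the bundle.
LoadOrderSketch computeLoadOrder(const std::vector<long> &Offsets) {
  assert(!Offsets.empty() && "need at least one load");
  LoadOrderSketch R;
  R.ReorderIndices.resize(Offsets.size());
  std::iota(R.ReorderIndices.begin(), R.ReorderIndices.end(), 0u);
  // Sort bundle positions by the address each one reads.
  std::sort(R.ReorderIndices.begin(), R.ReorderIndices.end(),
            [&Offsets](unsigned A, unsigned B) { return Offsets[A] < Offsets[B]; });
  // The vector load starts at the lowest address, i.e. at front().
  R.BaseLane = R.ReorderIndices.front();
  // Invert the order to get the mask applied to the loaded vector.
  R.Mask.resize(Offsets.size());
  for (std::size_t I = 0; I < R.ReorderIndices.size(); ++I)
    R.Mask[R.ReorderIndices[I]] = static_cast<unsigned>(I);
  return R;
}
```

For offsets `{6, 7, 8, 5}` this yields `ReorderIndices = {3, 0, 1, 2}`, a base lane of 3 (the load of element 5), and the mask `{1, 2, 3, 0}`. Under the stated assumption, `front()` of the uninverted order selects the lowest address, whereas `front()` of the inverted mask would select lane 1 (element 7).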
case Instruction::Store: {		case Instruction::Store: {
▲ Show 20 Lines • Show All 1,562 Lines • ▼ Show 20 Lines	for (unsigned I = NextInst; I < MaxInst; ++I) {
if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))		if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))
continue;		continue;

DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");
ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);		ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);

R.buildTree(Ops);		R.buildTree(Ops);
		Optional<ArrayRef<unsigned>> Order = R.bestOrder();
// TODO: check if we can allow reordering for more cases.		// TODO: check if we can allow reordering for more cases.
if (AllowReorder && R.shouldReorder()) {		if (AllowReorder && Order) {
// Conceptually, there is nothing actually preventing us from trying to		// Conceptually, there is nothing actually preventing us from trying to
// reorder a larger list. In fact, we do exactly this when vectorizing		// reorder a larger list. In fact, we do exactly this when vectorizing
// reductions. However, at this point, we only expect to get here when		// reductions. However, at this point, we only expect to get here when
// there are exactly two operations.		// there are exactly two operations.
assert(Ops.size() == 2);		assert(Ops.size() == 2);
Ayal: If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0]. Provide the generalization below to any permutation when the assert can be dropped, i.e., in a separate patch which handles this TODO?
Value *ReorderedOps[] = {Ops[1], Ops[0]};		Value *ReorderedOps[] = {Ops[1], Ops[0]};
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, None);
Ayal: Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences).
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
int Cost = R.getTreeCost() - UserCost;		int Cost = R.getTreeCost() - UserCost;
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);
▲ Show 20 Lines • Show All 722 Lines • ▼ Show 20 Lines	bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
for (auto &Pair : ExtraArgs)		for (auto &Pair : ExtraArgs)
ExternallyUsedValues[Pair.second].push_back(Pair.first);		ExternallyUsedValues[Pair.second].push_back(Pair.first);
SmallVector<Value *, 16> IgnoreList;		SmallVector<Value *, 16> IgnoreList;
for (auto &V : ReductionOps)		for (auto &V : ReductionOps)
IgnoreList.append(V.begin(), V.end());		IgnoreList.append(V.begin(), V.end());
while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {		while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);		auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
V.buildTree(VL, ExternallyUsedValues, IgnoreList);		V.buildTree(VL, ExternallyUsedValues, IgnoreList);
if (V.shouldReorder()) {		Optional<ArrayRef<unsigned>> Order = V.bestOrder();
SmallVector<Value *, 8> Reversed(VL.rbegin(), VL.rend());		if (Order) {
V.buildTree(Reversed, ExternallyUsedValues, IgnoreList);		SmallVector<Value *, 4> ReorderedOps;
		ReorderedOps.reserve(VL.size());
Ayal: Fold by feeding the constructor with the size.
ABataev: Reworked it, thanks.
		for (const unsigned Idx : *Order)
		ReorderedOps.emplace_back(VL[Idx]);
		V.buildTree(ReorderedOps, ExternallyUsedValues, IgnoreList);
}		}
if (V.isTreeTinyAndNotFullyVectorizable())		if (V.isTreeTinyAndNotFullyVectorizable())
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int Cost =		int Cost =
▲ Show 20 Lines • Show All 710 Lines • Show Last 20 Lines
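Several cases in `vectorizeTree` above carry the note "TODO: Merge this shuffle with the ReorderShuffleMask", because the reorder shuffle and the reuse shuffle are emitted back to back. The sketch below shows one way two single-source masks could be folded into one. It is not code from the patch: `composeMasksSketch` is a hypothetical name, undef lanes are not modelled, and the reuse mask in the example is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// If V1 = shuffle(V, ReorderMask) and V2 = shuffle(V1, ReuseMask), then
// V2[I] = V1[ReuseMask[I]] = V[ReorderMask[ReuseMask[I]]], so a single
// shuffle of V with the merged mask produces the same lanes.
std::vector<unsigned>
composeMasksSketch(const std::vector<unsigned> &ReorderMask,
                   const std::vector<unsigned> &ReuseMask) {
  std::vector<unsigned> Merged(ReuseMask.size());
  for (std::size_t I = 0; I < ReuseMask.size(); ++I) {
    assert(ReuseMask[I] < ReorderMask.size() && "reuse index out of range");
    Merged[I] = ReorderMask[ReuseMask[I]];
  }
  return Merged;
}
```

With a reorder mask of `{1, 2, 3, 0}` and a hypothetical reuse mask of `{0, 0, 1, 3}`, the merged mask is `{1, 1, 2, 0}`.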

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s

	@array = external global [20 x [13 x i32]]			@array = external global [20 x [13 x i32]]

	define void @hoge(i64 %idx, <4 x i32>* %sink) {			define void @hoge(i64 %idx, <4 x i32>* %sink) {
	; CHECK-LABEL: @hoge(			; CHECK-LABEL: @hoge(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX:%.*]], i64 5			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX:%.*]], i64 5
	; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 6			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 6
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 7			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 7
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 8			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 [[IDX]], i64 8
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP1]] to <2 x i32>*			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*
	; CHECK-NEXT: [[TMP5:%.*]] = load <2 x i32>, <2 x i32>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP5]], i32 0			; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP5]], i32 1			; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 1
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
	; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[TMP3]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 2
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP10]], i32 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP10]], i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[TMP0]], align 4			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 3
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP12]], i32 3			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP12]], i32 3
	; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[SINK:%.*]]			; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[SINK:%.*]]
Ayal: This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.
ABataev: Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this).
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%0 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 5			%0 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 5
	%1 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 6			%1 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 6
	%2 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 7			%2 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 7
	%3 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 8			%3 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 8
	%4 = load i32, i32* %1, align 4			%4 = load i32, i32* %1, align 4
	Show All 11 Lines

test/Transforms/SLPVectorizer/X86/extract.ll

Show All 24 Lines	entry:
store double %A1, double* %P1, align 4		store double %A1, double* %P1, align 4
ret void		ret void
}		}

define void @fextr1(double* %ptr) {		define void @fextr1(double* %ptr) {
; CHECK-LABEL: @fextr1(		; CHECK-LABEL: @fextr1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef		; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef
; CHECK-NEXT: [[V0:%.*]] = extractelement <2 x double> [[LD]], i32 0		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <2 x double> [[LD]], <2 x double> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[V1:%.*]] = extractelement <2 x double> [[LD]], i32 1
; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0		; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0
; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[V1]], i32 0		; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> <double 3.400000e+00, double 1.200000e+00>, [[REORDER_SHUFFLE]]
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V0]], i32 1		; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: [[TMP2:%.*]] = fadd <2 x double> <double 3.400000e+00, double 1.200000e+00>, [[TMP1]]		; CHECK-NEXT: store <2 x double> [[TMP0]], <2 x double>* [[TMP1]], align 4
; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP2]], <2 x double>* [[TMP3]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%LD = load <2 x double>, <2 x double>* undef		%LD = load <2 x double>, <2 x double>* undef
%V0 = extractelement <2 x double> %LD, i32 0		%V0 = extractelement <2 x double> %LD, i32 0
%V1 = extractelement <2 x double> %LD, i32 1		%V1 = extractelement <2 x double> %LD, i32 1
%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order		%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order
%P1 = getelementptr inbounds double, double* %ptr, i64 0		%P1 = getelementptr inbounds double, double* %ptr, i64 0
Show All 34 Lines

test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 \| FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 \| FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* bitcast (i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1) to <2 x i32>*), align 4			; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[REORDER_SHUFFLE]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP2]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP6]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt <4 x i32> [[TMP8]], zeroinitializer			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 8, i32 3
	; CHECK-NEXT: [[TMP13:%.*]] = select <4 x i1> [[TMP9]], <4 x i32> [[TMP12]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4
	Show All 13 Lines

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll

Show All 15 Lines	;void jumble (int * restrict A, int * restrict B) {

; Function Attrs: norecurse nounwind uwtable		; Function Attrs: norecurse nounwind uwtable
define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {		define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
; CHECK-LABEL: @jumble1(		; CHECK-LABEL: @jumble1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[A]] to <2 x i32>*
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX6]], align 4
; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13		; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2		; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1		; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP1]], [[TMP4]]
; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP2]], i32 2
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP11]]
; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[B]] to <4 x i32>*		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%arrayidx = getelementptr inbounds i32, i32* %A, i64 10		%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%1 = load i32, i32* %A, align 4		%1 = load i32, i32* %A, align 4
%mul = mul nsw i32 %0, %1		%mul = mul nsw i32 %0, %1
%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11		%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
Show All 24 Lines
;Reversing the operand of MUL		;Reversing the operand of MUL
; Function Attrs: norecurse nounwind uwtable		; Function Attrs: norecurse nounwind uwtable
define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {		define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
; CHECK-LABEL: @jumble2(		; CHECK-LABEL: @jumble2(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[A]] to <2 x i32>*
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX6]], align 4
; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13		; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2		; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1		; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP1]]
; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 1
; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP2]], i32 2
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = mul nsw <4 x i32> [[TMP11]], [[TMP4]]
; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[B]] to <4 x i32>*		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%arrayidx = getelementptr inbounds i32, i32* %A, i64 10		%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%1 = load i32, i32* %A, align 4		%1 = load i32, i32* %A, align 4
%mul = mul nsw i32 %1, %0		%mul = mul nsw i32 %1, %0
%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11		%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
[24 lines elided]

test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll

	[42 lines elided]
	; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50			; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50
	; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75			; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
	; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2			; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
	; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3			; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP34:%.*]], <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP27:%.*]], <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
	; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP34]], [[FOR_INC]] ]			; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP27]], [[FOR_INC]] ]
	; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]			; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
	; CHECK: if.then:			; CHECK: if.then:
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]			; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]			; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3			; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
	[34 lines elided]
	; CHECK: if.else43:			; CHECK: if.else43:
	; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4			; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4
	; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0			; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0
	; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]			; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]
	; CHECK: if.then46:			; CHECK: if.then46:
	; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]			; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]
	; CHECK-NEXT: [[TMP22:%.*]] = bitcast i32* [[ARRAYIDX49]] to <2 x i32>*			; CHECK-NEXT: [[TMP22:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
	; CHECK-NEXT: [[TMP23:%.*]] = load <2 x i32>, <2 x i32>* [[TMP22]], align 4			; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP22]]
	; CHECK-NEXT: [[TMP24:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3			; CHECK-NEXT: [[TMP23:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP24]]			; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP23]]
	; CHECK-NEXT: [[TMP25:%.*]] = load i32, i32* [[ARRAYIDX55]], align 4			; CHECK-NEXT: [[TMP24:%.*]] = bitcast i32* [[ARRAYIDX49]] to <4 x i32>*
	; CHECK-NEXT: [[TMP26:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP25:%.*]] = load <4 x i32>, <4 x i32>* [[TMP24]], align 4
	; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP26]]			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <4 x i32> [[TMP25]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
	; CHECK-NEXT: [[TMP27:%.*]] = load i32, i32* [[ARRAYIDX58]], align 4
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <2 x i32> [[TMP23]], i32 0
	; CHECK-NEXT: [[TMP29:%.*]] = insertelement <4 x i32> undef, i32 [[TMP28]], i32 0
	; CHECK-NEXT: [[TMP30:%.*]] = extractelement <2 x i32> [[TMP23]], i32 1
	; CHECK-NEXT: [[TMP31:%.*]] = insertelement <4 x i32> [[TMP29]], i32 [[TMP30]], i32 1
	; CHECK-NEXT: [[TMP32:%.*]] = insertelement <4 x i32> [[TMP31]], i32 [[TMP25]], i32 2
	; CHECK-NEXT: [[TMP33:%.*]] = insertelement <4 x i32> [[TMP32]], i32 [[TMP27]], i32 3
	; CHECK-NEXT: br label [[FOR_INC]]			; CHECK-NEXT: br label [[FOR_INC]]
	; CHECK: for.inc:			; CHECK: for.inc:
	; CHECK-NEXT: [[TMP34]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP33]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]			; CHECK-NEXT: [[TMP27]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP26]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	;			;
	entry:			entry:
	%0 = load i32, i32* %A, align 4			%0 = load i32, i32* %A, align 4
	%cmp1 = icmp eq i32 %0, 0			%cmp1 = icmp eq i32 %0, 0
	%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25			%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25
	[102 lines elided]

test/Transforms/SLPVectorizer/X86/jumbled-load.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s		; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {		define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load(		; CHECK-LABEL: @jumbled-load(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0		; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]		; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]		; CHECK-NEXT: [[REORDER_SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]		; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[REORDER_SHUFFLE1]]
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4		; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4
[22 lines elided]

ret i32 undef		ret i32 undef
}		}
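
The two `REORDER_SHUFFLE` masks checked in `@jumbled-load` can be read off from the element offsets of the scalar loads that feed each lane. The sketch below only illustrates that relationship and is not the patch's implementation; the helper name `reorder_mask` and its list-of-offsets input are hypothetical.

```python
def reorder_mask(lane_offsets):
    """Return a shufflevector-style mask for loads used in a jumbled order.

    lane_offsets[i] is the element offset (from a common base pointer) of the
    scalar load that feeds lane i of the vectorized operation.

    Returns None when the offsets do not form one consecutive run, because a
    single wide load followed by a shuffle could not replace the scalar loads.
    (Hypothetical helper for illustration only.)
    """
    if not lane_offsets:
        return None
    base = min(lane_offsets)
    # Position of each lane's element inside the wide load starting at `base`.
    mask = [off - base for off in lane_offsets]
    # The loads must cover base, base+1, ..., base+N-1 exactly once each.
    if sorted(mask) != list(range(len(mask))):
        return None
    return mask


# First mul operands in @jumbled-load read %in at offsets 1, 3, 2, 0.
print(reorder_mask([1, 3, 2, 0]))
# Second mul operands read %inn at offsets 0, 1, 3, 2.
print(reorder_mask([0, 1, 3, 2]))
```

Per the scalar CHECK lines on the left, `MUL_1` through `MUL_4` use `LOAD_3`, `LOAD_2`, `LOAD_4`, `LOAD_1` from `%in` (offsets 1, 3, 2, 0) and `LOAD_5`, `LOAD_8`, `LOAD_7`, `LOAD_6` from `%inn` (offsets 0, 1, 3, 2), which matches the masks `<1, 3, 2, 0>` and `<0, 1, 3, 2>` on the right.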


define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {		define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load-multiuses(		; CHECK-LABEL: @jumbled-load-multiuses(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_4]]		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_2]]		; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_1]]		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 2
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_3]]		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 1
		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1
		; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 3
		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2
		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[REORDER_SHUFFLE]], i32 0
		; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3
		; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[TMP10]]
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4		; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP11]], <4 x i32>* [[TMP12]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4
[17 lines elided]

test/Transforms/SLPVectorizer/X86/reassociated-loads.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -reassociate -slp-vectorizer -slp-vectorize-hor -slp-vectorize-hor-store -S < %s -mtriple=x86_64-apple-macosx -mcpu=corei7-avx -mattr=+avx2 | FileCheck %s			; RUN: opt -reassociate -slp-vectorizer -slp-vectorize-hor -slp-vectorize-hor-store -S < %s -mtriple=x86_64-apple-macosx -mcpu=corei7-avx -mattr=+avx2 | FileCheck %s

	define signext i8 @Foo(<32 x i8>* %__v) {			define signext i8 @Foo(<32 x i8>* %__v) {
	; CHECK-LABEL: @Foo(			; CHECK-LABEL: @Foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <32 x i8>, <32 x i8>* [[__V:%.*]], align 32			; CHECK-NEXT: [[TMP0:%.*]] = load <32 x i8>, <32 x i8>* [[__V:%.*]], align 32
	; CHECK-NEXT: [[VECEXT_I_I_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 0			; CHECK-NEXT: [[ADD_I_1_I:%.*]] = add i8 undef, undef
	; CHECK-NEXT: [[VECEXT_I_I_1_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 1			; CHECK-NEXT: [[ADD_I_2_I:%.*]] = add i8 [[ADD_I_1_I]], undef
	; CHECK-NEXT: [[ADD_I_1_I:%.*]] = add i8 [[VECEXT_I_I_1_I]], [[VECEXT_I_I_I]]			; CHECK-NEXT: [[ADD_I_3_I:%.*]] = add i8 [[ADD_I_2_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_2_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 2			; CHECK-NEXT: [[ADD_I_4_I:%.*]] = add i8 [[ADD_I_3_I]], undef
	; CHECK-NEXT: [[ADD_I_2_I:%.*]] = add i8 [[ADD_I_1_I]], [[VECEXT_I_I_2_I]]			; CHECK-NEXT: [[ADD_I_5_I:%.*]] = add i8 [[ADD_I_4_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_3_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 3			; CHECK-NEXT: [[ADD_I_6_I:%.*]] = add i8 [[ADD_I_5_I]], undef
	; CHECK-NEXT: [[ADD_I_3_I:%.*]] = add i8 [[ADD_I_2_I]], [[VECEXT_I_I_3_I]]			; CHECK-NEXT: [[ADD_I_7_I:%.*]] = add i8 [[ADD_I_6_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_4_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 4			; CHECK-NEXT: [[ADD_I_8_I:%.*]] = add i8 [[ADD_I_7_I]], undef
	; CHECK-NEXT: [[ADD_I_4_I:%.*]] = add i8 [[ADD_I_3_I]], [[VECEXT_I_I_4_I]]			; CHECK-NEXT: [[ADD_I_9_I:%.*]] = add i8 [[ADD_I_8_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_5_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 5			; CHECK-NEXT: [[ADD_I_10_I:%.*]] = add i8 [[ADD_I_9_I]], undef
	; CHECK-NEXT: [[ADD_I_5_I:%.*]] = add i8 [[ADD_I_4_I]], [[VECEXT_I_I_5_I]]			; CHECK-NEXT: [[ADD_I_11_I:%.*]] = add i8 [[ADD_I_10_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_6_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 6			; CHECK-NEXT: [[ADD_I_12_I:%.*]] = add i8 [[ADD_I_11_I]], undef
	; CHECK-NEXT: [[ADD_I_6_I:%.*]] = add i8 [[ADD_I_5_I]], [[VECEXT_I_I_6_I]]			; CHECK-NEXT: [[ADD_I_13_I:%.*]] = add i8 [[ADD_I_12_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_7_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 7			; CHECK-NEXT: [[ADD_I_14_I:%.*]] = add i8 [[ADD_I_13_I]], undef
	; CHECK-NEXT: [[ADD_I_7_I:%.*]] = add i8 [[ADD_I_6_I]], [[VECEXT_I_I_7_I]]			; CHECK-NEXT: [[ADD_I_15_I:%.*]] = add i8 [[ADD_I_14_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_8_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 8			; CHECK-NEXT: [[ADD_I_16_I:%.*]] = add i8 [[ADD_I_15_I]], undef
	; CHECK-NEXT: [[ADD_I_8_I:%.*]] = add i8 [[ADD_I_7_I]], [[VECEXT_I_I_8_I]]			; CHECK-NEXT: [[ADD_I_17_I:%.*]] = add i8 [[ADD_I_16_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_9_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 9			; CHECK-NEXT: [[ADD_I_18_I:%.*]] = add i8 [[ADD_I_17_I]], undef
	; CHECK-NEXT: [[ADD_I_9_I:%.*]] = add i8 [[ADD_I_8_I]], [[VECEXT_I_I_9_I]]			; CHECK-NEXT: [[ADD_I_19_I:%.*]] = add i8 [[ADD_I_18_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_10_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 10			; CHECK-NEXT: [[ADD_I_20_I:%.*]] = add i8 [[ADD_I_19_I]], undef
	; CHECK-NEXT: [[ADD_I_10_I:%.*]] = add i8 [[ADD_I_9_I]], [[VECEXT_I_I_10_I]]			; CHECK-NEXT: [[ADD_I_21_I:%.*]] = add i8 [[ADD_I_20_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_11_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 11			; CHECK-NEXT: [[ADD_I_22_I:%.*]] = add i8 [[ADD_I_21_I]], undef
	; CHECK-NEXT: [[ADD_I_11_I:%.*]] = add i8 [[ADD_I_10_I]], [[VECEXT_I_I_11_I]]			; CHECK-NEXT: [[ADD_I_23_I:%.*]] = add i8 [[ADD_I_22_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_12_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 12			; CHECK-NEXT: [[ADD_I_24_I:%.*]] = add i8 [[ADD_I_23_I]], undef
	; CHECK-NEXT: [[ADD_I_12_I:%.*]] = add i8 [[ADD_I_11_I]], [[VECEXT_I_I_12_I]]			; CHECK-NEXT: [[ADD_I_25_I:%.*]] = add i8 [[ADD_I_24_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_13_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 13			; CHECK-NEXT: [[ADD_I_26_I:%.*]] = add i8 [[ADD_I_25_I]], undef
	; CHECK-NEXT: [[ADD_I_13_I:%.*]] = add i8 [[ADD_I_12_I]], [[VECEXT_I_I_13_I]]			; CHECK-NEXT: [[ADD_I_27_I:%.*]] = add i8 [[ADD_I_26_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_14_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 14			; CHECK-NEXT: [[ADD_I_28_I:%.*]] = add i8 [[ADD_I_27_I]], undef
	; CHECK-NEXT: [[ADD_I_14_I:%.*]] = add i8 [[ADD_I_13_I]], [[VECEXT_I_I_14_I]]			; CHECK-NEXT: [[ADD_I_29_I:%.*]] = add i8 [[ADD_I_28_I]], undef
	; CHECK-NEXT: [[VECEXT_I_I_15_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 15			; CHECK-NEXT: [[ADD_I_30_I:%.*]] = add i8 [[ADD_I_29_I]], undef
	; CHECK-NEXT: [[ADD_I_15_I:%.*]] = add i8 [[ADD_I_14_I]], [[VECEXT_I_I_15_I]]			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <32 x i8> [[TMP0]], <32 x i8> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_16_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 16			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <32 x i8> [[TMP0]], [[RDX_SHUF]]
	; CHECK-NEXT: [[ADD_I_16_I:%.*]] = add i8 [[ADD_I_15_I]], [[VECEXT_I_I_16_I]]			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <32 x i8> [[BIN_RDX]], <32 x i8> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_17_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 17			; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <32 x i8> [[BIN_RDX]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[ADD_I_17_I:%.*]] = add i8 [[ADD_I_16_I]], [[VECEXT_I_I_17_I]]			; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <32 x i8> [[BIN_RDX2]], <32 x i8> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_18_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 18			; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <32 x i8> [[BIN_RDX2]], [[RDX_SHUF3]]
	; CHECK-NEXT: [[ADD_I_18_I:%.*]] = add i8 [[ADD_I_17_I]], [[VECEXT_I_I_18_I]]			; CHECK-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <32 x i8> [[BIN_RDX4]], <32 x i8> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_19_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 19			; CHECK-NEXT: [[BIN_RDX6:%.*]] = add <32 x i8> [[BIN_RDX4]], [[RDX_SHUF5]]
	; CHECK-NEXT: [[ADD_I_19_I:%.*]] = add i8 [[ADD_I_18_I]], [[VECEXT_I_I_19_I]]			; CHECK-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <32 x i8> [[BIN_RDX6]], <32 x i8> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[VECEXT_I_I_20_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 20			; CHECK-NEXT: [[BIN_RDX8:%.*]] = add <32 x i8> [[BIN_RDX6]], [[RDX_SHUF7]]
	; CHECK-NEXT: [[ADD_I_20_I:%.*]] = add i8 [[ADD_I_19_I]], [[VECEXT_I_I_20_I]]			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <32 x i8> [[BIN_RDX8]], i32 0
	; CHECK-NEXT: [[VECEXT_I_I_21_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 21			; CHECK-NEXT: [[ADD_I_31_I:%.*]] = add i8 [[ADD_I_30_I]], undef
	; CHECK-NEXT: [[ADD_I_21_I:%.*]] = add i8 [[ADD_I_20_I]], [[VECEXT_I_I_21_I]]			; CHECK-NEXT: ret i8 [[TMP1]]
	; CHECK-NEXT: [[VECEXT_I_I_22_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 22
	; CHECK-NEXT: [[ADD_I_22_I:%.*]] = add i8 [[ADD_I_21_I]], [[VECEXT_I_I_22_I]]
	; CHECK-NEXT: [[VECEXT_I_I_23_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 23
	; CHECK-NEXT: [[ADD_I_23_I:%.*]] = add i8 [[ADD_I_22_I]], [[VECEXT_I_I_23_I]]
	; CHECK-NEXT: [[VECEXT_I_I_24_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 24
	; CHECK-NEXT: [[ADD_I_24_I:%.*]] = add i8 [[ADD_I_23_I]], [[VECEXT_I_I_24_I]]
	; CHECK-NEXT: [[VECEXT_I_I_25_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 25
	; CHECK-NEXT: [[ADD_I_25_I:%.*]] = add i8 [[ADD_I_24_I]], [[VECEXT_I_I_25_I]]
	; CHECK-NEXT: [[VECEXT_I_I_26_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 26
	; CHECK-NEXT: [[ADD_I_26_I:%.*]] = add i8 [[ADD_I_25_I]], [[VECEXT_I_I_26_I]]
	; CHECK-NEXT: [[VECEXT_I_I_27_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 27
	; CHECK-NEXT: [[ADD_I_27_I:%.*]] = add i8 [[ADD_I_26_I]], [[VECEXT_I_I_27_I]]
	; CHECK-NEXT: [[VECEXT_I_I_28_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 28
	; CHECK-NEXT: [[ADD_I_28_I:%.*]] = add i8 [[ADD_I_27_I]], [[VECEXT_I_I_28_I]]
	; CHECK-NEXT: [[VECEXT_I_I_29_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 29
	; CHECK-NEXT: [[ADD_I_29_I:%.*]] = add i8 [[ADD_I_28_I]], [[VECEXT_I_I_29_I]]
	; CHECK-NEXT: [[VECEXT_I_I_30_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 30
	; CHECK-NEXT: [[ADD_I_30_I:%.*]] = add i8 [[ADD_I_29_I]], [[VECEXT_I_I_30_I]]
	; CHECK-NEXT: [[VECEXT_I_I_31_I:%.*]] = extractelement <32 x i8> [[TMP0]], i64 31
	; CHECK-NEXT: [[ADD_I_31_I:%.*]] = add i8 [[ADD_I_30_I]], [[VECEXT_I_I_31_I]]
	; CHECK-NEXT: ret i8 [[ADD_I_31_I]]
	;			;
	entry:			entry:
	%0 = load <32 x i8>, <32 x i8>* %__v, align 32			%0 = load <32 x i8>, <32 x i8>* %__v, align 32
	%vecext.i.i.i = extractelement <32 x i8> %0, i64 0			%vecext.i.i.i = extractelement <32 x i8> %0, i64 0
	%vecext.i.i.1.i = extractelement <32 x i8> %0, i64 1			%vecext.i.i.1.i = extractelement <32 x i8> %0, i64 1
	%add.i.1.i = add i8 %vecext.i.i.1.i, %vecext.i.i.i			%add.i.1.i = add i8 %vecext.i.i.1.i, %vecext.i.i.i
	%vecext.i.i.2.i = extractelement <32 x i8> %0, i64 2			%vecext.i.i.2.i = extractelement <32 x i8> %0, i64 2
	%add.i.2.i = add i8 %vecext.i.i.2.i, %add.i.1.i			%add.i.2.i = add i8 %vecext.i.i.2.i, %add.i.1.i
	[60 lines elided]

test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[REORDER_SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[REORDER_SHUFFLE]], [[REORDER_SHUFFLE1]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4