This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix for a bug in jumbled memory access vectorization introduced in r293386
ClosedPublic

Authored by ashahid on Feb 19 2017, 10:24 PM.

Details

Reviewers

Commits

rGbdac9f30c0bf: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to…
rG175ffa8c35c7: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to…
rL296863: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to…
rL296575: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to…

Summary

The bug is caused by the absence of the in-order uses of the scalars, which need to be available to the VectorizeTree() API. This API uses them to compute the mask used in the "shufflevector" IR.

The fix is to compute the mask for out-of-order memory accesses while building the vectorizable tree, rather than during the actual vectorization of the tree.

Diff Detail

Event Timeline

ashahid created this revision. · Feb 19 2017, 10:24 PM

Herald added a subscriber: mzolotukhin. · Feb 19 2017, 10:24 PM

This is getting a bit hairy, I think - we have too many valuelists flying around.

How about something like this:
When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time in line 2623 - 2634), by sorting pairs of <Value, Index>.
We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.

What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.
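
To make the suggestion concrete, here is a minimal sketch in standard C++ rather than with the LLVM types. It assumes each access is represented only by its constant byte offset, and the name `sortedOrder` is hypothetical. A single sort of <offset, index> pairs orders the accesses, and the index carried along with each pair records where that access sat in the original list, so no second O(N^2) pass is needed to relate the two orders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not the LLVM API. Offsets[I] is the byte offset of the
// I-th access in use order. The result lists the use-order indices arranged
// by increasing offset.
std::vector<unsigned> sortedOrder(const std::vector<int64_t> &Offsets) {
  std::vector<std::pair<int64_t, unsigned>> OffsetIndex;
  OffsetIndex.reserve(Offsets.size());
  for (unsigned I = 0; I < Offsets.size(); ++I)
    OffsetIndex.emplace_back(Offsets[I], I);

  // One O(N log N) sort orders the accesses; the index rides along.
  std::sort(OffsetIndex.begin(), OffsetIndex.end());

  std::vector<unsigned> Order;
  Order.reserve(OffsetIndex.size());
  for (const auto &P : OffsetIndex)
    Order.push_back(P.second);
  assert(Order.size() == Offsets.size());
  return Order;
}
```

For accesses used in the order of offsets 8, 0, 12, 4, the result is {1, 3, 0, 2}. Whether this permutation or its inverse is the operand that "shufflevector" needs depends on which direction the lanes are being mapped.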

In D30159#681835, @mkuper wrote:

> This is getting a bit hairy, I think - we have too many valuelists flying around.
>
> How about something like this:
> When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time in line 2623 - 2634), by sorting pairs of <Value, Index>.

The reason for keeping the mask computation in its current place was that we wanted to compute the mask (the N^2 loop) only if vectorization is found to be beneficial.

OTOH, if we compute the mask while sorting the memory accesses, the utility of the SortMemAccesses API may be reduced. However, this is not a big problem and can be worked around by keeping a bit that decides whether to compute the mask or not.

> We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.
>
> What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.

Although I am not an expert, I think keeping another ValueList for unordered scalars is not a bad design. However, I am open to other views.

Regards,
Shahid

In D30159#681901, @ashahid wrote:

> In D30159#681835, @mkuper wrote:
>
>> This is getting a bit hairy, I think - we have too many valuelists flying around.
>>
>> How about something like this:
>> When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time in line 2623 - 2634), by sorting pairs of <Value, Index>.
>
> The reason for keeping the mask computation in its current place was that we wanted to compute the mask (the N^2 loop) only if vectorization is found to be beneficial.

Yes, but we can avoid the O(N^2) computation altogether, right? I mean, doing that doesn't add any cost to the sort. Or am I missing something?

> OTOH, if we compute the mask while sorting the memory accesses, the utility of the SortMemAccesses API may be reduced. However, this is not a big problem and can be worked around by keeping a bit that decides whether to compute the mask or not.

Don't we sort before we know whether we're going to need the mask?

>> We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.
>>
>> What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.
>
> Although I am not an expert, I think keeping another ValueList for unordered scalars is not a bad design. However, I am open to other views.

It seems odd to have both the sorted and the unsorted list. The other thing that bothers me is that we have some invariants which don't seem to be well-defined. Is the guarantee that "Scalars" is ordered to have the "best" order to allow vectorization, while InOrdScalars is ordered according to the original IR order? Also, preserving "program order" is kind of meaningless. The fact we care about order at all is just an artifact of the way we build the tree. What we really care about is mapping each scalar leaf to the vector lane it eventually ends up in, which is precisely the shuffle mask.

Anyway, I'm also open to other opinions, if more people care about this.
What I'm really trying to do here is to make the code make a bit more sense - right now, it makes almost zero sense to me. (This isn't the fault of your patch, I'm talking about the general state of the SLP vectorizer).
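
As an illustration of why a stored permutation can stand in for a second value list, the sketch below rebuilds the use-order list from the sorted one. It is written in standard C++ with strings in place of `Value *`; the name `restoreUseOrder` and the convention that `Order[K]` holds the original position of the K-th sorted scalar are assumptions made for this example, not taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper. Order[K] is assumed to be the original (use-order)
// position of the K-th element of Sorted, so the original list can be
// rebuilt in O(N) without keeping a second copy of it around.
std::vector<std::string> restoreUseOrder(const std::vector<std::string> &Sorted,
                                         const std::vector<unsigned> &Order) {
  assert(Sorted.size() == Order.size() && "one permutation entry per scalar");
  std::vector<std::string> InOrder(Sorted.size());
  for (unsigned K = 0; K < Sorted.size(); ++K)
    InOrder[Order[K]] = Sorted[K];
  return InOrder;
}
```

With Sorted = {"a0", "a1", "a2", "a3"} and Order = {1, 3, 0, 2}, the result is {"a2", "a0", "a3", "a1"}.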

Thanks Michael, my bad, I completely missed that we can avoid O(N^2) mask computation altogether. Moreover, we don't need to have extra VL and NeedToShuffle.
I will update the patch accordingly.

Updated the patch to incorporate the review comments.

Herald added a subscriber: sanjoy. · Feb 23 2017, 9:30 PM

mkuper added inline comments. · Feb 24 2017, 6:44 PM

lib/Analysis/LoopAccessAnalysis.cpp
1043–1044	The API seems a bit odd - having to pass a SmallVector when you know you don't want it to be filled doesn't look right. I'd prefer something like passing in a pointer, and filling the vector in if it's not null (or passing in an Optional<> reference to a similar effect.)
1080–1088	I was thinking along the lines of directly sorting an array of tuples <Offset, Value*, Index> - and then you can pull the mask out of it. But that may turn out uglier than what you have here. Your call.
1091–1103	Redundant braces.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2628	This seems unrelated. Should it have been here before as well?
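
The API shape asked for in the comment on lines 1043–1044 can be sketched as follows. This is a standard C++ stand-in, not the LLVM signature: `sortValues` and its parameters are hypothetical, and `std::vector` replaces `SmallVectorImpl`. The point is only that the optional output is a pointer defaulting to null, so callers who don't want it pass nothing.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an API with an optional output. Sorted is always
// filled; Order is filled only when the caller passes a non-null pointer.
void sortValues(const std::vector<int> &In, std::vector<int> &Sorted,
                std::vector<unsigned> *Order = nullptr) {
  // Sort indices instead of values so the permutation is available if asked for.
  std::vector<unsigned> Idx(In.size());
  std::iota(Idx.begin(), Idx.end(), 0u);
  std::sort(Idx.begin(), Idx.end(),
            [&In](unsigned L, unsigned R) { return In[L] < In[R]; });

  Sorted.clear();
  for (unsigned I : Idx)
    Sorted.push_back(In[I]);
  assert(Sorted.size() == In.size());

  if (Order)
    Order->assign(Idx.begin(), Idx.end());
}
```

A caller that only needs the sorted values writes `sortValues(In, Sorted)`; one that also needs the permutation writes `sortValues(In, Sorted, &Order)`.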

ashahid added inline comments. · Feb 27 2017, 10:36 PM

lib/Analysis/LoopAccessAnalysis.cpp
1043–1044	Ok
1080–1088	With a tuple, I think we still need two std::sort() calls. I am opting for the simpler over the uglier.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2628	Earlier, this path was not tested properly, but thanks to this bug we found it. In this particular case the vectorized value is taken from the tree entry data instead of the return value, so if we don't update E->VectorizedValue, the "shufflevector" IR will not be used; the vectorized load IR will be used instead, which is not the intention.

Updated the patch to incorporate review comments.

LGTM, with some style comments inline.

lib/Analysis/LoopAccessAnalysis.cpp
1080–1088	Ok. I think we can do with 1, but I'm still not sure it's an improvement, so we can keep this as is.
1080–1088	IIRC, you can do something like "SmallVector<unsigned, 4> UseOrder(VL.size())", and then "UseOrder[i] = i". This is slightly shorter, and maybe a bit prettier, but feel free to leave as is if you prefer it.
1095	I think I would group everything that has to do with Mask together - something like:

	if (Mask) {
	  Mask->reserve(VL.size());
	  for (unsigned i = 0; i < VL.size(); i++)
	    Mask->emplace_back(i);
	  std::sort(Mask->begin(), Mask->end(),
	            [&UseOrder](unsigned Left, unsigned Right) {
	              return UseOrder[Left] < UseOrder[Right];
	            });
	}

	Instead of having three "if(Mask)"-s. Sure, you get another for loop, but I think it's cleaner this way.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2628	Right, that's more or less what I thought.
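
A side note on the Mask computation suggested for line 1095, as a sketch in standard C++ with hypothetical function names. Sorting the indices 0..N-1 by `UseOrder` yields the inverse permutation of `UseOrder`, so, assuming `UseOrder` really is a permutation of 0..N-1, the same result can also be obtained with a single linear pass. Nothing in the review says the patch does it this way; the second function is only an alternative for comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Mask computed as in the review suggestion: indices 0..N-1 sorted by UseOrder.
std::vector<unsigned> maskBySort(const std::vector<unsigned> &UseOrder) {
  std::vector<unsigned> Mask(UseOrder.size());
  std::iota(Mask.begin(), Mask.end(), 0u);
  std::sort(Mask.begin(), Mask.end(),
            [&UseOrder](unsigned Left, unsigned Right) {
              return UseOrder[Left] < UseOrder[Right];
            });
  return Mask;
}

// Same result by direct inversion, assuming UseOrder is a permutation of
// 0..N-1: the element holding value V must end up at position V.
std::vector<unsigned> maskByInversion(const std::vector<unsigned> &UseOrder) {
  std::vector<unsigned> Mask(UseOrder.size());
  for (unsigned I = 0; I < UseOrder.size(); ++I) {
    assert(UseOrder[I] < UseOrder.size() && "UseOrder must be a permutation");
    Mask[UseOrder[I]] = I;
  }
  return Mask;
}
```

For UseOrder = {1, 3, 0, 2}, both functions return {2, 0, 3, 1}.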

This revision is now accepted and ready to land. · Feb 27 2017, 11:57 PM

ashahid added inline comments. · Feb 28 2017, 2:33 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2628	Oh I missed one thing, I think you added a test jumbled-same.ll for a recent fix. I think this test needs to be updated now to account for the use of shuffled value of the loaded vector in "extractelement"?

mkuper added inline comments. · Feb 28 2017, 8:55 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2628	Argh, yes, I didn't notice the shuffle is used in the icmp, but not in the extract. Go ahead and update it. Thanks!

Closed by commit rL296575: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to… (authored by ashahid). · Feb 28 2017, 8:03 PM

This revision was automatically updated to reflect the committed changes.

Updated the patch to do a proper lane computation for external uses.

mkuper reopened this revision. · Mar 2 2017, 11:13 PM

This revision is now accepted and ready to land. · Mar 2 2017, 11:13 PM

mkuper added inline comments. · Mar 2 2017, 11:24 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2861	This is the only change in the patch w.r.t the previous version, right?
2864	Isn't this condition equivalent to E->ShuffleMask[i] == ExternalUse.Lane? I think that's a bit clearer, maybe. (I'm not really certain it's clearer, but I'd like to make sure I understand what's going on here) Regardless, please add a comment describing what this whole thing does.
test/Transforms/SLPVectorizer/X86/jumbled-same.ll
16	Ok, now this makes sense. :-)

ashahid added inline comments. · Mar 3 2017, 1:22 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2861	Yes
2864	Yes, this condition is equivalent to E->ShuffleMask[i] == ExternalUse.Lane and it does look clearer. Oh, sure I will add the comment.
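
One reading of the condition discussed for line 2864 is a search for the mask position that refers to the external use's lane. The sketch below shows that lookup in standard C++. The surrounding SLPVectorizer code is not shown in this thread, so `findShuffledLane` is a hypothetical name and the way the result would be used is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns the position I with ShuffleMask[I] == Lane,
// or -1 if the mask does not mention that lane.
int findShuffledLane(const std::vector<unsigned> &ShuffleMask, unsigned Lane) {
  assert(!ShuffleMask.empty() && "only meaningful for entries that carry a mask");
  auto It = std::find(ShuffleMask.begin(), ShuffleMask.end(), Lane);
  if (It == ShuffleMask.end())
    return -1;
  return static_cast<int>(It - ShuffleMask.begin());
}
```

With ShuffleMask = {2, 0, 3, 1}, a lookup for lane 3 returns position 2.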

Closed by commit rL296863: [SLP] Fixes the bug due to absence of in order uses of scalars which needs to… (authored by ashahid). · Mar 3 2017, 2:14 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                     Size
include/llvm/Analysis/LoopAccessAnalysis.h               7 lines
lib/Analysis/LoopAccessAnalysis.cpp                      32 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp               149 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-bug.ll    43 lines
test/Transforms/SLPVectorizer/X86/jumbled-same.ll        2 lines

Diff 90429

include/llvm/Analysis/LoopAccessAnalysis.h

[654 lines not shown]
 /// run-time assumptions.
 int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
                      const ValueToValueMap &StridesMap = ValueToValueMap(),
                      bool Assume = false, bool ShouldCheckWrap = true);

 /// \brief Try to sort an array of loads / stores.
 ///
 /// An array of loads / stores can only be sorted if all pointer operands
 /// refer to the same object, and the differences between these pointers
 /// are known to be constant. If that is the case, this returns true, and the
 /// sorted array is returned in \p Sorted. Otherwise, this returns false, and
 /// \p Sorted is invalid.
+// If \p Mask is not null, it also returns the \p Mask which is the shuffle
+// mask for actual memory access order.
 bool sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
-                     ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted);
+                     ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted,
+                     SmallVectorImpl<unsigned> *Mask = nullptr);

 /// \brief Returns true if the memory operations \p A and \p B are consecutive.
 /// This is a simple API that does not depend on the analysis pass.
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                          ScalarEvolution &SE, bool CheckType = true);

 /// \brief This analysis provides dependence information for the memory accesses
 /// of a loop.
[73 lines not shown]

lib/Analysis/LoopAccessAnalysis.cpp

[1,034 lines not shown]
   if (LoadInst *L = dyn_cast<LoadInst>(I))
     return L->getPointerAddressSpace();
   if (StoreInst *S = dyn_cast<StoreInst>(I))
     return S->getPointerAddressSpace();
   return -1;
 }

 bool llvm::sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
                            ScalarEvolution &SE,
-                           SmallVectorImpl<Value *> &Sorted) {
+                           SmallVectorImpl<Value *> &Sorted,
+                           SmallVectorImpl<unsigned> *Mask) {
   SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
   OffValPairs.reserve(VL.size());
   Sorted.reserve(VL.size());

   // Walk over the pointers, and map each of them to an offset relative to
   // first pointer in the array.
   Value *Ptr0 = getPointerOperand(VL[0]);
   const SCEV *Scev0 = SE.getSCEV(Ptr0);
   Value *Obj0 = GetUnderlyingObject(Ptr0, DL);

   for (auto *Val : VL) {
     // The only kind of access we care about here is load.
     if (!isa<LoadInst>(Val))
       return false;

     Value *Ptr = getPointerOperand(Val);
     assert(Ptr && "Expected value to have a pointer operand.");

[10 lines not shown; context: for (auto *Val : VL) {]
     // may just not be smart enough to figure out they do. Regardless,
     // there's nothing we can do.
     if (!Diff)
       return false;

     OffValPairs.emplace_back(Diff->getAPInt().getSExtValue(), Val);
   }

-  std::sort(OffValPairs.begin(), OffValPairs.end(),
-            [](const std::pair<int64_t, Value *> &Left,
-               const std::pair<int64_t, Value *> &Right) {
-              return Left.first < Right.first;
+  SmallVector<unsigned, 4> UseOrder(VL.size());
+  for (unsigned i = 0; i < VL.size(); i++) {
+    UseOrder[i] = i;
+  }
+
+  // Sort the memory accesses and keep the order of their uses in UseOrder.
+  std::sort(UseOrder.begin(), UseOrder.end(),
+            [&OffValPairs](unsigned Left, unsigned Right) {
+              return OffValPairs[Left].first < OffValPairs[Right].first;
             });

-  for (auto &it : OffValPairs)
-    Sorted.push_back(it.second);
+  for (unsigned i = 0; i < VL.size(); i++)
+    Sorted.emplace_back(OffValPairs[UseOrder[i]].second);
+
+  // Sort UseOrder to compute the Mask.
+  if (Mask) {
+    Mask->reserve(VL.size());
+    for (unsigned i = 0; i < VL.size(); i++)
+      Mask->emplace_back(i);
+    std::sort(Mask->begin(), Mask->end(),
+              [&UseOrder](unsigned Left, unsigned Right) {
+                return UseOrder[Left] < UseOrder[Right];
+              });
+  }

   return true;
 }

 /// Returns true if the memory operations \p A and \p B are consecutive.
 bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                                ScalarEvolution &SE, bool CheckType) {
   Value *PtrA = getPointerOperand(A);
[1,142 lines not shown]

lib/Transforms/Vectorize/SLPVectorizer.cpp

[417 lines not shown]
 private:

   /// This is the recursive part of buildTree.
   void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);

   /// \returns True if the ExtractElement/ExtractValue instructions in VL can
   /// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
   bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;

-  /// Vectorize a single entry in the tree. VL icontains all isomorphic scalars
-  /// in order of its usage in a user program, for example ADD1, ADD2 and so on
-  /// or LOAD1 , LOAD2 etc.
-  Value *vectorizeTree(ArrayRef<Value *> VL, TreeEntry *E);
+  /// Vectorize a single entry in the tree.
+  Value *vectorizeTree(TreeEntry *E);

   /// Vectorize a single entry in the tree, starting in \p VL.
   Value *vectorizeTree(ArrayRef<Value *> VL);

   /// \returns the pointer to the vectorized value if \p VL is already
   /// vectorized, or NULL. They may happen in cycles.
   Value *alreadyVectorized(ArrayRef<Value *> VL) const;

[23 lines not shown; context: void reorderAltShuffleOperands(ArrayRef<Value *> VL,]
                                  SmallVectorImpl<Value *> &Left,
                                  SmallVectorImpl<Value *> &Right);
   /// \reorder commutative operands to get better probability of
   /// generating vectorized code.
   void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
                                       SmallVectorImpl<Value *> &Left,
                                       SmallVectorImpl<Value *> &Right);
   struct TreeEntry {
-    TreeEntry() : Scalars(), VectorizedValue(nullptr),
-    NeedToGather(0), NeedToShuffle(0) {}
+    TreeEntry()
+        : Scalars(), VectorizedValue(nullptr), NeedToGather(0), ShuffleMask() {}

     /// \returns true if the scalars in VL are equal to this entry.
     bool isSame(ArrayRef<Value *> VL) const {
       assert(VL.size() == Scalars.size() && "Invalid size");
       return std::equal(VL.begin(), VL.end(), Scalars.begin());
     }

     /// \returns true if the scalars in VL are found in this tree entry.
[11 lines not shown; context: struct TreeEntry {]
     ValueList Scalars;

     /// The Scalars are vectorized into this value. It is initialized to Null.
     Value *VectorizedValue;

     /// Do we need to gather this sequence ?
     bool NeedToGather;

-    /// Do we need to shuffle the load ?
-    bool NeedToShuffle;
+    /// Records optional suffle mask for jumbled memory accesses in this.
+    SmallVector<unsigned, 8> ShuffleMask;

   };

   /// Create a new VectorizableTree entry.
   TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,
-                          bool NeedToShuffle) {
+                          ArrayRef<unsigned> ShuffleMask = None) {
     VectorizableTree.emplace_back();
     int idx = VectorizableTree.size() - 1;
     TreeEntry *Last = &VectorizableTree[idx];
     Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
     Last->NeedToGather = !Vectorized;
-    Last->NeedToShuffle = NeedToShuffle;
+    if (!ShuffleMask.empty())
+      Last->ShuffleMask.insert(Last->ShuffleMask.begin(), ShuffleMask.begin(),
+                               ShuffleMask.end());

     if (Vectorized) {
       for (int i = 0, e = VL.size(); i != e; ++i) {
         assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");
         ScalarToTreeEntry[VL[i]] = idx;
       }
     } else {
       MustGather.insert(VL.begin(), VL.end());
     }
[506 lines not shown]


void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {
bool isAltShuffle = false;		bool isAltShuffle = false;
assert((allConstant(VL) \|\| allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) \|\| allSameType(VL)) && "Invalid types!");

if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}

// Don't handle vectors.		// Don't handle vectors.
if (VL[0]->getType()->isVectorTy()) {		if (VL[0]->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}

if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
if (SI->getValueOperand()->getType()->isVectorTy()) {		if (SI->getValueOperand()->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
unsigned Opcode = getSameOpcode(VL);		unsigned Opcode = getSameOpcode(VL);

// Check that this shuffle vector refers to the alternate		// Check that this shuffle vector refers to the alternate
// sequence of opcodes.		// sequence of opcodes.
if (Opcode == Instruction::ShuffleVector) {		if (Opcode == Instruction::ShuffleVector) {
Instruction *I0 = dyn_cast<Instruction>(VL[0]);		Instruction *I0 = dyn_cast<Instruction>(VL[0]);
unsigned Op = I0->getOpcode();		unsigned Op = I0->getOpcode();
if (Op != Instruction::ShuffleVector)		if (Op != Instruction::ShuffleVector)
isAltShuffle = true;		isAltShuffle = true;
}		}

// If all of the operands are identical or constant we have a simple solution.		// If all of the operands are identical or constant we have a simple solution.
if (allConstant(VL) \|\| isSplat(VL) \|\| !allSameBlock(VL) \|\| !Opcode) {		if (allConstant(VL) \|\| isSplat(VL) \|\| !allSameBlock(VL) \|\| !Opcode) {
DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");		DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}

// We now know that this is a vector of instructions of the same type from		// We now know that this is a vector of instructions of the same type from
// the same block.		// the same block.

// Don't vectorize ephemeral values.		// Don't vectorize ephemeral values.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (EphValues.count(VL[i])) {		if (EphValues.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is ephemeral.\n");		") is ephemeral.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

// Check if this is a duplicate of another entry.		// Check if this is a duplicate of another entry.
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");		DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");
if (E->Scalars[i] != VL[i]) {		if (E->Scalars[i] != VL[i]) {
DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");		DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}
DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");		DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");
return;		return;
}		}

// Check that none of the instructions in the bundle are already in the tree.		// Check that none of the instructions in the bundle are already in the tree.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (ScalarToTreeEntry.count(VL[i])) {		if (ScalarToTreeEntry.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is already in tree.\n");		") is already in tree.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

// If any of the scalars is marked as a value that needs to stay scalar then		// If any of the scalars is marked as a value that needs to stay scalar then
// we need to gather the scalars.		// we need to gather the scalars.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (MustGather.count(VL[i])) {		if (MustGather.count(VL[i])) {
DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");		DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

// Check that all of the users of the scalars that we want to vectorize are		// Check that all of the users of the scalars that we want to vectorize are
// schedulable.		// schedulable.
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
BasicBlock *BB = cast<Instruction>(VL0)->getParent();		BasicBlock *BB = cast<Instruction>(VL0)->getParent();

if (!DT->isReachableFromEntry(BB)) {		if (!DT->isReachableFromEntry(BB)) {
// Don't go into unreachable blocks. They may contain instructions with		// Don't go into unreachable blocks. They may contain instructions with
// dependency cycles which confuse the final scheduling.		// dependency cycles which confuse the final scheduling.
DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");		DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}

// Check that every instructions appears once in this bundle.		// Check that every instructions appears once in this bundle.
for (unsigned i = 0, e = VL.size(); i < e; ++i)		for (unsigned i = 0, e = VL.size(); i < e; ++i)
for (unsigned j = i+1; j < e; ++j)		for (unsigned j = i+1; j < e; ++j)
if (VL[i] == VL[j]) {		if (VL[i] == VL[j]) {
DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");		DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = BlocksSchedules[BB];
if (!BSRef) {		if (!BSRef) {
BSRef = llvm::make_unique<BlockScheduling>(BB);		BSRef = llvm::make_unique<BlockScheduling>(BB);
}		}
BlockScheduling &BS = *BSRef.get();		BlockScheduling &BS = *BSRef.get();

if (!BS.tryScheduleBundle(VL, this)) {		if (!BS.tryScheduleBundle(VL, this)) {
DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
assert((!BS.getScheduleData(VL[0]) ||		assert((!BS.getScheduleData(VL[0]) ||
!BS.getScheduleData(VL[0])->isPartOfBundle()) &&		!BS.getScheduleData(VL[0])->isPartOfBundle()) &&
"tryScheduleBundle should cancelScheduling on failure");		"tryScheduleBundle should cancelScheduling on failure");
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");		DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");

switch (Opcode) {		switch (Opcode) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);

// Check for terminator values (e.g. invoke).		// Check for terminator values (e.g. invoke).
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
TerminatorInst *Term = dyn_cast<TerminatorInst>(		TerminatorInst *Term = dyn_cast<TerminatorInst>(
cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));		cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));
if (Term) {		if (Term) {
DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");		DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");		DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, Opcode);		bool Reuse = canReuseExtract(VL, Opcode);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
} else {		} else {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
}		}
newTreeEntry(VL, Reuse, false);		newTreeEntry(VL, Reuse);
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load.		// load.
// For example we don't want vectorize loads that are smaller than 8 bit.		// For example we don't want vectorize loads that are smaller than 8 bit.
// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats		// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats
// loading/storing it as an i8 struct. If we vectorize loads/stores from		// loading/storing it as an i8 struct. If we vectorize loads/stores from
// such a struct we read/write packed bits disagreeing with the		// such a struct we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
LoadInst *L = cast<LoadInst>(VL[i]);		LoadInst *L = cast<LoadInst>(VL[i]);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
bool Consecutive = true;		bool Consecutive = true;
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
ReverseConsecutive = false;		ReverseConsecutive = false;
}		}
}		}

if (Consecutive) {		if (Consecutive) {
++NumLoadsWantToKeepOrder;		++NumLoadsWantToKeepOrder;
newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		return;
}		}

// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

if (VL.size() > 2 && !ReverseConsecutive) {		if (VL.size() > 2 && !ReverseConsecutive) {
bool ShuffledLoads = true;		bool ShuffledLoads = true;
SmallVector<Value *, 8> Sorted;		SmallVector<Value *, 8> Sorted;
if (sortMemAccesses(VL, DL, SE, Sorted)) {		SmallVector<unsigned, 4> Mask;
		if (sortMemAccesses(VL, DL, SE, Sorted, &Mask)) {
auto NewVL = makeArrayRef(Sorted.begin(), Sorted.end());		auto NewVL = makeArrayRef(Sorted.begin(), Sorted.end());
for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(NewVL[i], NewVL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(NewVL[i], NewVL[i + 1], DL, SE)) {
ShuffledLoads = false;		ShuffledLoads = false;
break;		break;
}		}
}		}
if (ShuffledLoads) {		if (ShuffledLoads) {
newTreeEntry(NewVL, true, true);		newTreeEntry(NewVL, true, makeArrayRef(Mask.begin(), Mask.end()));
return;		return;
}		}
}		}
}		}

BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);

if (ReverseConsecutive) {		if (ReverseConsecutive) {
++NumLoadsWantToChangeOrder;		++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
} else {		} else {
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
}		}
return;		return;
... 10 lines hidden ...	switch (Opcode) {
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();
for (Value *Val : VL) {		for (Value *Val : VL) {
Type *Ty = cast<Instruction>(Val)->getOperand(0)->getType();		Type *Ty = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty != SrcTy || !isValidElementType(Ty)) {		if (Ty != SrcTy || !isValidElementType(Ty)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");		DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");
return;		return;
}		}
}		}
newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of casts.\n");		DEBUG(dbgs() << "SLP: added a vector of casts.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth+1);		buildTree_rec(Operands, Depth+1);
}		}
return;		return;
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Check that all of the compares have the same predicate.		// Check that all of the compares have the same predicate.
CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();		Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();
for (unsigned i = 1, e = VL.size(); i < e; ++i) {		for (unsigned i = 1, e = VL.size(); i < e; ++i) {
CmpInst *Cmp = cast<CmpInst>(VL[i]);		CmpInst *Cmp = cast<CmpInst>(VL[i]);
if (Cmp->getPredicate() != P0 ||		if (Cmp->getPredicate() != P0 ||
Cmp->getOperand(0)->getType() != ComparedTy) {		Cmp->getOperand(0)->getType() != ComparedTy) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");		DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");
return;		return;
}		}
}		}

newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of compares.\n");		DEBUG(dbgs() << "SLP: added a vector of compares.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

... 15 lines hidden ...	switch (Opcode) {
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right);		reorderInputsAccordingToOpcode(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
... 12 lines hidden ...	case Instruction::Xor: {
return;		return;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
// We don't combine GEPs with complicated (nested) indexing.		// We don't combine GEPs with complicated (nested) indexing.
for (Value *Val : VL) {		for (Value *Val : VL) {
if (cast<Instruction>(Val)->getNumOperands() != 2) {		if (cast<Instruction>(Val)->getNumOperands() != 2) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

// We can't combine several GEPs into one vector if they operate on		// We can't combine several GEPs into one vector if they operate on
// different types.		// different types.
Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();		Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();
for (Value *Val : VL) {		for (Value *Val : VL) {
Type *CurTy = cast<Instruction>(Val)->getOperand(0)->getType();		Type *CurTy = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty0 != CurTy) {		if (Ty0 != CurTy) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

// We don't combine GEPs with non-constant indexes.		// We don't combine GEPs with non-constant indexes.
for (Value *Val : VL) {		for (Value *Val : VL) {
auto Op = cast<Instruction>(Val)->getOperand(1);		auto Op = cast<Instruction>(Val)->getOperand(1);
if (!isa<ConstantInt>(Op)) {		if (!isa<ConstantInt>(Op)) {
DEBUG(		DEBUG(
dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");		dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");		DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");
for (unsigned i = 0, e = 2; i < e; ++i) {		for (unsigned i = 0, e = 2; i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::Store: {		case Instruction::Store: {
// Check if the stores are consecutive or of we need to swizzle them.		// Check if the stores are consecutive or of we need to swizzle them.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Non-consecutive store.\n");		DEBUG(dbgs() << "SLP: Non-consecutive store.\n");
return;		return;
}		}

newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of stores.\n");		DEBUG(dbgs() << "SLP: added a vector of stores.\n");

ValueList Operands;		ValueList Operands;
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(0));		Operands.push_back(cast<Instruction>(j)->getOperand(0));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
return;		return;
}		}
case Instruction::Call: {		case Instruction::Call: {
// Check if the calls are all to the same vectorizable intrinsic.		// Check if the calls are all to the same vectorizable intrinsic.
CallInst *CI = cast<CallInst>(VL[0]);		CallInst *CI = cast<CallInst>(VL[0]);
// Check if this is an Intrinsic call or something that can be		// Check if this is an Intrinsic call or something that can be
// represented by an intrinsic call		// represented by an intrinsic call
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (!isTriviallyVectorizable(ID)) {		if (!isTriviallyVectorizable(ID)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");		DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
return;		return;
}		}
Function *Int = CI->getCalledFunction();		Function *Int = CI->getCalledFunction();
Value *A1I = nullptr;		Value *A1I = nullptr;
if (hasVectorInstrinsicScalarOpd(ID, 1))		if (hasVectorInstrinsicScalarOpd(ID, 1))
A1I = CI->getArgOperand(1);		A1I = CI->getArgOperand(1);
for (unsigned i = 1, e = VL.size(); i != e; ++i) {		for (unsigned i = 1, e = VL.size(); i != e; ++i) {
CallInst *CI2 = dyn_cast<CallInst>(VL[i]);		CallInst *CI2 = dyn_cast<CallInst>(VL[i]);
if (!CI2 || CI2->getCalledFunction() != Int ||		if (!CI2 || CI2->getCalledFunction() != Int ||
getVectorIntrinsicIDForCall(CI2, TLI) != ID ||		getVectorIntrinsicIDForCall(CI2, TLI) != ID ||
!CI->hasIdenticalOperandBundleSchema(*CI2)) {		!CI->hasIdenticalOperandBundleSchema(*CI2)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: mismatched calls:" << CI << "!=" << VL[i]		DEBUG(dbgs() << "SLP: mismatched calls:" << CI << "!=" << VL[i]
<< "\n");		<< "\n");
return;		return;
}		}
// ctlz,cttz and powi are special intrinsics whose second argument		// ctlz,cttz and powi are special intrinsics whose second argument
// should be same in order for them to be vectorized.		// should be same in order for them to be vectorized.
if (hasVectorInstrinsicScalarOpd(ID, 1)) {		if (hasVectorInstrinsicScalarOpd(ID, 1)) {
Value *A1J = CI2->getArgOperand(1);		Value *A1J = CI2->getArgOperand(1);
if (A1I != A1J) {		if (A1I != A1J) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI		DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI
<< " argument "<< A1I<<"!=" << A1J		<< " argument "<< A1I<<"!=" << A1J
<< "\n");		<< "\n");
return;		return;
}		}
}		}
// Verify that the bundle operands are identical between the two calls.		// Verify that the bundle operands are identical between the two calls.
if (CI->hasOperandBundles() &&		if (CI->hasOperandBundles() &&
!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),		!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),
CI->op_begin() + CI->getBundleOperandsEndIndex(),		CI->op_begin() + CI->getBundleOperandsEndIndex(),
CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {		CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="		DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="
<< *VL[i] << '\n');		<< *VL[i] << '\n');
return;		return;
}		}
}		}

newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {		for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL) {		for (Value *j : VL) {
CallInst *CI2 = dyn_cast<CallInst>(j);		CallInst *CI2 = dyn_cast<CallInst>(j);
Operands.push_back(CI2->getArgOperand(i));		Operands.push_back(CI2->getArgOperand(i));
}		}
buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
// If this is not an alternate sequence of opcode like add-sub		// If this is not an alternate sequence of opcode like add-sub
// then do not vectorize this instruction.		// then do not vectorize this instruction.
if (!isAltShuffle) {		if (!isAltShuffle) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true, false);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderAltShuffleOperands(VL, Left, Right);		reorderAltShuffleOperands(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
default:		default:
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N;		unsigned N;
Type *EltTy;		Type *EltTy;
... 222 lines hidden ...	switch (Opcode) {
}		}
case Instruction::Load: {		case Instruction::Load: {
// Cost of wide load - cost of scalar loads.		// Cost of wide load - cost of scalar loads.
unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0);		VecTy, alignment, 0);
if (E->NeedToShuffle) {		if (!E->ShuffleMask.empty()) {
VecLdCost += TTI->getShuffleCost(		VecLdCost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy, 0);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy, 0);
}		}
return VecLdCost - ScalarLdCost;		return VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
... 549 lines hidden ...	Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL) const {
}		}
return nullptr;		return nullptr;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
if (E->isSame(VL) || (E->NeedToShuffle && E->isFoundJumbled(VL, DL, SE)))		if (E->isSame(VL) ||
return vectorizeTree(VL, E);		(!E->ShuffleMask.empty() && E->isFoundJumbled(VL, DL, SE)))
		return vectorizeTree(E);
}		}

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

return Gather(VL, VecTy);		return Gather(VL, VecTy);
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

if (E->VectorizedValue && !E->NeedToShuffle) {		if (E->VectorizedValue && E->ShuffleMask.empty()) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Instruction *VL0 = cast<Instruction>(E->Scalars[0]);		Instruction *VL0 = cast<Instruction>(E->Scalars[0]);
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL0))		if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
... 221 lines hidden ...	case Instruction::Load: {
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
E->VectorizedValue = LI;		E->VectorizedValue = LI;
++NumVectorInstructions;		++NumVectorInstructions;
propagateMetadata(LI, E->Scalars);		propagateMetadata(LI, E->Scalars);

// As program order of scalar loads are jumbled, the vectorized 'load'		// As program order of scalar loads are jumbled, the vectorized 'load'
// must be followed by a 'shuffle' with the required jumbled mask.		// must be followed by a 'shuffle' with the required jumbled mask.
if (!VL.empty() && (E->NeedToShuffle)) {		if (!E->ShuffleMask.empty()) {
assert(VL.size() == E->Scalars.size() &&
"Equal number of scalars expected");
SmallVector<Constant *, 8> Mask;		SmallVector<Constant *, 8> Mask;
for (Value *Val : VL) {		for (unsigned Lane = 0, LE = E->ShuffleMask.size(); Lane != LE;
if (ScalarToTreeEntry.count(Val)) {		++Lane) {
int Idx = ScalarToTreeEntry[Val];		Mask.push_back(Builder.getInt32(E->ShuffleMask[Lane]));
TreeEntry *E = &VectorizableTree[Idx];
for (unsigned Lane = 0, LE = VL.size(); Lane != LE; ++Lane) {
if (E->Scalars[Lane] == Val) {
Mask.push_back(Builder.getInt32(Lane));
break;
}
}
}
}		}

// Generate shuffle for jumbled memory access		// Generate shuffle for jumbled memory access
Value *Undef = UndefValue::get(VecTy);		Value *Undef = UndefValue::get(VecTy);
Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,		Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,
ConstantVector::get(Mask));		ConstantVector::get(Mask));
		E->VectorizedValue = Shuf;
		mkuper: This seems unrelated. Should it have been here before as well?
		ashahid: Earlier this path was not tested properly, but thanks to this bug we found it. In this particular case the vectorized value is taken from the tree entry data instead of the return value, so if we don't update E->VectorizedValue, the "shufflevector" IR will not be used; the vectorized load IR will be used instead, which is not the intention.
		mkuper: Right, that's more or less what I thought.
		ashahid: Oh, I missed one thing. I think you added a test, jumbled-same.ll, for a recent fix. I think this test needs to be updated now to account for the use of the shuffled value of the loaded vector in "extractelement"?
		mkuper: Argh, yes, I didn't notice the shuffle is used in the icmp, but not in the extract. Go ahead and update it. Thanks!
		++NumVectorInstructions;
return Shuf;		return Shuf;
}		}

return LI;		return LI;
}		}
case Instruction::Store: {		case Instruction::Store: {
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
... 168 lines hidden ...
BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {		BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {

// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : BlocksSchedules) {
scheduleBlock(BSIter.second.get());		scheduleBlock(BSIter.second.get());
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(ArrayRef<Value *>(), &VectorizableTree[0]);		auto *VectorRoot = vectorizeTree(&VectorizableTree[0]);

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0].Scalars[0];		auto *ScalarRoot = VectorizableTree[0].Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot))		if (auto *I = dyn_cast<Instruction>(VectorRoot))
Builder.SetInsertPoint(&*++BasicBlock::iterator(I));		Builder.SetInsertPoint(&*++BasicBlock::iterator(I));
		for (const auto &ExternalUse : ExternalUses) {
assert(ScalarToTreeEntry.count(Scalar) && "Invalid scalar");		assert(ScalarToTreeEntry.count(Scalar) && "Invalid scalar");

int Idx = ScalarToTreeEntry[Scalar];		int Idx = ScalarToTreeEntry[Scalar];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
assert(!E->NeedToGather && "Extracting from a gather list");		assert(!E->NeedToGather && "Extracting from a gather list");

Value *Vec = E->VectorizedValue;		Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");		assert(Vec && "Can't find vectorizable value");
		unsigned i = 0;
Value *Lane = Builder.getInt32(ExternalUse.Lane);		Value *Lane;
		if (!E->ShuffleMask.empty()) {
		mkuper: This is the only change in the patch w.r.t. the previous version, right?
		ashahid (Author): Yes.
		SmallVector<unsigned, 4> Val(E->ShuffleMask.size());
		for (; i < E->ShuffleMask.size(); i++) {
		if (E->Scalars[E->ShuffleMask[i]] == Scalar)
		mkuper: Isn't this condition equivalent to E->ShuffleMask[i] == ExternalUse.Lane? I think that's a bit clearer, maybe. (I'm not really certain it's clearer, but I'd like to make sure I understand what's going on here.) Regardless, please add a comment describing what this whole thing does.
		ashahid (Author): Yes, this condition is equivalent to E->ShuffleMask[i] == ExternalUse.Lane, and it does look clearer. Oh, sure, I will add the comment.
		break;
		}
		Lane = Builder.getInt32(i);
		} else {
		Lane = Builder.getInt32(ExternalUse.Lane);
		}
// If User == nullptr, the Scalar is used as extra arg. Generate		// If User == nullptr, the Scalar is used as extra arg. Generate
// ExtractElement instruction and update the record for this scalar in		// ExtractElement instruction and update the record for this scalar in
// ExternallyUsedValues.		// ExternallyUsedValues.
if (!User) {		if (!User) {
assert(ExternallyUsedValues.count(Scalar) &&		assert(ExternallyUsedValues.count(Scalar) &&
"Scalar with nullptr as an external user must be registered in "		"Scalar with nullptr as an external user must be registered in "
"ExternallyUsedValues map");		"ExternallyUsedValues map");
if (auto *VecI = dyn_cast<Instruction>(Vec)) {		if (auto *VecI = dyn_cast<Instruction>(Vec)) {

test/Transforms/SLPVectorizer/X86/jumbled-load-bug.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -S -slp-vectorizer | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define <4 x i32> @zot() #0 {
				; CHECK-LABEL: @zot(
				; CHECK-NEXT: bb:
				; CHECK-NEXT: [[P0:%.*]] = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 0
				; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 1
				; CHECK-NEXT: [[P2:%.*]] = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 2
				; CHECK-NEXT: [[P3:%.*]] = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 3
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast i8* [[P0]] to <4 x i8>*
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i8>, <4 x i8>* [[TMP0]], align 1
				; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i8> [[TMP1]], <4 x i8> undef, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i8> [[TMP2]], i32 0
				; CHECK-NEXT: [[I0:%.*]] = insertelement <4 x i8> undef, i8 [[TMP3]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i8> [[TMP2]], i32 1
				; CHECK-NEXT: [[I1:%.*]] = insertelement <4 x i8> [[I0]], i8 [[TMP4]], i32 1
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i8> [[TMP2]], i32 2
				; CHECK-NEXT: [[I2:%.*]] = insertelement <4 x i8> [[I1]], i8 [[TMP5]], i32 2
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x i8> [[TMP2]], i32 3
				; CHECK-NEXT: [[I3:%.*]] = insertelement <4 x i8> [[I2]], i8 [[TMP6]], i32 3
				; CHECK-NEXT: [[RET:%.*]] = zext <4 x i8> [[I3]] to <4 x i32>
				; CHECK-NEXT: ret <4 x i32> [[RET]]
				;
				bb:
				%p0 = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 0
				%p1 = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 1
				%p2 = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 2
				%p3 = getelementptr inbounds [4 x i8], [4 x i8]* undef, i64 undef, i64 3
				%v3 = load i8, i8* %p3, align 1
				%v2 = load i8, i8* %p2, align 1
				%v0 = load i8, i8* %p0, align 1
				%v1 = load i8, i8* %p1, align 1
				%i0 = insertelement <4 x i8> undef, i8 %v1, i32 0
				%i1 = insertelement <4 x i8> %i0, i8 %v0, i32 1
				%i2 = insertelement <4 x i8> %i1, i8 %v2, i32 2
				%i3 = insertelement <4 x i8> %i2, i8 %v3, i32 3
				%ret = zext <4 x i8> %i3 to <4 x i32>
				ret <4 x i32> %ret
				}
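
The lane lookup discussed in the inline comments on the vectorizeTree hunk can be sketched in isolation. This is only an illustration under stated assumptions, not the code that landed: `remapExternalLane` is a hypothetical helper name, and `std::vector` stands in for the `ShuffleMask` member of `TreeEntry`. It uses the simplified condition mkuper suggested (`E->ShuffleMask[i] == ExternalUse.Lane`): lane `i` of the shuffled vector holds element `ShuffleMask[i]` of the loaded vector, so the extract index is the position in the mask whose value equals the original lane.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the patch): given the shuffle mask stored
// on a tree entry and the lane recorded for an external use, return the lane
// to extract from the shuffled vector.
unsigned remapExternalLane(const std::vector<unsigned> &ShuffleMask,
                           unsigned Lane) {
  // No mask means the loads were already in order; use the lane as recorded.
  if (ShuffleMask.empty())
    return Lane;
  // Find the position I whose mask entry selects the original lane.
  for (std::size_t I = 0; I < ShuffleMask.size(); ++I)
    if (ShuffleMask[I] == Lane)
      return static_cast<unsigned>(I);
  // Like the loop in the hunk, fall through with I == size when no entry
  // matches; a valid permutation mask should never reach this point.
  return static_cast<unsigned>(ShuffleMask.size());
}
```

With the mask `<1, 0, 2, 3>` from the test above, original lane 1 (`%v1`) maps to lane 0 of the shuffled vector and original lane 0 (`%v0`) maps to lane 1, which matches the extractelement/insertelement pairs in the CHECK lines.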

test/Transforms/SLPVectorizer/X86/jumbled-same.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer			; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
				mkuper: Ok, now this makes sense. :-)
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>			; CHECK-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4			; CHECK-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;