This is an archive of the discontinued LLVM Phabricator instance.

Differential D46126

[SLP] Vectorize transposable binary operand bundles
Needs Review
Public

Authored by mssimpso on Apr 26 2018, 8:44 AM.


Details

Reviewers

ABataev
Ayal
RKSimon
mkuper
javed.absar

Summary

Given a bundle of binary operations, we normally try to vectorize their operands by bundling all operand-zero values together and bundling all operand-one values together. Thus, assuming all operands are instructions, each operand bundle must have the same opcode to be vectorizable. If this is not the case, the operand bundles must be gathered. Instead of gathering, we can try to "transpose" the operand bundles such that each resulting bundle will have a single opcode.

This patch adds the ability to transpose binary operand bundles to enable vectorization. When we transpose the operand bundles, we bundle together all operands of one or more instructions, rather than bundling together a single operand (e.g., operand-zero) of several instructions. This bundling is similar to that performed by tryToVectorizePair. If a transpose is performed while building a vectorizable tree, we must re-transpose the vectors at run-time to restore the correct bundling. The transpose is performed with shufflevector instructions that read corresponding even- or odd-numbered vector elements from two n-dimensional source vectors and write each result into consecutive elements of an n-dimensional destination vector. The cost of these added shufflevector instructions is included in the cost model.

Please refer to the modified test cases for examples.
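
As a rough illustration of the summary, the following standalone sketch replays the two-instruction case with plain C++ containers instead of LLVM vectors. The helper `shuffle2` is hypothetical: it only mimics how a two-source shufflevector indexes the concatenation of its inputs, and nothing here is code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a two-source shufflevector: mask index i selects
// element i of the concatenation <A..., B...>.
std::vector<int> shuffle2(const std::vector<int> &A, const std::vector<int> &B,
                          const std::vector<unsigned> &Mask) {
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (unsigned Idx : Mask)
    Result.push_back(Idx < A.size() ? A[Idx] : B[Idx - A.size()]);
  return Result;
}

// Scalar form:
//   r0 = (s0 + s1) + (s2 + s3)
//   r1 = (s4 - s5) + (s6 - s7)
// Transposed form: bundle all operands of each instruction, then re-transpose
// with two shuffles before the final add.
std::vector<int> transposedAdd(const std::array<int, 8> &S) {
  std::vector<int> A = {S[0] + S[1], S[2] + S[3]}; // single-opcode "add" bundle
  std::vector<int> B = {S[4] - S[5], S[6] - S[7]}; // single-opcode "sub" bundle
  std::vector<int> C = shuffle2(A, B, {0, 2});     // operand-zero bundle
  std::vector<int> D = shuffle2(A, B, {1, 3});     // operand-one bundle
  return {C[0] + D[0], C[1] + D[1]};
}
```

Calling `transposedAdd` on eight scalars should give the same two results as evaluating the scalar expressions directly, which is the property the re-transposing shuffles are meant to preserve.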

Diff Detail

Repository

rL LLVM

Build Status

Buildable 17581
Build 17581: arc lint + arc unit

Event Timeline

mssimpso created this revision. Apr 26 2018, 8:44 AM

Herald added a reviewer: javed.absar. Apr 26 2018, 8:44 AM

Will it work correctly if some of the operations are used several times in the bundles? It would be good to have the tests for this kind of situation.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1444–1460	Maybe it's worth it to merge these 2 checks into one to reduce the number of iterations?
1491	`auto &`->`const BoUpSLP::ValueList &` or, probably, you may use `ArrayRef<Value *>` here
1495	Try to use `Bundles.emplace_back(Slice.begin(), Slice.end());`
1982	`auto`->`Optional<SmallVector<BoUpSLP::ValueList, 4>>`
1983	`auto &`->`const auto &`
2494	Pre-reserve the space for the Operands here.
2495	Maybe it is worth storing the Transposed bundles in the `TreeNode` so that this kind of analysis is not performed several times?

In D46126#1079801, @ABataev wrote:

Will it work correctly if some of the operations are used several times in the bundles? It would be good to have the tests for this kind of situation.

Thanks for taking a look at the patch, Alexey! It does work for the reuse case, but the operations still need to be in transposed form. I'll add a test for this. I already have one reuse test case (build_vec_v4i32_reuse), but this shows reuse in the final binop, not in its operands.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1444–1460	Sure, that makes sense.
2495	Yes, that was my thought as well. I'll update the patch.

This is reminiscent of LV's interleave group optimization, in the sense that a couple of correlated inefficient vector "gathers" are replaced by a couple of efficiently formed vectors followed by transposing shuffles. The correlated gathers may come from the two operands of a binary operation, as in this patch, or more generally from arbitrary leaves of the SLP tree.

lib/Analysis/VectorUtils.cpp
535	Could `createStrideMask` be (re)used?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1470	Both operands are mapped to the opcode of operand 0?
1528	May be easier to follow if `ElemsPerSrcVec` is initially set to `BundleVF` and doubled inside the loop, `ElemsPerDstVec` is set to its double, and `VF` renamed to something like `NumVectors`.
1980	Comment that transposing operands each having a common opcode is mutually exclusive with swapping commutative operands below, and should precede it?

In D46126#1082870, @Ayal wrote:

This is reminiscent of LV's interleave group optimization, in the sense that a couple of correlated inefficient vector "gathers" are replaced by a couple of efficiently formed vectors followed by transposing shuffles. The correlated gathers may come from the two operands of a binary operation, as in this patch, or more generally from arbitrary leaves of the SLP tree.

Thanks for taking a look at the patch, Ayal! Yes, that's right. It's a little like LV's interleave groups. While the "correlated gathers" would be leaves of the tree without this patch, if we can perform a transpose, we can continue recursively building a deeper tree based on the shuffled bundles. That's a good summary that could be incorporated in the comment above transposeBinOpBundle.

And yes, the approach is currently limited to the two operands of binary operations. But it could be generalized to include the operands of other instructions, or as you point out, any correlated gathers in the SLP tree.

lib/Analysis/VectorUtils.cpp
535	Not directly, but we could use `createStrideMask` to create two stride-2 masks and then interleave them.
	stride_mask_0 = <0, 2, 4, 6>
	stride_mask_1 = <8, 10, 12, 14>
	transpose_mask = <0, 8, 2, 10, 4, 12, 6, 14>
	What do you think?
lib/Transforms/Vectorize/SLPVectorizer.cpp
1470	I've already checked that both operands have the same opcode (starting at line 1423). I also added a comment here, but this is probably still a bit confusing. I will rewrite this to make it more clear.
1528	Sounds good.
1980	Yes, that's a good idea.
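
The relationship between the two stride-2 masks and the transpose mask in the reply above can be checked with a small standalone sketch. `strideMask` and `interleave` are hypothetical helpers written for this illustration (the first only imitates the `<Start, Start + Stride, ...>` shape associated with `createStrideMask`), and `transposeMask` repeats the loop from `createTransposeMask` in this diff using plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: <Start, Start + Stride, ..., Start + Stride * (VF - 1)>.
std::vector<unsigned> strideMask(unsigned Start, unsigned Stride, unsigned VF) {
  std::vector<unsigned> Mask;
  for (unsigned I = 0; I < VF; ++I)
    Mask.push_back(Start + I * Stride);
  return Mask;
}

// Hypothetical: interleave two equally sized masks as <A0, B0, A1, B1, ...>.
std::vector<unsigned> interleave(const std::vector<unsigned> &A,
                                 const std::vector<unsigned> &B) {
  assert(A.size() == B.size() && "masks must have the same length");
  std::vector<unsigned> Mask;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Mask.push_back(A[I]);
    Mask.push_back(B[I]);
  }
  return Mask;
}

// Same index arithmetic as createTransposeMask in the diff, on plain integers.
std::vector<unsigned> transposeMask(unsigned VF, bool IsOdd) {
  std::vector<unsigned> Mask;
  unsigned StartVal = IsOdd ? 1 : 0;
  for (unsigned I = 0; I < VF; I += 2)
    for (unsigned J = 0; J < 2; ++J)
      Mask.push_back(StartVal + I + VF * J);
  return Mask;
}
```

For VF = 8, interleaving `strideMask(0, 2, 4)` with `strideMask(8, 2, 4)` should reproduce the even transpose mask quoted in the reply.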

Ayal added inline comments. May 1 2018, 8:45 AM

lib/Analysis/VectorUtils.cpp
535	I'm a bit confused about the concatenation order which presumably leads to this interleaved strided mask, rather than using the strided <0, 2, 4, ..., 2*(VF-1)> mask directly. See comments below, also examining the VF=4 tests.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1470	Ahh, of course...
1479	Making all bundles have the same (smallest) size certainly simplifies things, but potentially misses performance opportunities, at least in theory. Perhaps this deserves a comment. It suffices to check that `MinSize` is a power of 2, and that all bundle sizes are divisible by MinSize, rather than requiring all bundle sizes to be powers of 2. E.g., having one bundle of size 2 and another of size 6 should be fine, iiuc.
1491	This is a bit confusing: bundles are traversed according to their insertion order into Opcode2Operands, i.e., according to original operand order; but are aggregated and placed inside `Bundles` according to their opcode?
test/Transforms/SLPVectorizer/AArch64/transpose.ll
93	Indeed this calls for shuffling with <0,4,2,6> and <1,5,3,7> masks; but is this pattern more natural to expect than
	%tmp2.1 = add i32 %tmp0.2, %tmp0.3
	%tmp2.2 = add i32 %tmp1.0, %tmp1.1
	Perhaps in some complex numbers context(?)
168	Could this be done equally well with 'normal' strided masks, i.e.:
	; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
183	This admittedly refers to the original test: `%tmp1.2 == %tmp0.2` and `%tmp1.3 == %tmp0.3`. Not sure if this was intentional, but it does raise the issue of exercising bundles of originally different sizes, and potential reuse of same instruction multiple times. Are the four xor's first bundled together, and then broken into two adjacent bundles of MinSize=2?
228	ditto.
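
The bundle-size question raised at line 1479 can be made concrete with a small sketch that compares the rule in the patch as posted (every opcode group has a power-of-two size) with the relaxed rule suggested in the comment. The function names are hypothetical and the sketch works on group sizes only; it is not code from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

bool isPowerOfTwo(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }

// Rule in the patch as posted: every group size must be a power of two.
bool sizesOkStrict(const std::vector<unsigned> &Sizes) {
  return !Sizes.empty() &&
         std::all_of(Sizes.begin(), Sizes.end(), isPowerOfTwo);
}

// Relaxed rule from the review comment: only the smallest group size must be
// a power of two, and every group must split evenly into slices of that size.
bool sizesOkRelaxed(const std::vector<unsigned> &Sizes) {
  if (Sizes.empty())
    return false;
  unsigned MinSize = *std::min_element(Sizes.begin(), Sizes.end());
  if (!isPowerOfTwo(MinSize))
    return false;
  return std::all_of(Sizes.begin(), Sizes.end(),
                     [MinSize](unsigned S) { return S % MinSize == 0; });
}
```

With sizes 2 and 6, the strict rule rejects the bundle while the relaxed rule accepts it, which is the case the comment calls out.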

Addressed first round of comments from Alexey and Ayal. Thanks again for the feedback! I'll respond to Ayal's most recent comments in a separate update.

mssimpso added inline comments. May 1 2018, 10:33 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1491	Using `const std::pair<unsigned, BoUpSLP::ValueList> &`
2494	I'm not sure we can do this. For the transpose case, since we don't know what the operands are (they will be shuffles), I've left `Operands` empty. If we instead pre-reserve space, we'll end up sending a vector of null pointers to `getArithmeticInstrCost`, which will probably break.

mssimpso added inline comments. May 1 2018, 11:12 AM

test/Transforms/SLPVectorizer/AArch64/transpose.ll
93	Ah, I see your point now. I think the current patch will actually produce incorrect code if you make the above change to the test, since we don't actually enforce the "interleaved order". We'll still use the <0,4,2,6> and <1,5,3,7> masks instead of the stride-2 masks. So I think we should probably record what the mask should be when we do the re-bundling to allow for the various possibilities. It makes sense that we would have to do this in hindsight. We can also probably get rid of the MinSize restriction at the same time. I'd also want to test the mask against the known shuffle kinds for the cost calculation to ensure we are computing the most appropriate cost for the target. I'm actually surprised we don't already have something like `TTI->getShuffleKind(ArrayRef<int> Mask)`. Perhaps I'll work on that first.
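
To make the idea of testing a recorded mask against known shuffle shapes more concrete, here is a minimal standalone predicate. `isTransposeMask` is a hypothetical helper written for this illustration, not an existing LLVM or TTI interface; it only recognizes the two mask forms documented for `createTransposeMask` in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: true if Mask is <0, VF, 2, VF + 2, ...> or
// <1, VF + 1, 3, VF + 3, ...>, where VF is the mask length.
bool isTransposeMask(const std::vector<unsigned> &Mask) {
  const std::size_t VF = Mask.size();
  if (VF < 2 || VF % 2 != 0)
    return false;
  if (Mask[0] != 0 && Mask[0] != 1)
    return false;
  for (std::size_t I = 0; I < VF; I += 2) {
    if (Mask[I] != Mask[0] + I)  // lane taken from the first source
      return false;
    if (Mask[I + 1] != Mask[I] + VF)  // matching lane of the second source
      return false;
  }
  return true;
}
```

A stride-2 mask such as <0, 2, 4, 6> is rejected by this predicate, so a cost query keyed on it would distinguish the two orderings discussed in this thread.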

Ayal added inline comments. May 1 2018, 12:31 PM

test/Transforms/SLPVectorizer/AArch64/transpose.ll
93	Perhaps the initial natural order, going bottom-up, is the 2 x n matrix transpose, i.e., the even and odd stride-2 masks. This could be further optimized, by considering the orderings preferred by subsequent leaves up the tree, similar to Alexey's `NumOpsWantToKeepOrder`; but here one could pass through several transposings before reaching the leaves... (btw, any worthwhile workloads driving this?) After choosing the `bestOrder()` = the one wanted by most ops, `SK_PermuteSingleSrc` shuffle costs are used. LV uses `getInterleavedMemoryOpCost*()` to compute the cost of its strided shuffles. And `isShuffle()` may also provide inspiration ;-).

Revision Contents

Path                                                  Size
include/llvm/Analysis/VectorUtils.h                   18 lines
lib/Analysis/VectorUtils.cpp                          11 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp            314 lines
test/Transforms/SLPVectorizer/AArch64/transpose.ll    198 lines

Diff 144743

include/llvm/Analysis/VectorUtils.h

[...]
/// <Start, Start + 1, ... Start + NumInts - 1, undef_1, ... undef_NumUndefs>
///
/// For example, the mask for Start = 0, NumInsts = 4, and NumUndefs = 4 is:
///
/// <0, 1, 2, 3, undef, undef, undef, undef>
Constant *createSequentialMask(IRBuilder<> &Builder, unsigned Start,
                               unsigned NumInts, unsigned NumUndefs);

/// Create a transpose shuffle mask.
///
/// This function creates a shuffle mask useful for transposing a 2xn matrix.
/// The mask reads corresponding even- or odd-numbered vector elements from two
/// n-dimensional source vectors and writes each result into consecutive
/// elements of an n-dimensional destination vector. The elements of the mask
/// begin with either zero or one depending on the value of \p IsOdd, and are
/// of the form:
///
///   <0, VF + 0, 2, VF + 2, ..., VF - 2, 2 * VF - 2>, for IsOdd = false, and
///   <1, VF + 1, 3, VF + 3, ..., VF - 1, 2 * VF - 1>, for IsOdd = true.
///
/// For example, the mask for VF = 4 is:
///
///   <0, 4, 2, 6>, for IsOdd = false, and
///   <1, 5, 3, 7>, for IsOdd = true
Constant *createTransposeMask(IRBuilder<> &Builder, unsigned VF, bool IsOdd);

/// Concatenate a list of vectors.
///
/// This function generates code that concatenate the vectors in \p Vecs into a
/// single large vector. The number of vectors should be greater than one, and
/// their element types should be the same. The number of elements in the
/// vectors should also be the same; however, if the last vector has fewer
/// elements, it will be padded with undefs.
Value *concatenateVectors(IRBuilder<> &Builder, ArrayRef<Value *> Vecs);

} // llvm namespace

#endif

lib/Analysis/VectorUtils.cpp

[...]
Constant *llvm::createSequentialMask(IRBuilder<> &Builder, unsigned Start,

  Constant *Undef = UndefValue::get(Builder.getInt32Ty());
  for (unsigned i = 0; i < NumUndefs; i++)
    Mask.push_back(Undef);

  return ConstantVector::get(Mask);
}

Constant *llvm::createTransposeMask(IRBuilder<> &Builder, unsigned VF,
                                    bool IsOdd) {
  SmallVector<Constant *, 16> Mask;
  unsigned StartVal = IsOdd ? 1 : 0;
  for (unsigned I = 0; I < VF; I += 2)
    for (unsigned J = 0; J < 2; ++J)
      Mask.push_back(Builder.getInt32(StartVal + I + VF * J));

  return ConstantVector::get(Mask);
}

/// A helper function for concatenating vectors. This function concatenates two
/// vectors having the same element type. If the second vector has fewer
/// elements than the first, it is padded with undefs.
static Value *concatenateTwoVectors(IRBuilder<> &Builder, Value *V1,
                                    Value *V2) {
  VectorType *VecTy1 = dyn_cast<VectorType>(V1->getType());
  VectorType *VecTy2 = dyn_cast<VectorType>(V2->getType());
  assert(VecTy1 && VecTy2 &&
[...]

lib/Transforms/Vectorize/SLPVectorizer.cpp

[...]
struct TreeEntry {
    /// to be a pointer and needs to be able to initialize the child iterator.
    /// Thus we need a reference back to the container to translate the indices
    /// to entries.
    std::vector<TreeEntry> &Container;

    /// The TreeEntry index containing the user of this entry. We can actually
    /// have multiple users so the data structure is not truly a tree.
    SmallVector<int, 1> UserTreeIndices;

    /// Have the operands of this tree entry been transposed to enable
    /// vectorization?
    SmallVector<ValueList, 4> TransposedOperands;
  };

  /// Create a new VectorizableTree entry.
  void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,
                    ArrayRef<unsigned> ReuseShuffleIndices = None,
                    ArrayRef<unsigned> ReorderIndices = None,
                    ArrayRef<ValueList> TransposedOperands = None) {
    VectorizableTree.emplace_back(VectorizableTree);
    int idx = VectorizableTree.size() - 1;
    TreeEntry *Last = &VectorizableTree[idx];
    Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
    Last->NeedToGather = !Vectorized;
    Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
                                     ReuseShuffleIndices.end());
    Last->ReorderIndices = ReorderIndices;
    Last->TransposedOperands.append(TransposedOperands.begin(),
                                    TransposedOperands.end());
    if (Vectorized) {
      for (int i = 0, e = VL.size(); i != e; ++i) {
        assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
        ScalarToTreeEntry[VL[i]] = idx;
      }
    } else {
      MustGather.insert(VL.begin(), VL.end());
    }
[...]
static std::string getNodeAttributes(const TreeEntry *Entry,
    if (Entry->NeedToGather)
      return "color=red";
    return "";
  }
};

} // end namespace llvm

/// Try and transpose a bundle of binary operations if doing so would
/// enable vectorization.
///
/// Given a bundle of isomorphic binary operations, we normally try to
/// vectorize their operands by bundling all operand-zero values together and
/// bundling all operand-one values together. Thus, assuming all operands are
/// instructions, each operand bundle must also be isomorphic. That is, all
/// operand-zero values must have the same opcode and all operand-one values
/// must have the same opcode. For example:
///
///   ; Original expression: Both operand bundles have the same opcode (the
///   ; operand-zero bundle is an 'add' and the operand-one bundle is a 'sub').
///   ; The operations are vectorizable.
///   add (add s0 s1) (sub s2, s3)
///   add (add s4 s5) (sub s6, s7)
///
///   ; Vectorized IR (simplified)
///   %a = add <s0, s4>, <s1, s5> ; = <s0 + s1, s4 + s5>
///   %b = sub <s2, s6>, <s3, s7> ; = <s2 - s3, s6 - s7>
///   %c = add %a, %b
///
/// However, if an operand bundle is nonisomorphic (i.e., the instructions in
/// the bundle have more than one opcode), it is not vectorizable. If this is
/// the case, we can try to "transpose" the operands such that each resulting
/// bundle will have a single opcode. For example:
///
///   ; Neither operand bundle is vectorizable because they each contain more
///   ; than one opcode ('add' and 'sub'). However, we can transpose the
///   ; bundles to enable vectorization.
///   add (add s0 s1) (add s2, s3)
///   add (sub s4 s5) (sub s6, s7)
///
///   ; Vectorized IR (simplified)
///   %a = add <s0, s2>, <s1, s3> ; = <s0 + s1, s2 + s3>
///   %b = sub <s4, s6>, <s5, s7> ; = <s4 - s5, s6 - s7>
///   %c = shufflevector %a, %b, <0 2> ; = <s0 + s1, s4 - s5>
///   %d = shufflevector %a, %b, <1 3> ; = <s2 + s3, s6 - s7>
///   %e = add %c, %d
///
/// When we transpose the operands, we bundle together all operands of one or
/// more instructions, rather than bundling together a single operand (e.g.,
/// operand-zero) of several instructions. Conceptually, this is a row-wise
/// bundle instead of a column-wise bundle, hence the name "transpose". If a
/// transpose is performed while building a vectorizable tree, we must
/// re-transpose the vectors at run-time using shufflevector instructions to
/// restore the correct bundling.
///
/// This function determines if a transpose is needed for a given bundle, and
/// if so, returns a list of isomorphic "operand" bundles.
static Optional<SmallVector<BoUpSLP::ValueList, 4>>
transposeBinOpBundle(ArrayRef<Value *> VL) {
  assert(!VL.empty() && isa<BinaryOperator>(VL[0]) && "No binary operator");
  assert(getSameOpcode(VL).Opcode && "No single opcode");
  assert(!getSameOpcode(VL).IsAltShuffle && "Has alternate opcode");

  if (VL.size() < 2)
    return None;

  // Check that for each instruction in the given bundle, its operands are also
  // instructions having identical opcodes.
  if (any_of(VL, [](Value *V) {
        auto *VLI = cast<BinaryOperator>(V);
        auto *VLIOp0 = dyn_cast<Instruction>(VLI->getOperand(0));
        auto *VLIOp1 = dyn_cast<Instruction>(VLI->getOperand(1));
        return !VLIOp0 || !VLIOp1 || VLIOp0->getOpcode() != VLIOp1->getOpcode();
      }))
    return None;

  auto *VL0 = cast<BinaryOperator>(VL[0]);
  auto *VL0Op0 = cast<Instruction>(VL0->getOperand(0));
  auto *VL0Op1 = cast<Instruction>(VL0->getOperand(1));

  // Check if the operand-zero bundle or the operand-one bundle is isomorphic.
  // If this is the case, a transpose is unnecessary since at least one of the
  // bundles is already vectorizable.
  bool Op0IsIsomorphic = true;
  bool Op1IsIsomorphic = true;
  for (Value *V : VL.drop_front()) {
    auto *VLIOp0 = cast<Instruction>(cast<Instruction>(V)->getOperand(0));
    auto *VLIOp1 = cast<Instruction>(cast<Instruction>(V)->getOperand(1));
    Op0IsIsomorphic &= VLIOp0->getOpcode() == VL0Op0->getOpcode();
    Op1IsIsomorphic &= VLIOp1->getOpcode() == VL0Op1->getOpcode();
    if (!Op0IsIsomorphic && !Op1IsIsomorphic)
      break;
  }
  if (Op0IsIsomorphic || Op1IsIsomorphic)
    return None;

  // Group the operands according to their opcode. If the size of any group is
  // not a power of two, we aren't going to transpose. Note that we've already
  // shown that both operands of a given instruction have the same opcode.
  MapVector<unsigned, BoUpSLP::ValueList> Opcode2Operands;
  for (Value *V : VL) {
    auto *VLIOp0 = cast<Instruction>(cast<Instruction>(V)->getOperand(0));
    auto *VLIOp1 = cast<Instruction>(cast<Instruction>(V)->getOperand(1));
    unsigned SameOpcode = VLIOp0->getOpcode();
    assert(VLIOp1->getOpcode() == SameOpcode &&
           "Operands should have the same opcode");
    Opcode2Operands[SameOpcode].push_back(VLIOp0);
    Opcode2Operands[SameOpcode].push_back(VLIOp1);
  }
  if (any_of(Opcode2Operands,
             [](std::pair<unsigned, BoUpSLP::ValueList> &Entry) {
               return !isPowerOf2_32(Entry.second.size());
             }))
    return None;

  // Find the size of the smallest isomorphic bundle. We will partition larger
  // bundles so that each bundle is the same size.
  unsigned MinSize =
      std::min_element(Opcode2Operands.begin(), Opcode2Operands.end(),
                       [](std::pair<unsigned, BoUpSLP::ValueList> &Entry1,
                          std::pair<unsigned, BoUpSLP::ValueList> &Entry2) {
                         return Entry1.second.size() < Entry2.second.size();
                       })
          ->second.size();

  // Collect the final bundles while ensuring that each bundle contains MinSize
  // values.
  SmallVector<BoUpSLP::ValueList, 4> Bundles;
  for (const std::pair<unsigned, BoUpSLP::ValueList> &Entry : Opcode2Operands) {
    ArrayRef<Value *> Bundle(Entry.second);
    while (Bundle.size() >= MinSize) {
      ArrayRef<Value *> Slice = Bundle.take_front(MinSize);
      Bundles.emplace_back(Slice.begin(), Slice.end());
      Bundle = Bundle.drop_front(MinSize);
    }
  }

  return Bundles;
}
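
The grouping and slicing steps at the end of `transposeBinOpBundle` can be pictured with a standalone sketch that uses strings in place of LLVM values. All names below (`Operand`, `Bundle`, `groupAndSlice`) are hypothetical, and an empty result stands in for the `None` return of the real function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Bundle = std::vector<std::string>;
using Operand = std::pair<std::string, std::string>; // (opcode, value name)

bool isPowerOfTwoSize(std::size_t N) { return N != 0 && (N & (N - 1)) == 0; }

// Returns the transposed bundles, or an empty list if no transpose applies.
std::vector<Bundle> groupAndSlice(const std::vector<Operand> &Operands) {
  // Group by opcode, keeping first-seen order (the role MapVector plays).
  std::vector<std::pair<std::string, Bundle>> Groups;
  for (const Operand &Op : Operands) {
    auto It = std::find_if(Groups.begin(), Groups.end(),
                           [&Op](const std::pair<std::string, Bundle> &G) {
                             return G.first == Op.first;
                           });
    if (It == Groups.end()) {
      Groups.emplace_back(Op.first, Bundle{});
      It = Groups.end() - 1;
    }
    It->second.push_back(Op.second);
  }
  if (Groups.empty())
    return {};

  // Reject any group whose size is not a power of two; track the smallest.
  std::size_t MinSize = Groups.front().second.size();
  for (const auto &G : Groups) {
    if (!isPowerOfTwoSize(G.second.size()))
      return {};
    MinSize = std::min(MinSize, G.second.size());
  }

  // Cut every group into consecutive slices of MinSize values.
  std::vector<Bundle> Bundles;
  for (const auto &G : Groups)
    for (std::size_t I = 0; I + MinSize <= G.second.size(); I += MinSize)
      Bundles.emplace_back(G.second.begin() + I,
                           G.second.begin() + I + MinSize);
  return Bundles;
}
```

For the example in the function comment, the operands of the first instruction are both `add` and those of the second are both `sub`, so the sketch yields one `add` bundle and one `sub` bundle of two values each.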

/// Compute the cost of the shufflevector instructions needed to re-transpose
/// binary operator bundles at run-time.
///
/// Unlike a normal binary operator bundle having two input bundles (one for
/// each operand), a transposed binary operator bundle can have more than two
/// (i.e., \p NumBundles) input bundles. We generate code for the transpose by
/// concatenating the transposed input bundles together, and then shuffling out
/// vectors corresponding to the operand-zero and operand-one bundles of the
/// binary operator. So we have two kinds of shuffles to account for. We use
/// the SK_InsertSubvector ShuffleKind for concatenating vectors, and the
/// SK_Transpose ShuffleKind for the transpose.
static unsigned getTransposedBinOpShuffleCost(TargetTransformInfo *TTI,
                                              unsigned BundleVF,
                                              unsigned NumBundles,
                                              VectorType *VecTy) {
  assert(isPowerOf2_32(NumBundles) && "NumBundles is not a power of 2");

  // Compute the cost of concatenating NumBundles together, where each vector
  // initially contains BundleVF elements. The result of the concatenation is
  // two vectors with type VecTy that will be used as inputs to the
  // shufflevector instructions performing the actual transpose. We assume the
  // concatenation will be performed in ~log2(NumBundles) steps.
  unsigned Cost = 0;
  for (unsigned NumDstVecs = NumBundles / 2, ElemsPerSrcVec = BundleVF;
       NumDstVecs >= 2; NumDstVecs /= 2, ElemsPerSrcVec *= 2) {
    Type *SrcTy = VectorType::get(VecTy->getElementType(), ElemsPerSrcVec);
    Type *DstTy = VectorType::get(VecTy->getElementType(), 2 * ElemsPerSrcVec);
    Cost += NumDstVecs *
            TTI->getShuffleCost(TargetTransformInfo::SK_InsertSubvector, DstTy,
                                ElemsPerSrcVec, SrcTy);
  }

  // Compute the cost of the transpose. The transpose requires two
  // shufflevector instructions: one to compute the operand-zero bundle and
  // another to compute the operand-one bundle.
  Cost += 2 * TTI->getShuffleCost(TargetTransformInfo::SK_Transpose, VecTy);

  return Cost;
}
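
The shape of this cost computation can be seen in a simplified sketch that replaces the TTI queries with flat per-shuffle costs. `ShuffleCosts` and `transposedShuffleCost` are hypothetical names, and using one fixed cost per shuffle kind is an assumption made for illustration; in the patch each query depends on the vector types at that concatenation step.

```cpp
#include <cassert>

// Hypothetical flat costs standing in for the TTI->getShuffleCost queries.
struct ShuffleCosts {
  unsigned InsertSubvector; // one concatenation shuffle
  unsigned Transpose;       // one transposing shuffle
};

unsigned transposedShuffleCost(unsigned NumBundles, const ShuffleCosts &Costs) {
  unsigned Cost = 0;
  // Concatenate pairs of vectors until only two vectors remain. Each step
  // produces NumDstVecs destination vectors, each charged one insert.
  for (unsigned NumDstVecs = NumBundles / 2; NumDstVecs >= 2; NumDstVecs /= 2)
    Cost += NumDstVecs * Costs.InsertSubvector;
  // Two transposing shuffles: one per operand bundle of the binary operator.
  Cost += 2 * Costs.Transpose;
  return Cost;
}
```

With two bundles the loop body never runs, so only the two transposing shuffles are charged; each doubling of the bundle count adds one more concatenation level.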

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
                        ArrayRef<Value *> UserIgnoreLst) {
  ExtraValueToDebugLocsMap ExternallyUsedValues;
  buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
}

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
                        ExtraValueToDebugLocsMap &ExternallyUsedValues,
[...]
  switch (ShuffleOrOp) {
    case Instruction::SRem:
    case Instruction::FRem:
    case Instruction::Shl:
    case Instruction::LShr:
    case Instruction::AShr:
    case Instruction::And:
    case Instruction::Or:
    case Instruction::Xor:
      newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);
      DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

		// If the operand bundles of this binary operator are not vectorizable,
		// determine if we can transpose them, and if so, continue building the
		// tree with the transposed bundles. That there is an instruction bundle
		AyalUnsubmitted Done Reply Inline Actions Comment that transposing operands each having a common opcode is mutually exclusive with swapping commutative operands below, and should precede it? Ayal: Comment that transposing operands each having a common opcode is mutually exclusive with…
		mssimpsoAuthorUnsubmitted Not Done Reply Inline Actions Yes, that's a good idea. mssimpso: Yes, that's a good idea.
		// for which we can compute a number of transposable operand bundles,
		// implies that the two operands of each instruction in the bundle have
		ABataevUnsubmitted Done Reply Inline Actions `auto`->`Optional<SmallVector<BoUpSLP::ValueList, 4>>` ABataev: `auto`->`Optional<SmallVector<BoUpSLP::ValueList, 4>>`
		// the same opcode. This is mutually exclusive with swapping commutative
		ABataev: `auto &`->`const auto &`
		// operands, which we attempt to do next.
		if (isa<BinaryOperator>(VL0))
		if (Optional<SmallVector<BoUpSLP::ValueList, 4>> TransposedBundles =
		transposeBinOpBundle(VL)) {
		newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies,
		/*ReorderIndices=*/None, *TransposedBundles);
		for (const auto &Bundle : *TransposedBundles)
		buildTree_rec(Bundle, Depth + 1, UserTreeIdx);
		return;
		}

		newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(S.Opcode, VL, Left, Right);		reorderInputsAccordingToOpcode(S.Opcode, VL, Left, Right);
buildTree_rec(Left, Depth + 1, UserTreeIdx);		buildTree_rec(Left, Depth + 1, UserTreeIdx);
buildTree_rec(Right, Depth + 1, UserTreeIdx);		buildTree_rec(Right, Depth + 1, UserTreeIdx);
return;		return;
[...]		case Instruction::Xor: {
TargetTransformInfo::OK_AnyValue;		TargetTransformInfo::OK_AnyValue;
TargetTransformInfo::OperandValueKind Op2VK =		TargetTransformInfo::OperandValueKind Op2VK =
TargetTransformInfo::OK_UniformConstantValue;		TargetTransformInfo::OK_UniformConstantValue;
TargetTransformInfo::OperandValueProperties Op1VP =		TargetTransformInfo::OperandValueProperties Op1VP =
TargetTransformInfo::OP_None;		TargetTransformInfo::OP_None;
TargetTransformInfo::OperandValueProperties Op2VP =		TargetTransformInfo::OperandValueProperties Op2VP =
TargetTransformInfo::OP_None;		TargetTransformInfo::OP_None;

		unsigned TransposeShuffleCost = 0;
		SmallVector<const Value *, 4> Operands;
		ABataev: Pre-reserve the space for the Operands here.
		mssimpso (author): I'm not sure we can do this. For the transpose case, since we don't know what the operands are (they will be shuffles), I've left `Operands` empty. If we instead pre-reserve space, we'll end up sending a vector of null pointers to `getArithmeticInstrCost`, which will probably break.

		ABataev: Maybe it is worth to store the Transposed bundles in the `TreeNode` to not perform this kind of analysis several times?
		mssimpso (author): Yes, that was my thought as well. I'll update the patch.
		if (!E->TransposedOperands.empty()) {
		// If this binary operator's operand bundles can be transposed, we need
		// to account for the shufflevector instructions that will perform the
		// transpose operation.
		TransposeShuffleCost =
		getTransposedBinOpShuffleCost(TTI, E->TransposedOperands[0].size(),
		E->TransposedOperands.size(), VecTy);
		} else {
// If all operands are exactly the same ConstantInt then set the		// If all operands are exactly the same ConstantInt then set the
// operand kind to OK_UniformConstantValue.		// operand kind to OK_UniformConstantValue.
// If instead not all operands are constants, then set the operand kind		// If instead not all operands are constants, then set the operand kind
// to OK_AnyValue. If all operands are constants but not the same,		// to OK_AnyValue. If all operands are constants but not the same,
// then set the operand kind to OK_NonUniformConstantValue.		// then set the operand kind to OK_NonUniformConstantValue.
ConstantInt *CInt = nullptr;		ConstantInt *CInt = nullptr;
for (unsigned i = 0; i < VL.size(); ++i) {		for (unsigned i = 0; i < VL.size(); ++i) {
const Instruction *I = cast<Instruction>(VL[i]);		const Instruction *I = cast<Instruction>(VL[i]);
if (!isa<ConstantInt>(I->getOperand(1))) {		if (!isa<ConstantInt>(I->getOperand(1))) {
Op2VK = TargetTransformInfo::OK_AnyValue;		Op2VK = TargetTransformInfo::OK_AnyValue;
break;		break;
}		}
if (i == 0) {		if (i == 0) {
CInt = cast<ConstantInt>(I->getOperand(1));		CInt = cast<ConstantInt>(I->getOperand(1));
continue;		continue;
}		}
if (Op2VK == TargetTransformInfo::OK_UniformConstantValue &&		if (Op2VK == TargetTransformInfo::OK_UniformConstantValue &&
CInt != cast<ConstantInt>(I->getOperand(1)))		CInt != cast<ConstantInt>(I->getOperand(1)))
Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;		Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
}		}
// FIXME: Currently cost of model modification for division by power of		// FIXME: Currently cost of model modification for division by power of
// 2 is handled for X86 and AArch64. Add support for other targets.		// 2 is handled for X86 and AArch64. Add support for other targets.
if (Op2VK == TargetTransformInfo::OK_UniformConstantValue && CInt &&		if (Op2VK == TargetTransformInfo::OK_UniformConstantValue && CInt &&
CInt->getValue().isPowerOf2())		CInt->getValue().isPowerOf2())
Op2VP = TargetTransformInfo::OP_PowerOf2;		Op2VP = TargetTransformInfo::OP_PowerOf2;

SmallVector<const Value *, 4> Operands(VL0->operand_values());		Operands.append(VL0->value_op_begin(), VL0->value_op_end());
		}

if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost -=		ReuseShuffleCost -=
(ReuseShuffleNumbers - VL.size()) *		(ReuseShuffleNumbers - VL.size()) *
TTI->getArithmeticInstrCost(S.Opcode, ScalarTy, Op1VK, Op2VK, Op1VP,		TTI->getArithmeticInstrCost(S.Opcode, ScalarTy, Op1VK, Op2VK, Op1VP,
Op2VP, Operands);		Op2VP, Operands);
}		}
int ScalarCost =		int ScalarCost =
VecTy->getNumElements() *		VecTy->getNumElements() *
TTI->getArithmeticInstrCost(S.Opcode, ScalarTy, Op1VK, Op2VK, Op1VP,		TTI->getArithmeticInstrCost(S.Opcode, ScalarTy, Op1VK, Op2VK, Op1VP,
Op2VP, Operands);		Op2VP, Operands);
int VecCost = TTI->getArithmeticInstrCost(S.Opcode, VecTy, Op1VK, Op2VK,		int VecCost = TTI->getArithmeticInstrCost(S.Opcode, VecTy, Op1VK, Op2VK,
Op1VP, Op2VP, Operands);		Op1VP, Op2VP, Operands);
return ReuseShuffleCost + VecCost - ScalarCost;		return ReuseShuffleCost + TransposeShuffleCost + VecCost - ScalarCost;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
TargetTransformInfo::OperandValueKind Op1VK =		TargetTransformInfo::OperandValueKind Op1VK =
TargetTransformInfo::OK_AnyValue;		TargetTransformInfo::OK_AnyValue;
TargetTransformInfo::OperandValueKind Op2VK =		TargetTransformInfo::OperandValueKind Op2VK =
TargetTransformInfo::OK_UniformConstantValue;		TargetTransformInfo::OK_UniformConstantValue;

if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
[...]		switch (ShuffleOrOp) {
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
		setInsertPointAfterBundle(E->Scalars, VL0);
		Value *LHS = nullptr;
		Value *RHS = nullptr;
		if (!E->TransposedOperands.empty()) {
		// If this binary operator's operand bundles can be transposed,
		// generate the shufflevector instructions that will perform the
		// transpose operation.
		//
		// First, vectorize the transposed bundles and concatenate them
		// together to form a single vector.
		ValueList VectorizedValues;
		for (const ValueList &Bundle : E->TransposedOperands)
		VectorizedValues.push_back(vectorizeTree(Bundle));
		Value *Merged = concatenateVectors(Builder, VectorizedValues);

		// Next, perform the actual transpose by shuffling out the operand-zero
		// and operand-one vectors using even and odd transpose masks.
		Value *Undef = UndefValue::get(Merged->getType());
		unsigned VF = cast<VectorType>(Merged->getType())->getNumElements() / 2;
		Constant *LHSMask = createTransposeMask(Builder, VF, /*IsOdd=*/false);
		Constant *RHSMask = createTransposeMask(Builder, VF, /*IsOdd=*/true);
		LHS = Builder.CreateShuffleVector(Merged, Undef, LHSMask);
		RHS = Builder.CreateShuffleVector(Merged, Undef, RHSMask);
		} else {
ValueList LHSVL, RHSVL;		ValueList LHSVL, RHSVL;
if (isa<BinaryOperator>(VL0) && VL0->isCommutative())		if (VL0->isCommutative()) {
reorderInputsAccordingToOpcode(S.Opcode, E->Scalars, LHSVL,		reorderInputsAccordingToOpcode(S.Opcode, E->Scalars, LHSVL, RHSVL);
RHSVL);		} else {
else
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
auto *I = cast<Instruction>(V);		auto *I = cast<Instruction>(V);
LHSVL.push_back(I->getOperand(0));		LHSVL.push_back(I->getOperand(0));
RHSVL.push_back(I->getOperand(1));		RHSVL.push_back(I->getOperand(1));
}		}
		}
setInsertPointAfterBundle(E->Scalars, VL0);		LHS = vectorizeTree(LHSVL);
		RHS = vectorizeTree(RHSVL);
Value *LHS = vectorizeTree(LHSVL);		}
Value *RHS = vectorizeTree(RHSVL);

if (E->VectorizedValue) {		if (E->VectorizedValue) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Value *V = Builder.CreateBinOp(		Value *V = Builder.CreateBinOp(
static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);		static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);
[...]
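
The even and odd transpose masks used above can be sketched in isolation. This is a minimal illustration, assuming the mask layout implied by the CHECK lines in transpose.ll (for example `<0, 4, 2, 6>` and `<1, 5, 3, 7>` for four lanes). The helper name `transposeMask` is hypothetical and is not the patch's `createTransposeMask`, whose body is not shown here. Indices 0..VF-1 name lanes of the first source and VF..2*VF-1 name lanes of the second, which is also how the indices fall out when the mask is applied to the concatenated `Merged` vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (an assumption for illustration, not the patch's code):
// builds a shuffle mask that picks the even- or odd-numbered lanes pairwise
// from two VF-wide sources and writes them to consecutive result lanes.
std::vector<int> transposeMask(unsigned VF, bool IsOdd) {
  assert(VF >= 2 && VF % 2 == 0 && "expects an even number of lanes");
  std::vector<int> Mask;
  Mask.reserve(VF);
  for (unsigned I = 0; I < VF / 2; ++I) {
    Mask.push_back(2 * I + IsOdd);      // lane taken from the first source
    Mask.push_back(2 * I + IsOdd + VF); // matching lane of the second source
  }
  return Mask;
}
```

For VF = 2 this yields `<0, 2>` and `<1, 3>`, the masks seen in the `<2 x i64>` tests.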

test/Transforms/SLPVectorizer/AArch64/transpose.ll

; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i64> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i64> %v0, i32 1		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> %v0, %v1
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i64> %v1, i32 0		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i64> %v1, i32 1		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[TMP0_0:%.*]] = add i64 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i64> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i64 [[V0_1]], [[V1_1]]		; CHECK-NEXT: ret <2 x i64> [[TMP5]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i64 [[V0_0]], [[V1_0]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i64 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP2_0:%.*]] = add i64 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP2_1:%.*]] = add i64 [[TMP1_0]], [[TMP1_1]]
; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <2 x i64> undef, i64 [[TMP2_0]], i32 0
; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <2 x i64> [[TMP3_0]], i64 [[TMP2_1]], i32 1
; CHECK-NEXT: ret <2 x i64> [[TMP3_1]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
%tmp1.1 = sub i64 %v0.1, %v1.1		%tmp1.1 = sub i64 %v0.1, %v1.1
%tmp2.0 = add i64 %tmp0.0, %tmp0.1		%tmp2.0 = add i64 %tmp0.0, %tmp0.1
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
%tmp3.0 = insertelement <2 x i64> undef, i64 %tmp2.0, i32 0		%tmp3.0 = insertelement <2 x i64> undef, i64 %tmp2.0, i32 0
%tmp3.1 = insertelement <2 x i64> %tmp3.0, i64 %tmp2.1, i32 1		%tmp3.1 = insertelement <2 x i64> %tmp3.0, i64 %tmp2.1, i32 1
ret <2 x i64> %tmp3.1		ret <2 x i64> %tmp3.1
}		}
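
As a sanity check on the expected output for `@build_vec_v2i64`, the scalar input and the vectorized form (add, sub, two shuffles, add) can be compared outside of LLVM. This is only a sketch: `shuffle2`, `scalarForm` and `vectorForm` are hypothetical names, and `uint64_t` stands in for wrapping `i64` arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V2 = std::array<uint64_t, 2>;

// Emulates shufflevector on two 2-lane sources: mask values 0..1 select
// from A, values 2..3 select from B.
V2 shuffle2(const V2 &A, const V2 &B, const std::array<int, 2> &Mask) {
  V2 R{};
  for (int I = 0; I < 2; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 4);
    R[I] = Mask[I] < 2 ? A[Mask[I]] : B[Mask[I] - 2];
  }
  return R;
}

// The scalar input of the test: lane 0 sums the adds, lane 1 sums the subs.
V2 scalarForm(const V2 &V0, const V2 &V1) {
  uint64_t Add0 = V0[0] + V1[0], Add1 = V0[1] + V1[1];
  uint64_t Sub0 = V0[0] - V1[0], Sub1 = V0[1] - V1[1];
  return {Add0 + Add1, Sub0 + Sub1};
}

// The expected vectorized output: transpose the add/sub vectors, then add.
V2 vectorForm(const V2 &V0, const V2 &V1) {
  V2 Add = {V0[0] + V1[0], V0[1] + V1[1]};
  V2 Sub = {V0[0] - V1[0], V0[1] - V1[1]};
  V2 LHS = shuffle2(Add, Sub, {0, 2}); // even lanes of both sources
  V2 RHS = shuffle2(Add, Sub, {1, 3}); // odd lanes of both sources
  return {LHS[0] + RHS[0], LHS[1] + RHS[1]};
}
```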

define void @store_chain_v2i64(i64* %a, i64* %b, i64* %c) {		define void @store_chain_v2i64(i64* %a, i64* %b, i64* %c) {
; CHECK-LABEL: @store_chain_v2i64(		; CHECK-LABEL: @store_chain_v2i64(
; CHECK-NEXT: [[A_1:%.*]] = getelementptr i64, i64* %a, i64 1		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i64* %a to <2 x i64>*
; CHECK-NEXT: [[B_1:%.*]] = getelementptr i64, i64* %b, i64 1		; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 8
; CHECK-NEXT: [[C_1:%.*]] = getelementptr i64, i64* %c, i64 1		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i64* %b to <2 x i64>*
; CHECK-NEXT: [[V0_0:%.*]] = load i64, i64* %a, align 8		; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i64>, <2 x i64>* [[TMP3]], align 8
; CHECK-NEXT: [[V0_1:%.*]] = load i64, i64* [[A_1]], align 8		; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i64> [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[V1_0:%.*]] = load i64, i64* %b, align 8		; CHECK-NEXT: [[TMP6:%.*]] = sub <2 x i64> [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[V1_1:%.*]] = load i64, i64* [[B_1]], align 8		; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x i64> [[TMP5]], <2 x i64> [[TMP6]], <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[TMP0_0:%.*]] = add i64 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP5]], <2 x i64> [[TMP6]], <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[TMP0_1:%.*]] = add i64 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP7]], [[TMP8]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i64 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP10:%.*]] = bitcast i64* %c to <2 x i64>*
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i64 [[V0_1]], [[V1_1]]		; CHECK-NEXT: store <2 x i64> [[TMP9]], <2 x i64>* [[TMP10]], align 8
; CHECK-NEXT: [[TMP2_0:%.*]] = add i64 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP2_1:%.*]] = add i64 [[TMP1_0]], [[TMP1_1]]
; CHECK-NEXT: store i64 [[TMP2_0]], i64* %c, align 8
; CHECK-NEXT: store i64 [[TMP2_1]], i64* [[C_1]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%a.0 = getelementptr i64, i64* %a, i64 0		%a.0 = getelementptr i64, i64* %a, i64 0
%a.1 = getelementptr i64, i64* %a, i64 1		%a.1 = getelementptr i64, i64* %a, i64 1
%b.0 = getelementptr i64, i64* %b, i64 0		%b.0 = getelementptr i64, i64* %b, i64 0
%b.1 = getelementptr i64, i64* %b, i64 1		%b.1 = getelementptr i64, i64* %b, i64 1
%c.0 = getelementptr i64, i64* %c, i64 0		%c.0 = getelementptr i64, i64* %c, i64 0
%c.1 = getelementptr i64, i64* %c, i64 1		%c.1 = getelementptr i64, i64* %c, i64 1
[...]
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <4 x i32> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <4 x i32> %v0, i32 1		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> %v0, %v1
; CHECK-NEXT: [[V0_2:%.*]] = extractelement <4 x i32> %v0, i32 2		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
; CHECK-NEXT: [[V0_3:%.*]] = extractelement <4 x i32> %v0, i32 3		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <4 x i32> %v1, i32 0		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <4 x i32> %v1, i32 1		; CHECK-NEXT: ret <4 x i32> [[TMP5]]
; CHECK-NEXT: [[V1_2:%.*]] = extractelement <4 x i32> %v1, i32 2
; CHECK-NEXT: [[V1_3:%.*]] = extractelement <4 x i32> %v1, i32 3
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP0_2:%.*]] = add i32 [[V0_2]], [[V1_2]]
; CHECK-NEXT: [[TMP0_3:%.*]] = add i32 [[V0_3]], [[V1_3]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[V0_0]], [[V1_0]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP1_2:%.*]] = sub i32 [[V0_2]], [[V1_2]]
; CHECK-NEXT: [[TMP1_3:%.*]] = sub i32 [[V0_3]], [[V1_3]]
; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]]
; CHECK-NEXT: [[TMP2_2:%.*]] = add i32 [[TMP0_2]], [[TMP0_3]]
; CHECK-NEXT: [[TMP2_3:%.*]] = add i32 [[TMP1_2]], [[TMP1_3]]
; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2_0]], i32 0
; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i32 1
; CHECK-NEXT: [[TMP3_2:%.*]] = insertelement <4 x i32> [[TMP3_1]], i32 [[TMP2_2]], i32 2
; CHECK-NEXT: [[TMP3_3:%.*]] = insertelement <4 x i32> [[TMP3_2]], i32 [[TMP2_3]], i32 3
; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
%v1.3 = extractelement <4 x i32> %v1, i32 3		%v1.3 = extractelement <4 x i32> %v1, i32 3
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp0.2 = add i32 %v0.2, %v1.2		%tmp0.2 = add i32 %v0.2, %v1.2
%tmp0.3 = add i32 %v0.3, %v1.3		%tmp0.3 = add i32 %v0.3, %v1.3
%tmp1.0 = sub i32 %v0.0, %v1.0		%tmp1.0 = sub i32 %v0.0, %v1.0
%tmp1.1 = sub i32 %v0.1, %v1.1		%tmp1.1 = sub i32 %v0.1, %v1.1
%tmp1.2 = sub i32 %v0.2, %v1.2		%tmp1.2 = sub i32 %v0.2, %v1.2
%tmp1.3 = sub i32 %v0.3, %v1.3		%tmp1.3 = sub i32 %v0.3, %v1.3
%tmp2.0 = add i32 %tmp0.0, %tmp0.1		%tmp2.0 = add i32 %tmp0.0, %tmp0.1
%tmp2.1 = add i32 %tmp1.0, %tmp1.1		%tmp2.1 = add i32 %tmp1.0, %tmp1.1
%tmp2.2 = add i32 %tmp0.2, %tmp0.3		%tmp2.2 = add i32 %tmp0.2, %tmp0.3
		Ayal: Indeed this calls for shuffling with <0,4,2,6> and <1,5,3,7> masks; but is this pattern more natural to expect than
		  %tmp2.1 = add i32 %tmp0.2, %tmp0.3
		  %tmp2.2 = add i32 %tmp1.0, %tmp1.1
		Perhaps in some complex numbers context(?)
		mssimpso (author): Ah, I see your point now. I think the current patch will actually produce incorrect code if you make the above change to the test, since we don't actually enforce the "interleaved order". We'll still use the <0,4,2,6> and <1,5,3,7> masks instead of the stride-2 masks. So I think we should probably record what the mask should be when we do the re-bundling to allow for the various possibilities. It makes sense that we would have to do this in hindsight. We can also probably get rid of the MinSize restriction at the same time. I'd also want to test the mask against the known shuffle kinds for the cost calculation to ensure we are computing the most appropriate cost for the target. I'm actually surprised we don't already have something like `TTI->getShuffleKind(ArrayRef<int> Mask)`. Perhaps I'll work on that first.
		Ayal: Perhaps the initial natural order, going bottom-up, is the 2 x n matrix transpose, i.e., the even and odd stride-2 masks. This could be further optimized, by considering the orderings preferred by subsequent leaves up the tree, similar to Alexey's `NumOpsWantToKeepOrder`; but here one could pass through several transposings before reaching the leaves... (btw, any worthwhile workloads driving this?) After choosing the `bestOrder()` = the one wanted by most ops, `SK_PermuteSingleSrc` shuffle costs are used. LV uses `getInterleavedMemoryOpCost()` to compute the cost of its strided shuffles. And `isShuffle()` may also provide inspiration ;-).
%tmp2.3 = add i32 %tmp1.2, %tmp1.3		%tmp2.3 = add i32 %tmp1.2, %tmp1.3
%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0		%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}
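
The mask question raised in the inline comments on this test can be made concrete with a small stand-alone model. The sketch below is an illustration only: it models the lane arithmetic, not the patch itself, and every name in it (`shuffle`, `vectorForm`, `interleavedScalar`, `stridedScalar`) is hypothetical. It suggests that the `<0,4,2,6>`/`<1,5,3,7>` masks match the scalar order used in `@build_vec_v4i32`, while the order Ayal proposes would need the stride-2 masks `<0,2,4,6>`/`<1,3,5,7>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<uint32_t>;

// Emulates shufflevector: mask values below A.size() select from A,
// the remaining values select from B.
Vec shuffle(const Vec &A, const Vec &B, const std::vector<int> &Mask) {
  assert(A.size() == B.size());
  Vec R;
  for (int Idx : Mask) {
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < 2 * A.size());
    std::size_t I = static_cast<std::size_t>(Idx);
    R.push_back(I < A.size() ? A[I] : B[I - A.size()]);
  }
  return R;
}

Vec add(const Vec &A, const Vec &B) {
  Vec R(A.size());
  for (std::size_t I = 0; I < A.size(); ++I) R[I] = A[I] + B[I];
  return R;
}

Vec sub(const Vec &A, const Vec &B) {
  Vec R(A.size());
  for (std::size_t I = 0; I < A.size(); ++I) R[I] = A[I] - B[I];
  return R;
}

// Vector form: shuffle T0 = v0 + v1 and T1 = v0 - v1 with the given masks,
// then add the two shuffled vectors.
Vec vectorForm(const Vec &V0, const Vec &V1, const std::vector<int> &LHSMask,
               const std::vector<int> &RHSMask) {
  Vec T0 = add(V0, V1), T1 = sub(V0, V1);
  return add(shuffle(T0, T1, LHSMask), shuffle(T0, T1, RHSMask));
}

// Scalar order used by the test: add and sub results alternate per lane.
Vec interleavedScalar(const Vec &V0, const Vec &V1) {
  Vec T0 = add(V0, V1), T1 = sub(V0, V1);
  return {T0[0] + T0[1], T1[0] + T1[1], T0[2] + T0[3], T1[2] + T1[3]};
}

// Order from the review comment: both add results first, then both subs.
Vec stridedScalar(const Vec &V0, const Vec &V1) {
  Vec T0 = add(V0, V1), T1 = sub(V0, V1);
  return {T0[0] + T0[1], T0[2] + T0[3], T1[0] + T1[1], T1[2] + T1[3]};
}
```

Under this model, applying the interleaved masks to the strided order swaps result lanes 1 and 2, which is consistent with the author's remark that the mask has to be recorded when the operands are re-bundled.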

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> %v0, i32 1		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> %v0, %v1
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> %v1, i32 0		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 2>
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> %v1, i32 1		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 3>
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP3_3:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[V0_0]], [[V1_0]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]]
; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2_0]], i32 0
; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i32 1
; CHECK-NEXT: [[TMP3_2:%.*]] = insertelement <4 x i32> [[TMP3_1]], i32 [[TMP2_0]], i32 2
; CHECK-NEXT: [[TMP3_3:%.*]] = insertelement <4 x i32> [[TMP3_2]], i32 [[TMP2_1]], i32 3
; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]		; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp1.0 = sub i32 %v0.0, %v1.0		%tmp1.0 = sub i32 %v0.0, %v1.0
%tmp1.1 = sub i32 %v0.1, %v1.1		%tmp1.1 = sub i32 %v0.1, %v1.1
%tmp2.0 = add i32 %tmp0.0, %tmp0.1		%tmp2.0 = add i32 %tmp0.0, %tmp0.1
%tmp2.1 = add i32 %tmp1.0, %tmp1.1		%tmp2.1 = add i32 %tmp1.0, %tmp1.1
%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0		%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_1(		; CHECK-LABEL: @build_vec_v4i32_reuse_1(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> %v0, i32 1		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> %v1, i32 0		; CHECK-NEXT: [[TMP2:%.*]] = xor <2 x i32> %v0, %v1
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> %v1, i32 1		; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 1, i32 0>
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[SHUFFLE]], <4 x i32> [[SHUFFLE1]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[SHUFFLE]], <4 x i32> [[SHUFFLE1]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
; CHECK-NEXT: [[TMP0_2:%.*]] = xor i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[TMP0_3:%.*]] = xor i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: ret <4 x i32> [[TMP5]]
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i32> undef, i32 [[TMP0_0]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i32> undef, i32 [[TMP0_1]], i32 0
; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP1_2:%.*]] = sub i32 [[TMP0_2]], [[TMP0_3]]
; CHECK-NEXT: [[TMP1_3:%.*]] = sub i32 [[TMP0_3]], [[TMP0_2]]
; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i32> [[TMP3]], i32 0
; CHECK-NEXT: [[TMP2_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP3]], i32 0
; CHECK-NEXT: [[TMP2_1:%.*]] = insertelement <4 x i32> [[TMP2_0]], i32 [[TMP5]], i32 1
; CHECK-NEXT: [[TMP2_2:%.*]] = insertelement <4 x i32> [[TMP2_1]], i32 [[TMP1_2]], i32 2
; CHECK-NEXT: [[TMP2_3:%.*]] = insertelement <4 x i32> [[TMP2_2]], i32 [[TMP1_3]], i32 3
; CHECK-NEXT: ret <4 x i32> [[TMP2_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp0.2 = xor i32 %v0.0, %v1.0		%tmp0.2 = xor i32 %v0.0, %v1.0
%tmp0.3 = xor i32 %v0.1, %v1.1		%tmp0.3 = xor i32 %v0.1, %v1.1
%tmp1.0 = sub i32 %tmp0.0, %tmp0.1		%tmp1.0 = sub i32 %tmp0.0, %tmp0.1
%tmp1.1 = sub i32 %tmp0.0, %tmp0.1		%tmp1.1 = sub i32 %tmp0.0, %tmp0.1
%tmp1.2 = sub i32 %tmp0.2, %tmp0.3		%tmp1.2 = sub i32 %tmp0.2, %tmp0.3
%tmp1.3 = sub i32 %tmp0.3, %tmp0.2		%tmp1.3 = sub i32 %tmp0.3, %tmp0.2
%tmp2.0 = insertelement <4 x i32> undef, i32 %tmp1.0, i32 0		%tmp2.0 = insertelement <4 x i32> undef, i32 %tmp1.0, i32 0
%tmp2.1 = insertelement <4 x i32> %tmp2.0, i32 %tmp1.1, i32 1		%tmp2.1 = insertelement <4 x i32> %tmp2.0, i32 %tmp1.1, i32 1
%tmp2.2 = insertelement <4 x i32> %tmp2.1, i32 %tmp1.2, i32 2		%tmp2.2 = insertelement <4 x i32> %tmp2.1, i32 %tmp1.2, i32 2
%tmp2.3 = insertelement <4 x i32> %tmp2.2, i32 %tmp1.3, i32 3		%tmp2.3 = insertelement <4 x i32> %tmp2.2, i32 %tmp1.3, i32 3
ret <4 x i32> %tmp2.3		ret <4 x i32> %tmp2.3
}		}

define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_3_binops(		; CHECK-LABEL: @build_vec_v4i32_3_binops(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> %v0, i32 1		; CHECK-NEXT: [[TMP2:%.*]] = xor <2 x i32> %v0, %v1
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> %v1, i32 0		; CHECK-NEXT: [[TMP3:%.*]] = mul <2 x i32> %v0, %v1
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> %v1, i32 1		; CHECK-NEXT: [[TMP4:%.*]] = xor <2 x i32> %v0, %v1
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP3]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP1_0:%.*]] = mul i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
; CHECK-NEXT: [[TMP1_1:%.*]] = mul i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
		Ayal: Could this be done equally well with 'normal' strided masks, i.e.:
		  ; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
		  ; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
		  ; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		  ; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
; CHECK-NEXT: [[TMP1:%.*]] = xor <2 x i32> %v0, %v1		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP7]], [[TMP8]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <2 x i32> zeroinitializer		; CHECK-NEXT: ret <4 x i32> [[TMP9]]
; CHECK-NEXT: [[TMP3:%.*]] = xor <2 x i32> %v0, %v1
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> undef, <2 x i32> <i32 1, i32 1>
; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> undef, i32 [[TMP0_0]], i32 0
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 [[TMP1_0]], i32 1
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> undef, i32 [[TMP0_1]], i32 0
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP1_1]], i32 1
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[TMP3_3:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp0.2 = xor i32 %v0.0, %v1.0		%tmp0.2 = xor i32 %v0.0, %v1.0
%tmp0.3 = xor i32 %v0.1, %v1.1		%tmp0.3 = xor i32 %v0.1, %v1.1
%tmp1.0 = mul i32 %v0.0, %v1.0		%tmp1.0 = mul i32 %v0.0, %v1.0
%tmp1.1 = mul i32 %v0.1, %v1.1		%tmp1.1 = mul i32 %v0.1, %v1.1
%tmp1.2 = xor i32 %v0.0, %v1.0		%tmp1.2 = xor i32 %v0.0, %v1.0
%tmp1.3 = xor i32 %v0.1, %v1.1		%tmp1.3 = xor i32 %v0.1, %v1.1
		Ayal: This admittedly refers to the original test: `%tmp1.2 == %tmp0.2` and `%tmp1.3 == %tmp0.3`. Not sure if this was intentional, but it does raise the issue of exercising bundles of originally different sizes, and of potentially reusing the same instruction multiple times. Are the four xors first bundled together, and then broken into two adjacent bundles of MinSize=2?
%tmp2.0 = add i32 %tmp0.0, %tmp0.1		%tmp2.0 = add i32 %tmp0.0, %tmp0.1
%tmp2.1 = add i32 %tmp1.0, %tmp1.1		%tmp2.1 = add i32 %tmp1.0, %tmp1.1
%tmp2.2 = add i32 %tmp0.2, %tmp0.3		%tmp2.2 = add i32 %tmp0.2, %tmp0.3
%tmp2.3 = add i32 %tmp1.2, %tmp1.3		%tmp2.3 = add i32 %tmp1.2, %tmp1.3
%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0		%tmp3.0 = insertelement <4 x i32> undef, i32 %tmp2.0, i32 0
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}
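
Ayal's question about strided masks can be explored with a small model. The sketch below is illustrative only and not part of the patch. `shufflevector` is a hypothetical Python helper that mimics the lane selection of the LLVM instruction for two equal-width inputs, and i32 wraparound is modeled by masking to 32 bits. The way the two-element vectors are paired up before the final shuffles ([add, xor] and [mul, xor] for the even/odd masks, [add, mul] and [xor, xor] for the strided masks) is an assumption, because the lines defining TMP5 and TMP6 are not visible in this excerpt.

```python
M32 = 0xFFFFFFFF  # model i32 wraparound


def shufflevector(a, b, mask):
    """Pick lanes from the concatenation of a and b, like LLVM's shufflevector."""
    lanes = list(a) + list(b)
    return [lanes[i] for i in mask]


def lanewise_ops(v0, v1):
    """The three distinct <2 x i32> operations in the test body."""
    add = [(x + y) & M32 for x, y in zip(v0, v1)]
    mul = [(x * y) & M32 for x, y in zip(v0, v1)]
    xor = [x ^ y for x, y in zip(v0, v1)]
    return add, mul, xor


def scalar_reference(v0, v1):
    """Result of the scalar IR: %tmp2.0 .. %tmp2.3 in lane order."""
    add, mul, xor = lanewise_ops(v0, v1)
    return [(add[0] + add[1]) & M32, (mul[0] + mul[1]) & M32,
            (xor[0] + xor[1]) & M32, (xor[0] + xor[1]) & M32]


def with_even_odd_masks(v0, v1):
    """Masks <0,4,2,6> / <1,5,3,7>; the input pairing is assumed."""
    add, mul, xor = lanewise_ops(v0, v1)
    tmp5, tmp6 = add + xor, mul + xor            # assumed concatenations
    tmp7 = shufflevector(tmp5, tmp6, [0, 4, 2, 6])
    tmp8 = shufflevector(tmp5, tmp6, [1, 5, 3, 7])
    return [(x + y) & M32 for x, y in zip(tmp7, tmp8)]


def with_strided_masks(v0, v1):
    """Masks <0,2,4,6> / <1,3,5,7>, as in the inline suggestion."""
    add, mul, xor = lanewise_ops(v0, v1)
    tmp5, tmp6 = add + mul, xor + xor            # assumed concatenations
    tmp7 = shufflevector(tmp5, tmp6, [0, 2, 4, 6])
    tmp8 = shufflevector(tmp5, tmp6, [1, 3, 5, 7])
    return [(x + y) & M32 for x, y in zip(tmp7, tmp8)]
```

Under these assumptions both mask pairs feed the same operands to the final add, so the choice between them would come down to which concatenations and shuffles are cheaper on the target rather than to correctness.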

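In the `@reduction_v4i32` test that follows, the old and new CHECK sequences place the add and sub lanes in different orders before the final reduction. The sketch below is a minimal model of both sequences, transcribed from the CHECK lines shown in the diff, to illustrate why the lane order should not matter once everything is summed. The function names are hypothetical, `shufflevector` is a Python stand-in for the LLVM instruction, and i32 arithmetic is modeled by masking to 32 bits.

```python
M32 = 0xFFFFFFFF  # model i32 wraparound


def shufflevector(a, b, mask):
    """Pick lanes from the concatenation of a and b, like LLVM's shufflevector."""
    lanes = list(a) + list(b)
    return [lanes[i] for i in mask]


def reduce_tail(lanes):
    """lshr 15, and 65537, mul 65535, add, xor, then the add reduction."""
    total = 0
    for x in lanes:
        t = (((x >> 15) & 65537) * 65535) & M32
        y = ((t + x) & M32) ^ t
        total = (total + y) & M32
    return total


def old_check(v0, v1):
    """Old expected output: scalar add/sub, then insertelement gathers."""
    add = [(a + b) & M32 for a, b in zip(v0, v1)]
    sub = [(a - b) & M32 for a, b in zip(v0, v1)]
    tmp4 = [sub[0], add[0], add[2], sub[2]]
    tmp8 = [sub[1], add[1], add[3], sub[3]]
    tmp9 = [(x + y) & M32 for x, y in zip(tmp4, tmp8)]
    return reduce_tail(tmp9)


def new_check(v0, v1):
    """New expected output: vector add/sub, then even/odd shuffles."""
    tmp1 = [(a + b) & M32 for a, b in zip(v0, v1)]
    tmp2 = [(a - b) & M32 for a, b in zip(v0, v1)]
    tmp3 = shufflevector(tmp2, tmp1, [0, 4, 2, 6])
    tmp4 = shufflevector(tmp2, tmp1, [1, 5, 3, 7])
    tmp5 = [(x + y) & M32 for x, y in zip(tmp3, tmp4)]
    return reduce_tail(tmp5)
```

Both sequences compute the same four pairwise sums, only in a different lane order, and every later operation is lane-wise until the commutative add reduction, so the two models agree.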
define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <4 x i32> %v0, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> %v0, %v1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <4 x i32> %v0, i32 1		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> %v0, %v1
; CHECK-NEXT: [[V0_2:%.*]] = extractelement <4 x i32> %v0, i32 2		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP1]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
; CHECK-NEXT: [[V0_3:%.*]] = extractelement <4 x i32> %v0, i32 3		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP1]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <4 x i32> %v1, i32 0		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <4 x i32> %v1, i32 1		; CHECK-NEXT: [[TMP6:%.*]] = lshr <4 x i32> [[TMP5]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[V1_2:%.*]] = extractelement <4 x i32> %v1, i32 2		; CHECK-NEXT: [[TMP7:%.*]] = and <4 x i32> [[TMP6]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[V1_3:%.*]] = extractelement <4 x i32> %v1, i32 3		; CHECK-NEXT: [[TMP8:%.*]] = mul nuw <4 x i32> [[TMP7]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP10:%.*]] = xor <4 x i32> [[TMP9]], [[TMP8]]
; CHECK-NEXT: [[TMP0_2:%.*]] = add i32 [[V0_2]], [[V1_2]]		; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.experimental.vector.reduce.add.i32.v4i32(<4 x i32> [[TMP10]])
; CHECK-NEXT: [[TMP0_3:%.*]] = add i32 [[V0_3]], [[V1_3]]		; CHECK-NEXT: ret i32 [[TMP11]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[V0_0]], [[V1_0]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP1_2:%.*]] = sub i32 [[V0_2]], [[V1_2]]
; CHECK-NEXT: [[TMP1_3:%.*]] = sub i32 [[V0_3]], [[V1_3]]
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1_0]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[TMP0_0]], i32 1
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[TMP0_2]], i32 2
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[TMP1_2]], i32 3
; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1_1]], i32 0
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP0_1]], i32 1
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0_3]], i32 2
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP1_3]], i32 3
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP4]], [[TMP8]]
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.experimental.vector.reduce.add.i32.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: ret i32 [[TMP15]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
%v1.3 = extractelement <4 x i32> %v1, i32 3		%v1.3 = extractelement <4 x i32> %v1, i32 3
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp0.2 = add i32 %v0.2, %v1.2		%tmp0.2 = add i32 %v0.2, %v1.2
%tmp0.3 = add i32 %v0.3, %v1.3		%tmp0.3 = add i32 %v0.3, %v1.3
%tmp1.0 = sub i32 %v0.0, %v1.0		%tmp1.0 = sub i32 %v0.0, %v1.0
%tmp1.1 = sub i32 %v0.1, %v1.1		%tmp1.1 = sub i32 %v0.1, %v1.1
%tmp1.2 = sub i32 %v0.2, %v1.2		%tmp1.2 = sub i32 %v0.2, %v1.2
%tmp1.3 = sub i32 %v0.3, %v1.3		%tmp1.3 = sub i32 %v0.3, %v1.3
%tmp2.0 = add i32 %tmp0.0, %tmp0.1		%tmp2.0 = add i32 %tmp0.0, %tmp0.1
%tmp2.1 = add i32 %tmp1.0, %tmp1.1		%tmp2.1 = add i32 %tmp1.0, %tmp1.1
%tmp2.2 = add i32 %tmp0.2, %tmp0.3		%tmp2.2 = add i32 %tmp0.2, %tmp0.3
		Ayal: ditto.
%tmp2.3 = add i32 %tmp1.2, %tmp1.3		%tmp2.3 = add i32 %tmp1.2, %tmp1.3
%tmp3.0 = lshr i32 %tmp2.0, 15		%tmp3.0 = lshr i32 %tmp2.0, 15
%tmp3.1 = lshr i32 %tmp2.1, 15		%tmp3.1 = lshr i32 %tmp2.1, 15
%tmp3.2 = lshr i32 %tmp2.2, 15		%tmp3.2 = lshr i32 %tmp2.2, 15
%tmp3.3 = lshr i32 %tmp2.3, 15		%tmp3.3 = lshr i32 %tmp2.3, 15
%tmp4.0 = and i32 %tmp3.0, 65537		%tmp4.0 = and i32 %tmp3.0, 65537
%tmp4.1 = and i32 %tmp3.1, 65537		%tmp4.1 = and i32 %tmp3.1, 65537
%tmp4.2 = and i32 %tmp3.2, 65537		%tmp4.2 = and i32 %tmp3.2, 65537