This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Cluster ordering for loads
ClosedPublic

Authored by dmgreen on Mar 21 2022, 8:17 AM.

Details

Summary

Given loads without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded with fewer loads, meaning less overall shuffling (for example, loading the v4i8 clusters of a v16i8 as single f32 loads, as opposed to multiple independent byte loads and inserts).
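
As a rough illustration of the idea (a minimal sketch, not the patch's actual code - the LoadInfo and clusterOrder names are made up for this example), clustering amounts to a stable sort of the gathered loads by base pointer and then by offset from that base, so elements that are adjacent in memory end up adjacent in the order:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Each gathered load is described by the base pointer it hangs off, its
// offset from that base, and its lane in the original (unsorted) gather.
struct LoadInfo {
  unsigned BaseId;
  int64_t Offset;
  unsigned OrigLane;
};

// Stable-sort the loads so loads from the same base are grouped together and
// ordered by offset; ties keep their original relative order. The returned
// vector is the new lane order.
inline std::vector<unsigned> clusterOrder(std::vector<LoadInfo> Loads) {
  std::stable_sort(Loads.begin(), Loads.end(),
                   [](const LoadInfo &A, const LoadInfo &B) {
                     return std::tie(A.BaseId, A.Offset) <
                            std::tie(B.BaseId, B.Offset);
                   });
  std::vector<unsigned> Order;
  Order.reserve(Loads.size()); // the number of elements is known up front
  for (const LoadInfo &L : Loads)
    Order.push_back(L.OrigLane);
  return Order;
}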

Diff Detail

Event Timeline

dmgreen created this revision.Mar 21 2022, 8:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 8:17 AM
dmgreen requested review of this revision.Mar 21 2022, 8:17 AM

Just FYI D105986

Oh yeah, that's interesting. I had tried a few of your other patches, but hadn't seen that one - it goes back to before I was looking. It seems to change a lot more - is this one of the things it does? It's hard to tell with so many changes :)

One thing I was thinking of doing here was a kind of ordering priority, only clustering the loads if there wasn't anything else that looked like a better order. A lot of the tests I tried did just fine with the clustered-loads order compared to any other, though, so I wasn't sure if it was worth adding something like that. When I tried it, it was getting tripped up in the TopToBottom ordering, which couldn't detect what counted as a better order.

ABataev added inline comments.Apr 4 2022, 8:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3257

I believe it's better to use Value * here instead of auto *.

3259

const auto &?

3259–3269

Better to outline this loop into a lambda.

3266

Use emplace_back(Ptr, *Diff, Cnt); here. Also, I'm not sure about the post-increment; we usually avoid it inside expressions.

3277

emplace_back(Ptr, 0, Cnt);

3282

const auto &?

3285

stable_sort

3291

What if some of the loads are loads from the same address but are different instructions?
Like:

%gep0 = getelementptr i16, i16* %0, i64 1
%l0 = load i16, i16* %gep0
%gep1 = getelementptr i16, i16* %0, i64 1
%l1 = load i16, i16* %gep1

%l0 and %l1 will be in the same vector but they are not consecutive; still, (End - Start) == int(Vec.size() - 1) might be true for >= 4 loads.

3305

Message?

3314

Try to preallocate space in the vector; the number of elements is known.

3368–3369

Add a check that we have >= 4 loads

dmgreen updated this revision to Diff 420435.Apr 5 2022, 3:16 AM
dmgreen marked 10 inline comments as done.

Update as per comments.

dmgreen added inline comments.Apr 5 2022, 3:17 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3259

For this one and the other const auto& below we might modify Base.

3259–3269

Not sure I understood this one exactly, let me know if I got the idea wrong.

3291

Changing it to checking all the indexes sounds good.
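
A minimal sketch of what checking all the indexes could look like, assuming the cluster's offsets have already been computed in element units relative to a common base; the offsetsAreConsecutive name is made up for illustration and is not the code in the patch:

#include <algorithm>
#include <cstdint>
#include <vector>

// Instead of only comparing (End - Start) against the cluster size, verify
// that the offsets, once sorted, form a consecutive run with no duplicates.
// Two loads from the same address then correctly fail the check.
inline bool offsetsAreConsecutive(std::vector<int64_t> Offsets) {
  if (Offsets.empty())
    return false;
  std::sort(Offsets.begin(), Offsets.end());
  for (size_t I = 1, E = Offsets.size(); I != E; ++I)
    if (Offsets[I] != Offsets[I - 1] + 1)
      return false;
  return true;
}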

ABataev added inline comments.Apr 5 2022, 4:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3302–3303

What if we have a non-power-of-2 number of elements in each cluster?

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

Looks like a regression here, worth investigating.

dmgreen added inline comments.Apr 5 2022, 7:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3302–3303

There are a couple of tests in reduce_blockstrided3 and store_blockstrided3 with blocks of size 3. The first became a few instructions shorter under X86 (and didn't change on AArch64). The second was different-but-not-worse on AArch64 (and didn't change under X86). Those are the only tests I've seen with non-power-of-2 clusters though, so it's not very exhaustive testing.

(A quick test of a "reduce_blockstrided5" seems to be better too - a lot less shuffling in the version I tried under X86, and more vectorization under AArch64.)

It can depend on the cost model - I think both AArch64 and X86 will have a much lower cost for insert-subvectors that are aligned and a power of 2 in size - and on how bad the initial ordering is: if it allows more less-than-full-width vectorization, that might still be a win. (A sketch of the block-of-5 shape is below.)

I can make it more conservative if you think that's best? I don't have a strong opinion either way.
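
For context, a hypothetical C-level shape of a "reduce_blockstrided5" style test (an assumption about what such a test might look like, not the function in loadorder.ll): blocks of five elements separated by a runtime stride, multiplied and reduced.

// Hypothetical sketch only: blocks of 5 i16 elements, separated by a runtime
// stride, multiplied element-wise and summed. Clusters of 5 are not a power
// of 2, which is the case being discussed above.
short reduce_blockstrided5(const short *x, const short *y, int stride) {
  int sum = 0;
  for (int block = 0; block < 4; ++block)
    for (int i = 0; i < 5; ++i)
      sum += x[block * stride + i] * y[block * stride + i];
  return (short)sum;
}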

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

Because of the two v4i16 muls? It looks like it does OK overall with the nicer order of the loads: https://godbolt.org/z/9f44fPeTW, and https://godbolt.org/z/eonoM8Ys7 for x86.

From what I can see, the SLP vectorizer produces a single v8i16 mul. It is instcombine that then splits it up, because it thinks one shuffle is better than two:

*** IR Dump After SLPVectorizerPass on reduce_blockstrided4 ***
define i16 @reduce_blockstrided4(i16* nocapture noundef readonly %x, i16* nocapture noundef readonly %y, i32 noundef %stride) {
entry:
  %idxprom = sext i32 %stride to i64
  %arrayidx4 = getelementptr inbounds i16, i16* %x, i64 %idxprom
  %arrayidx20 = getelementptr inbounds i16, i16* %y, i64 %idxprom
  %0 = bitcast i16* %x to <4 x i16>*
  %1 = load <4 x i16>, <4 x i16>* %0, align 2
  %2 = bitcast i16* %arrayidx4 to <4 x i16>*
  %3 = load <4 x i16>, <4 x i16>* %2, align 2
  %4 = bitcast i16* %y to <4 x i16>*
  %5 = load <4 x i16>, <4 x i16>* %4, align 2
  %6 = bitcast i16* %arrayidx20 to <4 x i16>*
  %7 = load <4 x i16>, <4 x i16>* %6, align 2
  %8 = shufflevector <4 x i16> %5, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %9 = shufflevector <4 x i16> %7, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %11 = shufflevector <4 x i16> %1, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %12 = shufflevector <4 x i16> %3, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %13 = shufflevector <8 x i16> %11, <8 x i16> %12, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %14 = mul <8 x i16> %10, %13
  %15 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %14)
  ret i16 %15
}

I can look into fixing that if you think it's worth doing. I'm not sure how yet (instcombine can't look at the cost model), but I've often worried about the number of vector shuffles that instcombine transforms. Maybe it could be moved to VectorCombine to get better costing.

ABataev accepted this revision.Apr 5 2022, 7:41 AM

LG

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

I think we need to improve the SLP vectorizer here to reduce the number of emitted shuffles; hopefully it will be fixed soon.

This revision is now accepted and ready to land.Apr 5 2022, 7:41 AM

OK, after a bit of a delay (D123801), I'm going to give this a try. Please let me know if it causes issues.

This revision was landed with ongoing or failed builds.May 7 2022, 6:38 AM
This revision was automatically updated to reflect the committed changes.