This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/Vectorize/
-
Transforms/
-
Vectorize/
12/28
SLPVectorizer.cpp
-
test/Transforms/SLPVectorizer/AArch64/
-
Transforms/
-
SLPVectorizer/
-
AArch64/
-
multiple_reduction.ll

Differential D133441

[SLP] Vectorize mutual horizontal reductions.
AbandonedPublic

Authored by labrinea on Sep 7 2022, 11:06 AM.

Download Raw Diff

Details

Reviewers

vporpo
fhahn
ABataev

Summary

When we try to reduce a horizontal reduction which has external users, the vectorization may be considered unprofitable because of the scalar extraction cost. However, we sometimes encounter mutual reductions cancelling each other out without actually having to extract any scalars. This fixes issue #55350.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

labrinea created this revision.Sep 7 2022, 11:06 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 11:06 AM

Herald added a subscriber: hiraditya. · View Herald Transcript

labrinea requested review of this revision.Sep 7 2022, 11:06 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 11:06 AM

Herald added a subscriber: • pcwang-thead. · View Herald Transcript

ABataev added inline comments.Sep 7 2022, 11:38 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	Not sure it is a good idea to store reduction ops here, you're working with reduced values actually.
11799	Why do you need this one? In case of successful reduction, the vectorizer restarts the analysis and rebuilds the reduction graph.

Harbormaster completed remote builds in B185461: Diff 458514.Sep 7 2022, 12:03 PM

Well, this issue shows up in other cases too, not just reductions, see for example https://reviews.llvm.org/D132773. Ideally I would prefer a more generic solution rather than workarounds for each specific case because each workaround is rather limited. For example this patch won't help vectorize the code if we have more than two reductions, or if the external users are not reductions like the stores in the following example:

define i64 @foo(ptr %ptr.0) {
entry:
  %ptr.1 = getelementptr inbounds i32, ptr %ptr.0, i64 1
  %ptr.2 = getelementptr inbounds i32, ptr %ptr.0, i64 2
  %ptr.3 = getelementptr inbounds i32, ptr %ptr.0, i64 3

  %0 = load i32, ptr %ptr.0, align 4, !tbaa !5
  %1 = load i32, ptr %ptr.1, align 4, !tbaa !5
  %2 = load i32, ptr %ptr.2, align 4, !tbaa !5
  %3 = load i32, ptr %ptr.3, align 4, !tbaa !5

  %add.1 = add i32 %1, %0
  %add.2 = add i32 %2, %add.1
  %add.3 = add i32 %3, %add.2

  %mul = mul i32 %0, %0
  %mul.1 = mul i32 %1, %1
  %mul.2 = mul i32 %2, %2
  %mul.3 = mul i32 %3, %3

  %add5.1 = add i32 %mul.1, %mul
  %add5.2 = add i32 %mul.2, %add5.1
  %add5.3 = add i32 %mul.3, %add5.2

  ; Non-reduction external users
  store i32 %0, ptr %ptr.0
  store i32 %1, ptr %ptr.1
  store i32 %2, ptr %ptr.2
  store i32 %3, ptr %ptr.3

  %conv = zext i32 %add.3 to i64
  %conv6 = zext i32 %add5.3 to i64
  %shl = shl nuw i64 %conv6, 32
  %add7 = or i64 %shl, %conv
  ret i64 %add7
}

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2656	Perhaps we could name this `SkipCost` or something along those lines, because that way we can reuse it for other similar workarounds?

In D133441#3775372, @vporpo wrote:

Ideally I would prefer a more generic solution rather than workarounds for each specific case

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	If you look at the reproducer: for i = ... sm += x[i]; sq += x[i] * x[i]; the addition `sm += x[i]` (which is a reduction op of the first horizontal reduction) is an external user of the scalar load from the multiplication `x[i] * x[i]` (which is a reduction value of the second horizontal reduction)
11799	The idea is to separate the "matching" of a horizontal reduction (matchAssociativeReduction) from the "processing" of it (tryToReduce). That way we can postpone the processing until we have found at least another one. This allows us to identify mutual reductions and ignore the extraction cost of the common scalar values. As Vasilis mentioned this limits the amount of mutual reductions to two, which is not ideal.

ABataev added inline comments.Sep 8 2022, 4:21 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	Aha, this is for external users. Then why do you need to store there reduced values?
11799	No need to postpone it, the pass will just repeat the same steps once again after the changes.

labrinea added inline comments.Sep 8 2022, 5:16 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	Because similarly the multiplication (reduced value) is an external user of the scalar load from the other horizontal reduction.
11799	I am not following. What should the code look like then? How can we solve the problem of mutual reductions without looking ahead? PendingReduction serves the purpose of one element buffer if that makes it clearer.

ABataev added inline comments.Sep 8 2022, 6:21 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.
11799	Just adjust the cost, everything else should be handled by the SLP vectorizer pass.

labrinea added inline comments.Sep 8 2022, 6:57 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	I see. In other words we cannot speculate that the vectorization will eventually happen before trying to actually reduce, so the look ahead approach I am suggesting is rather unsafe? By the time we are attempting to build the tree it is already too late.

ABataev added inline comments.Sep 8 2022, 7:01 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	You can gather such "vectorizable" scalars during/after the SLP graph building and store them in the set for the future analysis. Later, next SLP graphs will check if such scalars are potentially vectorizable and exclude the extract cost for them. The, SLP sees, that some part of the basic block was vectorized and restarts to try to vectorize other parts of the same block.

Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.

Unlike the similar situation with store seeds, the reduction does not have the issue with the "good" lane ordering, and there are no dependency issues that could prohibit them to be scheduled in a specific way, so this looks a lot safer. If the reduction user is connected to a vectorizable TreeEntry and the current tree gets vectorized, isn't it guaranteed that the tree rooted at the reduction will also get vectorized?

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

In D133441#3777591, @vporpo wrote:

Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.

Unlike the similar situation with store seeds, the reduction does not have the issue with the "good" lane ordering, and there are no dependency issues that could prohibit them to be scheduled in a specific way, so this looks a lot safer.

No quite, e.g. if the reduced values are loads or something like this, they still might be not scheduled.

If the reduction user is connected to a vectorizable TreeEntry and the current tree gets vectorized, isn't it guaranteed that the tree rooted at the reduction will also get vectorized?

Yes, but need a check that the tree entry is going to be vectorized.

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

dtemirbulatov added a subscriber: dtemirbulatov.Sep 9 2022, 7:25 PM

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070	That doesn't happen with horizontal reductions. If a reduction fails then its root instruction is put in the postponed list for later processing, where it's not treated as a reduction any more. At least that's my understanding. I could indeed cache the scalars of a vectorizable tree entry later on, maybe inside getTreeCost (where the extract cost is available, allowing a more informed decision). However the reduction ops themselves are not part of the tree entry, and so we would still need to cache them, perhaps via the ignore list. Also, the horizontal reduction is iteratively modeled by several vectorization trees of different factor. Which one do I pick to cache? All of them?
11799	As I explained on the other thread of comments, the unsuccessful reductions are not rescheduled the way you describe. I am still looking into how else could I make it work.

In D133441#3782167, @labrinea wrote:

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

This should require code changes in at least a couple of places:

Checking that a tree failed to vectorize because of external uses (changes mainly in`getTreeCost()`).
Collecting the roots that failed to vectorize because of external uses (these are the "seeds" that I am referring to). This should require changes after the invocations of getTreeCost() and the checks against SLPCostThreshold.
Adding some logic to retry vectorizing the missed sseds. This is probably the trickiest part because we don't really know when it is best to retry vectorizing these trees (unless perhaps we also keep track of when the external uses get vectorized?) . I was having a similar issue with the other patch: the vectorizer would succeed vectorizing the code 2-wide, before I would retry vectorizing from the missed 4-wide seed, which resulted in worse than optimal code. But anyway, retrying to vectorize such missed opportunities is definitely better than not retrying at all.

What you don't need to change is the cost calculation for external uses, as in this way the vectorizer is already calculating the costs correctly for you, as this time the uses will have already been vectorized. I think this is what Alexey meant with his comment (line 11778).

In D133441#3784652, @vporpo wrote:

In D133441#3782167, @labrinea wrote:

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

This should require code changes in at least a couple of places:

Checking that a tree failed to vectorize because of external uses (changes mainly in`getTreeCost()`).

Collecting the roots that failed to vectorize because of external uses (these are the "seeds" that I am referring to). This should require changes after the invocations of getTreeCost() and the checks against SLPCostThreshold.

Adding some logic to retry vectorizing the missed sseds. This is probably the trickiest part because we don't really know when it is best to retry vectorizing these trees (unless perhaps we also keep track of when the external uses get vectorized?) . I was having a similar issue with the other patch: the vectorizer would succeed vectorizing the code 2-wide, before I would retry vectorizing from the missed 4-wide seed, which resulted in worse than optimal code. But anyway, retrying to vectorize such missed opportunities is definitely better than not retrying at all.

What you don't need to change is the cost calculation for external uses, as in this way the vectorizer is already calculating the costs correctly for you, as this time the uses will have already been vectorized. I think this is what Alexey meant with his comment (line 11778).

Yep. Current solution is very limited, handles only 2 (possible) reductions while the described change may be able to handle more reductions and other vectorizable patterns too.

I tried to follow the review suggestions as close as possible. However I am not sure about the conditions inside cacheVectorizableValues() and those that determine the value of SkipCost. Relaxing them makes the change to trigger way too broadly resulting additional insert/extract element instructions in the unit tests.

ABataev added inline comments.Sep 20 2022, 10:37 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
921	How can the user node be equal to the operand node?
2710–2713	Not sure this is the best option to keep external vectorizable values, again it does not allow to handle values across different graphs, only across the same instance.
7123–7125	This again is too optimistic, you need to try to estimate the cost of extra shuffles, required for the vectorization.
7303–7306	Maybe put it to deleteTree() function and check if the graph-to-be-deleted was built but was not vectorized?
11320	Why return?

Harbormaster completed remote builds in B187770: Diff 461599.Sep 20 2022, 10:53 AM

labrinea added inline comments.Sep 21 2022, 10:29 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2710–2713	I am not sure I am following your line of thought. As far as I understand the vectorizer keeps one active graph at a time, created by `buildTree()`, which either gets vectorized or destroyed and then another bundle of instructions is scheduled and so on. The DenseMap I added works as a cache of tree entries that failed to vectorize due to extraction cost. Therefore it holds values across different graphs if that makes sense. As I said earlier, the change triggers in quite a few unit tests if the conditions are relaxed a bit, but I couldn't tell for sure if the new codegen was better or worse than before. If that's not what you meant, can you please explain and maybe suggest an alternative approach?
7123–7125	I thought that the shuffle cost is already accounted inside `getEntryCost()`, am I wrong? Or maybe that logic needs to be patched too?
11320	Because returning a non null value means the result will be put back in the queue and rescheduled later, which is what we want for trees that failed due to extraction cost. See line 11831 below.
11831	When `I == Inst` here it means that `tryToReduce()` returned the root of the horizontal reduction and not a newly vectorized tree.

ABataev added inline comments.Sep 21 2022, 10:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2710–2713	Yes, but it is able to repeat this process and build new graphs with the adjusted costs. And we can reuse this cache for other cases, where could not vectorize the graph because of the external costs
7123–7125	Nope, for external uses you need to add this cost somehow.

labrinea added inline comments.Sep 29 2022, 8:03 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7123–7125	Another thing I tried was to cache the result of `getEntryCost()` for the tree entries that correspond to the external users, and then use that cost here instead of the extraction cost estimate. I suppose that's not correct either judging from the produced codegen. I am not familiar with the SLP vectorizer and so my ideas are based on intuition rather than actual understanding I am affraid. I'd be open for an offline discussion if you think that would be enlightening/beneficial. I am not entirely sure but it doesn't seem possible to predict the extraction cost without actually vectorizing first and then retrospectively reflecting upon it. While reading the debug output I noticed that insert/extract-element code is being added when vectorizing. Then this code either gets fully vectorized and removed, in which case the cost estimate ought to be zero, or it remains in the final codegen.

ABataev added inline comments.Sep 30 2022, 1:39 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7123–7125	You cannot estimate it when you see the instruction for the first time, but if it is the second time, you can have the info already about its position in the previous tree/graph. And thus you can get info about possible extra shuffles. Also, you should be careful with extractelement instructions emitting, they may led to gather nodes and again too optimistic cost.

dmgreen mentioned this in D142359: [TTI][AArch64] Cost model vector INS instructions.Jan 24 2023, 7:41 AM

Fixed in D155459.

Herald added subscribers: wangpc, sunshaoce, StephenFan. · View Herald TranscriptSep 4 2023, 2:09 AM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Vectorize/

SLPVectorizer.cpp

53 lines

test/

Transforms/

SLPVectorizer/

AArch64/

multiple_reduction.ll

415 lines

Diff 461599

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 901 Lines • ▼ Show 20 Lines	public:

/// Builds external uses of the vectorized scalars, i.e. the list of		/// Builds external uses of the vectorized scalars, i.e. the list of
/// vectorized scalars to be extracted, their lanes and their scalar users. \p		/// vectorized scalars to be extracted, their lanes and their scalar users. \p
/// ExternallyUsedValues contains additional list of external uses to handle		/// ExternallyUsedValues contains additional list of external uses to handle
/// vectorization of reductions.		/// vectorization of reductions.
void		void
buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});		buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});

		/// Saves horizontal reductions that failed because of mutual extraction cost.
		bool cacheVectorizableValues() {
		bool Changed = false;
		for (unsigned I = 0, E = VectorizableTree.size(); I < E; ++I) {
		TreeEntry &TE = *VectorizableTree[I];
		for (int Lane = 0, LE = TE.Scalars.size(); Lane != LE; ++Lane) {
		Value *Scalar = TE.Scalars[Lane];
		for (User *U : Scalar->users()) {
		Instruction *UserInst = dyn_cast<Instruction>(U);
		if (UserInst && !isDeleted(UserInst) &&
		((UserIgnoreList && UserIgnoreList->contains(U)) \|\|
		getTreeEntry(U) == &TE))
		ABataevUnsubmitted Not Done Reply Inline Actions How can the user node be equal to the operand node? ABataev: How can the user node be equal to the operand node?
		Changed \|= VectorizableValues[Scalar].insert(U).second;
		}
		}
		}
		return Changed;
		}

		void eraseVectorizableValue(Value *V) { VectorizableValues.erase(V); }

		bool foundVectorizableValues() { return FoundVectorizableValues; }

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
▲ Show 20 Lines • Show All 1,695 Lines • ▼ Show 20 Lines	#endif
/// Maps a value to the proposed vectorizable size.		/// Maps a value to the proposed vectorizable size.
SmallDenseMap<Value *, unsigned> InstrElementSize;		SmallDenseMap<Value *, unsigned> InstrElementSize;

/// A list of scalars that we found that we need to keep as scalars.		/// A list of scalars that we found that we need to keep as scalars.
ValueSet MustGather;		ValueSet MustGather;

/// This POD struct describes one external user in the vectorized tree.		/// This POD struct describes one external user in the vectorized tree.
struct ExternalUser {		struct ExternalUser {
ExternalUser(Value S, llvm::User U, int L)		ExternalUser(Value S, llvm::User U, int L, bool SC = false)
: Scalar(S), User(U), Lane(L) {}		: Scalar(S), User(U), Lane(L), SkipCost(SC) {}

// Which scalar in our function.		// Which scalar in our function.
Value *Scalar;		Value *Scalar;

// Which user that uses the scalar.		// Which user that uses the scalar.
llvm::User *User;		llvm::User *User;

// Which lane does the scalar belong to.		// Which lane does the scalar belong to.
int Lane;		int Lane;

		bool SkipCost;
		vporpoUnsubmitted Not Done Reply Inline Actions Perhaps we could name this `SkipCost` or something along those lines, because that way we can reuse it for other similar workarounds? vporpo: Perhaps we could name this `SkipCost` or something along those lines, because that way we can…
};		};
using UserList = SmallVector<ExternalUser, 16>;		using UserList = SmallVector<ExternalUser, 16>;

/// Checks if two instructions may access the same memory.		/// Checks if two instructions may access the same memory.
///		///
/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it		/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it
/// is invariant in the calling loop.		/// is invariant in the calling loop.
bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,		bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,
Show All 37 Lines	#endif
DenseSet<size_t> AnalyzedReductionVals;		DenseSet<size_t> AnalyzedReductionVals;

/// A list of values that need to extracted out of the tree.		/// A list of values that need to extracted out of the tree.
/// This list holds pairs of (Internal Scalar : External User). External User		/// This list holds pairs of (Internal Scalar : External User). External User
/// can be nullptr, it means that this Internal Scalar will be used later,		/// can be nullptr, it means that this Internal Scalar will be used later,
/// after vectorization.		/// after vectorization.
UserList ExternalUses;		UserList ExternalUses;

		DenseMap<Value , SmallPtrSet<User , 2>> VectorizableValues;

		bool FoundVectorizableValues;

		ABataevUnsubmitted Not Done Reply Inline Actions Not sure this is the best option to keep external vectorizable values, again it does not allow to handle values across different graphs, only across the same instance. ABataev: Not sure this is the best option to keep external vectorizable values, again it does not allow…
		labrineaAuthorUnsubmitted Done Reply Inline Actions I am not sure I am following your line of thought. As far as I understand the vectorizer keeps one active graph at a time, created by `buildTree()`, which either gets vectorized or destroyed and then another bundle of instructions is scheduled and so on. The DenseMap I added works as a cache of tree entries that failed to vectorize due to extraction cost. Therefore it holds values across different graphs if that makes sense. As I said earlier, the change triggers in quite a few unit tests if the conditions are relaxed a bit, but I couldn't tell for sure if the new codegen was better or worse than before. If that's not what you meant, can you please explain and maybe suggest an alternative approach? labrinea: I am not sure I am following your line of thought. As far as I understand the vectorizer keeps…
		ABataevUnsubmitted Not Done Reply Inline Actions Yes, but it is able to repeat this process and build new graphs with the adjusted costs. And we can reuse this cache for other cases, where could not vectorize the graph because of the external costs ABataev: Yes, but it is able to repeat this process and build new graphs with the adjusted costs. And we…
/// Values used only by @llvm.assume calls.		/// Values used only by @llvm.assume calls.
SmallPtrSet<const Value *, 32> EphValues;		SmallPtrSet<const Value *, 32> EphValues;

/// Holds all of the instructions that we gathered.		/// Holds all of the instructions that we gathered.
SetVector<Instruction *> GatherShuffleSeq;		SetVector<Instruction *> GatherShuffleSeq;

/// A list of blocks that we are going to CSE.		/// A list of blocks that we are going to CSE.
SetVector<BasicBlock *> CSEBlocks;		SetVector<BasicBlock *> CSEBlocks;
▲ Show 20 Lines • Show All 1,582 Lines • ▼ Show 20 Lines	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
}		}

// Ignore users in the user ignore list.		// Ignore users in the user ignore list.
if (UserIgnoreList && UserIgnoreList->contains(UserInst))		if (UserIgnoreList && UserIgnoreList->contains(UserInst))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
<< Lane << " from " << *Scalar << ".\n");		<< Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));
		bool SkipCost = VectorizableValues[Scalar].contains(U) &&
		all_of(VectorizableValues[Scalar], [this, Entry, U](User *Usr) {
		auto *UserInst = dyn_cast<Instruction>(Usr);
		return !UserInst \|\| isDeleted(UserInst) \|\| Usr == U \|\|
		(UserIgnoreList && UserIgnoreList->contains(Usr)) \|\|
		getTreeEntry(Usr) == Entry;
		});
		ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane, SkipCost));
}		}
}		}
}		}
}		}

DenseMap<Value , SmallVector<StoreInst , 4>>		DenseMap<Value , SmallVector<StoreInst , 4>>
BoUpSLP::collectUserStores(const BoUpSLP::TreeEntry *TE) const {		BoUpSLP::collectUserStores(const BoUpSLP::TreeEntry *TE) const {
DenseMap<Value , SmallVector<StoreInst , 4>> PtrToStoresMap;		DenseMap<Value , SmallVector<StoreInst , 4>> PtrToStoresMap;
▲ Show 20 Lines • Show All 2,786 Lines • ▼ Show 20 Lines	InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
}		}

SmallPtrSet<Value *, 16> ExtractCostCalculated;		SmallPtrSet<Value *, 16> ExtractCostCalculated;
InstructionCost ExtractCost = 0;		InstructionCost ExtractCost = 0;
SmallVector<MapVector<const TreeEntry *, SmallVector<int>>> ShuffleMasks;		SmallVector<MapVector<const TreeEntry *, SmallVector<int>>> ShuffleMasks;
SmallVector<std::pair<Value , const TreeEntry >> FirstUsers;		SmallVector<std::pair<Value , const TreeEntry >> FirstUsers;
SmallVector<APInt> DemandedElts;		SmallVector<APInt> DemandedElts;
for (ExternalUser &EU : ExternalUses) {		for (ExternalUser &EU : ExternalUses) {
		if (EU.SkipCost)
		continue;

		ABataevUnsubmitted Not Done Reply Inline Actions This again is too optimistic, you need to try to estimate the cost of extra shuffles, required for the vectorization. ABataev: This again is too optimistic, you need to try to estimate the cost of extra shuffles, required…
		labrineaAuthorUnsubmitted Done Reply Inline Actions I thought that the shuffle cost is already accounted inside `getEntryCost()`, am I wrong? Or maybe that logic needs to be patched too? labrinea: I thought that the shuffle cost is already accounted inside `getEntryCost()`, am I wrong? Or…
		ABataevUnsubmitted Not Done Reply Inline Actions Nope, for external uses you need to add this cost somehow. ABataev: Nope, for external uses you need to add this cost somehow.
		labrineaAuthorUnsubmitted Done Reply Inline Actions Another thing I tried was to cache the result of `getEntryCost()` for the tree entries that correspond to the external users, and then use that cost here instead of the extraction cost estimate. I suppose that's not correct either judging from the produced codegen. I am not familiar with the SLP vectorizer and so my ideas are based on intuition rather than actual understanding I am affraid. I'd be open for an offline discussion if you think that would be enlightening/beneficial. I am not entirely sure but it doesn't seem possible to predict the extraction cost without actually vectorizing first and then retrospectively reflecting upon it. While reading the debug output I noticed that insert/extract-element code is being added when vectorizing. Then this code either gets fully vectorized and removed, in which case the cost estimate ought to be zero, or it remains in the final codegen. labrinea: Another thing I tried was to cache the result of `getEntryCost()` for the tree entries that…
		ABataevUnsubmitted Not Done Reply Inline Actions You cannot estimate it when you see the instruction for the first time, but if it is the second time, you can have the info already about its position in the previous tree/graph. And thus you can get info about possible extra shuffles. Also, you should be careful with extractelement instructions emitting, they may led to gather nodes and again too optimistic cost. ABataev: You cannot estimate it when you see the instruction for the first time, but if it is the second…
// We only add extract cost once for the same scalar.		// We only add extract cost once for the same scalar.
if (!isa_and_nonnull<InsertElementInst>(EU.User) &&		if (!isa_and_nonnull<InsertElementInst>(EU.User) &&
!ExtractCostCalculated.insert(EU.Scalar).second)		!ExtractCostCalculated.insert(EU.Scalar).second)
continue;		continue;

// Uses by ephemeral values are free (because the ephemeral value will be		// Uses by ephemeral values are free (because the ephemeral value will be
// removed prior to code generation, and so the extraction will be		// removed prior to code generation, and so the extraction will be
// removed as well).		// removed as well).
▲ Show 20 Lines • Show All 161 Lines • ▼ Show 20 Lines	(void)performExtractsShuffleAction<const TreeEntry>(
[](const TreeEntry *E) { return E->getVectorFactor(); }, ResizeToVF,		[](const TreeEntry *E) { return E->getVectorFactor(); }, ResizeToVF,
EstimateShufflesCost);		EstimateShufflesCost);
InstructionCost InsertCost = TTI->getScalarizationOverhead(		InstructionCost InsertCost = TTI->getScalarizationOverhead(
cast<FixedVectorType>(FirstUsers[I].first->getType()), DemandedElts[I],		cast<FixedVectorType>(FirstUsers[I].first->getType()), DemandedElts[I],
/Insert/ true, /Extract/ false);		/Insert/ true, /Extract/ false);
Cost -= InsertCost;		Cost -= InsertCost;
}		}

		FoundVectorizableValues =
		Cost >= -SLPCostThreshold && Cost - ExtractCost < -SLPCostThreshold ?
		cacheVectorizableValues() : false;

		ABataevUnsubmitted Not Done Reply Inline Actions Maybe put it to deleteTree() function and check if the graph-to-be-deleted was built but was not vectorized? ABataev: Maybe put it to deleteTree() function and check if the graph-to-be-deleted was built but was…
#ifndef NDEBUG		#ifndef NDEBUG
SmallString<256> Str;		SmallString<256> Str;
{		{
raw_svector_ostream OS(Str);		raw_svector_ostream OS(Str);
OS << "SLP: Spill Cost = " << SpillCost << ".\n"		OS << "SLP: Spill Cost = " << SpillCost << ".\n"
<< "SLP: Extract Cost = " << ExtractCost << ".\n"		<< "SLP: Extract Cost = " << ExtractCost << ".\n"
<< "SLP: Total Cost = " << Cost << ".\n";		<< "SLP: Total Cost = " << Cost << ".\n";
}		}
▲ Show 20 Lines • Show All 1,648 Lines • ▼ Show 20 Lines	#ifndef NDEBUG
(isa_and_nonnull<Instruction>(U) &&		(isa_and_nonnull<Instruction>(U) &&
isDeleted(cast<Instruction>(U)))) &&		isDeleted(cast<Instruction>(U)))) &&
"Deleting out-of-tree value");		"Deleting out-of-tree value");
}		}
}		}
#endif		#endif
LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");
eraseInstruction(cast<Instruction>(Scalar));		eraseInstruction(cast<Instruction>(Scalar));
		eraseVectorizableValue(Scalar);
}		}
}		}

Builder.ClearInsertionPoint();		Builder.ClearInsertionPoint();
InstrElementSize.clear();		InstrElementSize.clear();

return VectorizableTree[0]->VectorizedValue;		return VectorizableTree[0]->VectorizedValue;
}		}
▲ Show 20 Lines • Show All 2,079 Lines • ▼ Show 20 Lines	for (auto &PossibleReducedVals : PossibleReducedValsVect) {
ReducedVals.back().append(Data.rbegin(), Data.rend());		ReducedVals.back().append(Data.rbegin(), Data.rend());
}		}
// Sort the reduced values by number of same/alternate opcode and/or pointer		// Sort the reduced values by number of same/alternate opcode and/or pointer
// operand.		// operand.
stable_sort(ReducedVals, [](ArrayRef<Value > P1, ArrayRef<Value > P2) {		stable_sort(ReducedVals, [](ArrayRef<Value > P1, ArrayRef<Value > P2) {
return P1.size() > P2.size();		return P1.size() > P2.size();
});		});
return true;		return true;
}		}

/// Attempt to vectorize the tree found by matchAssociativeReduction.		/// Attempt to vectorize the tree found by matchAssociativeReduction.
Value tryToReduce(BoUpSLP &V, TargetTransformInfo TTI) {		Value tryToReduce(BoUpSLP &V, TargetTransformInfo TTI) {
		ABataevUnsubmitted Not Done Reply Inline Actions Not sure it is a good idea to store reduction ops here, you're working with reduced values actually. ABataev: Not sure it is a good idea to store reduction ops here, you're working with reduced values…
		labrineaAuthorUnsubmitted Done Reply Inline Actions If you look at the reproducer: for i = ... sm += x[i]; sq += x[i] * x[i]; the addition `sm += x[i]` (which is a reduction op of the first horizontal reduction) is an external user of the scalar load from the multiplication `x[i] * x[i]` (which is a reduction value of the second horizontal reduction) labrinea: If you look at the reproducer: ``` for i = ... sm += x[i]; sq += x[i] * x[i]; ``` the…
		ABataevUnsubmitted Not Done Reply Inline Actions Aha, this is for external users. Then why do you need to store there reduced values? ABataev: Aha, this is for external users. Then why do you need to store there reduced values?
		labrineaAuthorUnsubmitted Done Reply Inline Actions Because similarly the multiplication (reduced value) is an external user of the scalar load from the other horizontal reduction. labrinea: Because similarly the multiplication (reduced value) is an external user of the scalar load…
		ABataevUnsubmitted Not Done Reply Inline Actions Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector. ABataev: Ah, I see what do you mean. But this is not the best place for gathering the reduced values…
		labrineaAuthorUnsubmitted Done Reply Inline Actions I see. In other words we cannot speculate that the vectorization will eventually happen before trying to actually reduce, so the look ahead approach I am suggesting is rather unsafe? By the time we are attempting to build the tree it is already too late. labrinea: I see. In other words we cannot speculate that the vectorization will eventually happen before…
		ABataevUnsubmitted Not Done Reply Inline Actions You can gather such "vectorizable" scalars during/after the SLP graph building and store them in the set for the future analysis. Later, next SLP graphs will check if such scalars are potentially vectorizable and exclude the extract cost for them. The, SLP sees, that some part of the basic block was vectorized and restarts to try to vectorize other parts of the same block. ABataev: You can gather such "vectorizable" scalars during/after the SLP graph building and store them…
		labrineaAuthorUnsubmitted Done Reply Inline Actions That doesn't happen with horizontal reductions. If a reduction fails then its root instruction is put in the postponed list for later processing, where it's not treated as a reduction any more. At least that's my understanding. I could indeed cache the scalars of a vectorizable tree entry later on, maybe inside getTreeCost (where the extract cost is available, allowing a more informed decision). However the reduction ops themselves are not part of the tree entry, and so we would still need to cache them, perhaps via the ignore list. Also, the horizontal reduction is iteratively modeled by several vectorization trees of different factor. Which one do I pick to cache? All of them? labrinea: That doesn't happen with horizontal reductions. If a reduction fails then its root instruction…
constexpr int ReductionLimit = 4;		constexpr int ReductionLimit = 4;
constexpr unsigned RegMaxNumber = 4;		constexpr unsigned RegMaxNumber = 4;
constexpr unsigned RedValsMaxNumber = 128;		constexpr unsigned RedValsMaxNumber = 128;
// If there are a sufficient number of reduction values, reduce		// If there are a sufficient number of reduction values, reduce
// to a nearby power-of-2. We can safely generate oversized		// to a nearby power-of-2. We can safely generate oversized
// vectors and rely on the backend to split them to legal sizes.		// vectors and rely on the backend to split them to legal sizes.
unsigned NumReducedVals = std::accumulate(		unsigned NumReducedVals = std::accumulate(
ReducedVals.begin(), ReducedVals.end(), 0,		ReducedVals.begin(), ReducedVals.end(), 0,
▲ Show 20 Lines • Show All 232 Lines • ▼ Show 20 Lines	for (unsigned I = 0, E = ReducedVals.size(); I < E; ++I) {
return OptimizationRemarkMissed(		return OptimizationRemarkMissed(
SV_NAME, "HorSLPNotBeneficial",		SV_NAME, "HorSLPNotBeneficial",
ReducedValsToOps.find(VL[0])->second.front())		ReducedValsToOps.find(VL[0])->second.front())
<< "Vectorizing horizontal reduction is possible "		<< "Vectorizing horizontal reduction is possible "
<< "but not beneficial with cost " << ore::NV("Cost", Cost)		<< "but not beneficial with cost " << ore::NV("Cost", Cost)
<< " and threshold "		<< " and threshold "
<< ore::NV("Threshold", -SLPCostThreshold);		<< ore::NV("Threshold", -SLPCostThreshold);
});		});
		if (V.foundVectorizableValues())
		return ReductionRoot;
		ABataevUnsubmitted Not Done Reply Inline Actions Why return? ABataev: Why return?
		labrineaAuthorUnsubmitted Done Reply Inline Actions Because returning a non null value means the result will be put back in the queue and rescheduled later, which is what we want for trees that failed due to extraction cost. See line 11831 below. labrinea: Because returning a non null value means the result will be put back in the queue and…
if (!AdjustReducedVals())		if (!AdjustReducedVals())
V.analyzedReductionVals(VL);		V.analyzedReductionVals(VL);
continue;		continue;
}		}

LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"		LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"
<< Cost << ". (HorRdx)\n");		<< Cost << ". (HorRdx)\n");
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
▲ Show 20 Lines • Show All 462 Lines • ▼ Show 20 Lines	bool SLPVectorizerPass::vectorizeHorReduction(
// horizontal reduction.		// horizontal reduction.
// Interrupt the process if the Root instruction itself was vectorized or all		// Interrupt the process if the Root instruction itself was vectorized or all
// sub-trees not higher that RecursionMaxDepth were analyzed/vectorized.		// sub-trees not higher that RecursionMaxDepth were analyzed/vectorized.
// If a horizintal reduction was not matched or vectorized we collect		// If a horizintal reduction was not matched or vectorized we collect
// instructions for possible later attempts for vectorization.		// instructions for possible later attempts for vectorization.
std::queue<std::pair<Instruction *, unsigned>> Stack;		std::queue<std::pair<Instruction *, unsigned>> Stack;
Stack.emplace(Root, 0);		Stack.emplace(Root, 0);
SmallPtrSet<Value *, 8> VisitedInstrs;		SmallPtrSet<Value *, 8> VisitedInstrs;
bool Res = false;		bool Res = false;
		ABataevUnsubmitted Not Done Reply Inline Actions Why do you need this one? In case of successful reduction, the vectorizer restarts the analysis and rebuilds the reduction graph. ABataev: Why do you need this one? In case of successful reduction, the vectorizer restarts the analysis…
		labrineaAuthorUnsubmitted Done Reply Inline Actions The idea is to separate the "matching" of a horizontal reduction (matchAssociativeReduction) from the "processing" of it (tryToReduce). That way we can postpone the processing until we have found at least another one. This allows us to identify mutual reductions and ignore the extraction cost of the common scalar values. As Vasilis mentioned this limits the amount of mutual reductions to two, which is not ideal. labrinea: The idea is to separate the "matching" of a horizontal reduction (matchAssociativeReduction)…
		ABataevUnsubmitted Not Done Reply Inline Actions No need to postpone it, the pass will just repeat the same steps once again after the changes. ABataev: No need to postpone it, the pass will just repeat the same steps once again after the changes.
		labrineaAuthorUnsubmitted Done Reply Inline Actions I am not following. What should the code look like then? How can we solve the problem of mutual reductions without looking ahead? PendingReduction serves the purpose of one element buffer if that makes it clearer. labrinea: I am not following. What should the code look like then? How can we solve the problem of mutual…
		ABataevUnsubmitted Not Done Reply Inline Actions Just adjust the cost, everything else should be handled by the SLP vectorizer pass. ABataev: Just adjust the cost, everything else should be handled by the SLP vectorizer pass.
		labrineaAuthorUnsubmitted Done Reply Inline Actions As I explained on the other thread of comments, the unsuccessful reductions are not rescheduled the way you describe. I am still looking into how else could I make it work. labrinea: As I explained on the other thread of comments, the unsuccessful reductions are not rescheduled…
auto &&TryToReduce = [this, TTI, &P, &R](Instruction Inst, Value &B0,		auto &&TryToReduce = [this, TTI, &P, &R](Instruction Inst, Value &B0,
Value &B1) -> Value {		Value &B1) -> Value {
if (R.isAnalyzedReductionRoot(Inst))		if (R.isAnalyzedReductionRoot(Inst))
return nullptr;		return nullptr;
bool IsBinop = matchRdxBop(Inst, B0, B1);		bool IsBinop = matchRdxBop(Inst, B0, B1);
bool IsSelect = match(Inst, m_Select(m_Value(), m_Value(), m_Value()));		bool IsSelect = match(Inst, m_Select(m_Value(), m_Value(), m_Value()));
if (IsBinop \|\| IsSelect) {		if (IsBinop \|\| IsSelect) {
HorizontalReduction HorRdx;		HorizontalReduction HorRdx;
Show All 15 Lines	while (!Stack.empty()) {
Value B0 = nullptr, B1 = nullptr;		Value B0 = nullptr, B1 = nullptr;
if (Value *V = TryToReduce(Inst, B0, B1)) {		if (Value *V = TryToReduce(Inst, B0, B1)) {
Res = true;		Res = true;
// Set P to nullptr to avoid re-analysis of phi node in		// Set P to nullptr to avoid re-analysis of phi node in
// matchAssociativeReduction function unless this is the root node.		// matchAssociativeReduction function unless this is the root node.
P = nullptr;		P = nullptr;
if (auto *I = dyn_cast<Instruction>(V)) {		if (auto *I = dyn_cast<Instruction>(V)) {
// Try to find another reduction.		// Try to find another reduction.
Stack.emplace(I, Level);		Stack.emplace(I, Level);
		labrineaAuthorUnsubmitted Done Reply Inline Actions When `I == Inst` here it means that `tryToReduce()` returned the root of the horizontal reduction and not a newly vectorized tree. labrinea: When `I == Inst` here it means that `tryToReduce()` returned the root of the horizontal…
continue;		continue;
}		}
} else {		} else {
bool IsBinop = B0 && B1;		bool IsBinop = B0 && B1;
if (P && IsBinop) {		if (P && IsBinop) {
Inst = dyn_cast<Instruction>(B0);		Inst = dyn_cast<Instruction>(B0);
if (Inst == P)		if (Inst == P)
Inst = dyn_cast<Instruction>(B1);		Inst = dyn_cast<Instruction>(B1);
▲ Show 20 Lines • Show All 755 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/AArch64/multiple_reduction.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -basic-aa -slp-vectorizer -S \| FileCheck %s			; RUN: opt < %s -basic-aa -slp-vectorizer -S \| FileCheck %s

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64-arm-none-eabi"			target triple = "aarch64-arm-none-eabi"

	; This test has mutual reductions, referencing the same data:			; This test has mutual reductions, referencing the same data:
	; for i = ...			; for i = ...
	; sm += x[i];			; sm += x[i];
	; sq += x[i] * x[i];			; sq += x[i] * x[i];
	; It currently doesn't SLP vectorize, but should.			; It currently doesn't SLP vectorize, but should.

	define i64 @straight(i16* nocapture noundef readonly %p, i32 noundef %st) {			define i64 @straight(i16* nocapture noundef readonly %p, i32 noundef %st) {
	; CHECK-LABEL: @straight(			; CHECK-LABEL: @straight(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[IDX_EXT:%.]] = sext i32 [[ST:%.]] to i64			; CHECK-NEXT: [[IDX_EXT:%.]] = sext i32 [[ST:%.]] to i64
	; CHECK-NEXT: [[TMP0:%.]] = load i16, i16 [[P:%.*]], align 2			; CHECK-NEXT: [[ADD_PTR:%.]] = getelementptr inbounds i16, i16 [[P:%.*]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[CONV:%.*]] = zext i16 [[TMP0]] to i32
	; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV]], [[CONV]]
	; CHECK-NEXT: [[ARRAYIDX_1:%.]] = getelementptr inbounds i16, i16 [[P]], i64 1
	; CHECK-NEXT: [[TMP1:%.]] = load i16, i16 [[ARRAYIDX_1]], align 2
	; CHECK-NEXT: [[CONV_1:%.*]] = zext i16 [[TMP1]] to i32
	; CHECK-NEXT: [[ADD_1:%.*]] = add nuw nsw i32 [[CONV]], [[CONV_1]]
	; CHECK-NEXT: [[MUL_1:%.*]] = mul nuw nsw i32 [[CONV_1]], [[CONV_1]]
	; CHECK-NEXT: [[ADD11_1:%.*]] = add nuw i32 [[MUL_1]], [[MUL]]
	; CHECK-NEXT: [[ARRAYIDX_2:%.]] = getelementptr inbounds i16, i16 [[P]], i64 2
	; CHECK-NEXT: [[TMP2:%.]] = load i16, i16 [[ARRAYIDX_2]], align 2
	; CHECK-NEXT: [[CONV_2:%.*]] = zext i16 [[TMP2]] to i32
	; CHECK-NEXT: [[ADD_2:%.*]] = add nuw nsw i32 [[ADD_1]], [[CONV_2]]
	; CHECK-NEXT: [[MUL_2:%.*]] = mul nuw nsw i32 [[CONV_2]], [[CONV_2]]
	; CHECK-NEXT: [[ADD11_2:%.*]] = add i32 [[MUL_2]], [[ADD11_1]]
	; CHECK-NEXT: [[ARRAYIDX_3:%.]] = getelementptr inbounds i16, i16 [[P]], i64 3
	; CHECK-NEXT: [[TMP3:%.]] = load i16, i16 [[ARRAYIDX_3]], align 2
	; CHECK-NEXT: [[CONV_3:%.*]] = zext i16 [[TMP3]] to i32
	; CHECK-NEXT: [[ADD_3:%.*]] = add nuw nsw i32 [[ADD_2]], [[CONV_3]]
	; CHECK-NEXT: [[MUL_3:%.*]] = mul nuw nsw i32 [[CONV_3]], [[CONV_3]]
	; CHECK-NEXT: [[ADD11_3:%.*]] = add i32 [[MUL_3]], [[ADD11_2]]
	; CHECK-NEXT: [[ARRAYIDX_4:%.]] = getelementptr inbounds i16, i16 [[P]], i64 4
	; CHECK-NEXT: [[TMP4:%.]] = load i16, i16 [[ARRAYIDX_4]], align 2
	; CHECK-NEXT: [[CONV_4:%.*]] = zext i16 [[TMP4]] to i32
	; CHECK-NEXT: [[ADD_4:%.*]] = add nuw nsw i32 [[ADD_3]], [[CONV_4]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul nuw nsw i32 [[CONV_4]], [[CONV_4]]
	; CHECK-NEXT: [[ADD11_4:%.*]] = add i32 [[MUL_4]], [[ADD11_3]]
	; CHECK-NEXT: [[ARRAYIDX_5:%.]] = getelementptr inbounds i16, i16 [[P]], i64 5
	; CHECK-NEXT: [[TMP5:%.]] = load i16, i16 [[ARRAYIDX_5]], align 2
	; CHECK-NEXT: [[CONV_5:%.*]] = zext i16 [[TMP5]] to i32
	; CHECK-NEXT: [[ADD_5:%.*]] = add nuw nsw i32 [[ADD_4]], [[CONV_5]]
	; CHECK-NEXT: [[MUL_5:%.*]] = mul nuw nsw i32 [[CONV_5]], [[CONV_5]]
	; CHECK-NEXT: [[ADD11_5:%.*]] = add i32 [[MUL_5]], [[ADD11_4]]
	; CHECK-NEXT: [[ARRAYIDX_6:%.]] = getelementptr inbounds i16, i16 [[P]], i64 6
	; CHECK-NEXT: [[TMP6:%.]] = load i16, i16 [[ARRAYIDX_6]], align 2
	; CHECK-NEXT: [[CONV_6:%.*]] = zext i16 [[TMP6]] to i32
	; CHECK-NEXT: [[ADD_6:%.*]] = add nuw nsw i32 [[ADD_5]], [[CONV_6]]
	; CHECK-NEXT: [[MUL_6:%.*]] = mul nuw nsw i32 [[CONV_6]], [[CONV_6]]
	; CHECK-NEXT: [[ADD11_6:%.*]] = add i32 [[MUL_6]], [[ADD11_5]]
	; CHECK-NEXT: [[ARRAYIDX_7:%.]] = getelementptr inbounds i16, i16 [[P]], i64 7
	; CHECK-NEXT: [[TMP7:%.]] = load i16, i16 [[ARRAYIDX_7]], align 2
	; CHECK-NEXT: [[CONV_7:%.*]] = zext i16 [[TMP7]] to i32
	; CHECK-NEXT: [[ADD_7:%.*]] = add nuw nsw i32 [[ADD_6]], [[CONV_7]]
	; CHECK-NEXT: [[MUL_7:%.*]] = mul nuw nsw i32 [[CONV_7]], [[CONV_7]]
	; CHECK-NEXT: [[ADD11_7:%.*]] = add i32 [[MUL_7]], [[ADD11_6]]
	; CHECK-NEXT: [[ADD_PTR:%.]] = getelementptr inbounds i16, i16 [[P]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP8:%.]] = load i16, i16 [[ADD_PTR]], align 2
	; CHECK-NEXT: [[CONV_140:%.*]] = zext i16 [[TMP8]] to i32
	; CHECK-NEXT: [[ADD_141:%.*]] = add nuw nsw i32 [[ADD_7]], [[CONV_140]]
	; CHECK-NEXT: [[MUL_142:%.*]] = mul nuw nsw i32 [[CONV_140]], [[CONV_140]]
	; CHECK-NEXT: [[ADD11_143:%.*]] = add i32 [[MUL_142]], [[ADD11_7]]
	; CHECK-NEXT: [[ARRAYIDX_1_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 1
	; CHECK-NEXT: [[TMP9:%.]] = load i16, i16 [[ARRAYIDX_1_1]], align 2
	; CHECK-NEXT: [[CONV_1_1:%.*]] = zext i16 [[TMP9]] to i32
	; CHECK-NEXT: [[ADD_1_1:%.*]] = add nuw nsw i32 [[ADD_141]], [[CONV_1_1]]
	; CHECK-NEXT: [[MUL_1_1:%.*]] = mul nuw nsw i32 [[CONV_1_1]], [[CONV_1_1]]
	; CHECK-NEXT: [[ADD11_1_1:%.*]] = add i32 [[MUL_1_1]], [[ADD11_143]]
	; CHECK-NEXT: [[ARRAYIDX_2_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 2
	; CHECK-NEXT: [[TMP10:%.]] = load i16, i16 [[ARRAYIDX_2_1]], align 2
	; CHECK-NEXT: [[CONV_2_1:%.*]] = zext i16 [[TMP10]] to i32
	; CHECK-NEXT: [[ADD_2_1:%.*]] = add nuw nsw i32 [[ADD_1_1]], [[CONV_2_1]]
	; CHECK-NEXT: [[MUL_2_1:%.*]] = mul nuw nsw i32 [[CONV_2_1]], [[CONV_2_1]]
	; CHECK-NEXT: [[ADD11_2_1:%.*]] = add i32 [[MUL_2_1]], [[ADD11_1_1]]
	; CHECK-NEXT: [[ARRAYIDX_3_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 3
	; CHECK-NEXT: [[TMP11:%.]] = load i16, i16 [[ARRAYIDX_3_1]], align 2
	; CHECK-NEXT: [[CONV_3_1:%.*]] = zext i16 [[TMP11]] to i32
	; CHECK-NEXT: [[ADD_3_1:%.*]] = add nuw nsw i32 [[ADD_2_1]], [[CONV_3_1]]
	; CHECK-NEXT: [[MUL_3_1:%.*]] = mul nuw nsw i32 [[CONV_3_1]], [[CONV_3_1]]
	; CHECK-NEXT: [[ADD11_3_1:%.*]] = add i32 [[MUL_3_1]], [[ADD11_2_1]]
	; CHECK-NEXT: [[ARRAYIDX_4_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 4
	; CHECK-NEXT: [[TMP12:%.]] = load i16, i16 [[ARRAYIDX_4_1]], align 2
	; CHECK-NEXT: [[CONV_4_1:%.*]] = zext i16 [[TMP12]] to i32
	; CHECK-NEXT: [[ADD_4_1:%.*]] = add nuw nsw i32 [[ADD_3_1]], [[CONV_4_1]]
	; CHECK-NEXT: [[MUL_4_1:%.*]] = mul nuw nsw i32 [[CONV_4_1]], [[CONV_4_1]]
	; CHECK-NEXT: [[ADD11_4_1:%.*]] = add i32 [[MUL_4_1]], [[ADD11_3_1]]
	; CHECK-NEXT: [[ARRAYIDX_5_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 5
	; CHECK-NEXT: [[TMP13:%.]] = load i16, i16 [[ARRAYIDX_5_1]], align 2
	; CHECK-NEXT: [[CONV_5_1:%.*]] = zext i16 [[TMP13]] to i32
	; CHECK-NEXT: [[ADD_5_1:%.*]] = add nuw nsw i32 [[ADD_4_1]], [[CONV_5_1]]
	; CHECK-NEXT: [[MUL_5_1:%.*]] = mul nuw nsw i32 [[CONV_5_1]], [[CONV_5_1]]
	; CHECK-NEXT: [[ADD11_5_1:%.*]] = add i32 [[MUL_5_1]], [[ADD11_4_1]]
	; CHECK-NEXT: [[ARRAYIDX_6_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 6
	; CHECK-NEXT: [[TMP14:%.]] = load i16, i16 [[ARRAYIDX_6_1]], align 2
	; CHECK-NEXT: [[CONV_6_1:%.*]] = zext i16 [[TMP14]] to i32
	; CHECK-NEXT: [[ADD_6_1:%.*]] = add nuw nsw i32 [[ADD_5_1]], [[CONV_6_1]]
	; CHECK-NEXT: [[MUL_6_1:%.*]] = mul nuw nsw i32 [[CONV_6_1]], [[CONV_6_1]]
	; CHECK-NEXT: [[ADD11_6_1:%.*]] = add i32 [[MUL_6_1]], [[ADD11_5_1]]
	; CHECK-NEXT: [[ARRAYIDX_7_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 7
	; CHECK-NEXT: [[TMP15:%.]] = load i16, i16 [[ARRAYIDX_7_1]], align 2
	; CHECK-NEXT: [[CONV_7_1:%.*]] = zext i16 [[TMP15]] to i32
	; CHECK-NEXT: [[ADD_7_1:%.*]] = add nuw nsw i32 [[ADD_6_1]], [[CONV_7_1]]
	; CHECK-NEXT: [[MUL_7_1:%.*]] = mul nuw nsw i32 [[CONV_7_1]], [[CONV_7_1]]
	; CHECK-NEXT: [[ADD11_7_1:%.*]] = add i32 [[MUL_7_1]], [[ADD11_6_1]]
	; CHECK-NEXT: [[ADD_PTR_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_1:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP16:%.]] = load i16, i16 [[ADD_PTR_1]], align 2
	; CHECK-NEXT: [[CONV_244:%.*]] = zext i16 [[TMP16]] to i32
	; CHECK-NEXT: [[ADD_245:%.*]] = add nuw nsw i32 [[ADD_7_1]], [[CONV_244]]
	; CHECK-NEXT: [[MUL_246:%.*]] = mul nuw nsw i32 [[CONV_244]], [[CONV_244]]
	; CHECK-NEXT: [[ADD11_247:%.*]] = add i32 [[MUL_246]], [[ADD11_7_1]]
	; CHECK-NEXT: [[ARRAYIDX_1_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 1
	; CHECK-NEXT: [[TMP17:%.]] = load i16, i16 [[ARRAYIDX_1_2]], align 2
	; CHECK-NEXT: [[CONV_1_2:%.*]] = zext i16 [[TMP17]] to i32
	; CHECK-NEXT: [[ADD_1_2:%.*]] = add nuw nsw i32 [[ADD_245]], [[CONV_1_2]]
	; CHECK-NEXT: [[MUL_1_2:%.*]] = mul nuw nsw i32 [[CONV_1_2]], [[CONV_1_2]]
	; CHECK-NEXT: [[ADD11_1_2:%.*]] = add i32 [[MUL_1_2]], [[ADD11_247]]
	; CHECK-NEXT: [[ARRAYIDX_2_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 2
	; CHECK-NEXT: [[TMP18:%.]] = load i16, i16 [[ARRAYIDX_2_2]], align 2
	; CHECK-NEXT: [[CONV_2_2:%.*]] = zext i16 [[TMP18]] to i32
	; CHECK-NEXT: [[ADD_2_2:%.*]] = add nuw nsw i32 [[ADD_1_2]], [[CONV_2_2]]
	; CHECK-NEXT: [[MUL_2_2:%.*]] = mul nuw nsw i32 [[CONV_2_2]], [[CONV_2_2]]
	; CHECK-NEXT: [[ADD11_2_2:%.*]] = add i32 [[MUL_2_2]], [[ADD11_1_2]]
	; CHECK-NEXT: [[ARRAYIDX_3_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 3
	; CHECK-NEXT: [[TMP19:%.]] = load i16, i16 [[ARRAYIDX_3_2]], align 2
	; CHECK-NEXT: [[CONV_3_2:%.*]] = zext i16 [[TMP19]] to i32
	; CHECK-NEXT: [[ADD_3_2:%.*]] = add nuw nsw i32 [[ADD_2_2]], [[CONV_3_2]]
	; CHECK-NEXT: [[MUL_3_2:%.*]] = mul nuw nsw i32 [[CONV_3_2]], [[CONV_3_2]]
	; CHECK-NEXT: [[ADD11_3_2:%.*]] = add i32 [[MUL_3_2]], [[ADD11_2_2]]
	; CHECK-NEXT: [[ARRAYIDX_4_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 4
	; CHECK-NEXT: [[TMP20:%.]] = load i16, i16 [[ARRAYIDX_4_2]], align 2
	; CHECK-NEXT: [[CONV_4_2:%.*]] = zext i16 [[TMP20]] to i32
	; CHECK-NEXT: [[ADD_4_2:%.*]] = add nuw nsw i32 [[ADD_3_2]], [[CONV_4_2]]
	; CHECK-NEXT: [[MUL_4_2:%.*]] = mul nuw nsw i32 [[CONV_4_2]], [[CONV_4_2]]
	; CHECK-NEXT: [[ADD11_4_2:%.*]] = add i32 [[MUL_4_2]], [[ADD11_3_2]]
	; CHECK-NEXT: [[ARRAYIDX_5_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 5
	; CHECK-NEXT: [[TMP21:%.]] = load i16, i16 [[ARRAYIDX_5_2]], align 2
	; CHECK-NEXT: [[CONV_5_2:%.*]] = zext i16 [[TMP21]] to i32
	; CHECK-NEXT: [[ADD_5_2:%.*]] = add nuw nsw i32 [[ADD_4_2]], [[CONV_5_2]]
	; CHECK-NEXT: [[MUL_5_2:%.*]] = mul nuw nsw i32 [[CONV_5_2]], [[CONV_5_2]]
	; CHECK-NEXT: [[ADD11_5_2:%.*]] = add i32 [[MUL_5_2]], [[ADD11_4_2]]
	; CHECK-NEXT: [[ARRAYIDX_6_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 6
	; CHECK-NEXT: [[TMP22:%.]] = load i16, i16 [[ARRAYIDX_6_2]], align 2
	; CHECK-NEXT: [[CONV_6_2:%.*]] = zext i16 [[TMP22]] to i32
	; CHECK-NEXT: [[ADD_6_2:%.*]] = add nuw nsw i32 [[ADD_5_2]], [[CONV_6_2]]
	; CHECK-NEXT: [[MUL_6_2:%.*]] = mul nuw nsw i32 [[CONV_6_2]], [[CONV_6_2]]
	; CHECK-NEXT: [[ADD11_6_2:%.*]] = add i32 [[MUL_6_2]], [[ADD11_5_2]]
	; CHECK-NEXT: [[ARRAYIDX_7_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 7
	; CHECK-NEXT: [[TMP23:%.]] = load i16, i16 [[ARRAYIDX_7_2]], align 2
	; CHECK-NEXT: [[CONV_7_2:%.*]] = zext i16 [[TMP23]] to i32
	; CHECK-NEXT: [[ADD_7_2:%.*]] = add nuw nsw i32 [[ADD_6_2]], [[CONV_7_2]]
	; CHECK-NEXT: [[MUL_7_2:%.*]] = mul nuw nsw i32 [[CONV_7_2]], [[CONV_7_2]]
	; CHECK-NEXT: [[ADD11_7_2:%.*]] = add i32 [[MUL_7_2]], [[ADD11_6_2]]
	; CHECK-NEXT: [[ADD_PTR_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_2:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_1]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP24:%.]] = load i16, i16 [[ADD_PTR_2]], align 2
	; CHECK-NEXT: [[CONV_348:%.*]] = zext i16 [[TMP24]] to i32
	; CHECK-NEXT: [[ADD_349:%.*]] = add nuw nsw i32 [[ADD_7_2]], [[CONV_348]]
	; CHECK-NEXT: [[MUL_350:%.*]] = mul nuw nsw i32 [[CONV_348]], [[CONV_348]]
	; CHECK-NEXT: [[ADD11_351:%.*]] = add i32 [[MUL_350]], [[ADD11_7_2]]
	; CHECK-NEXT: [[ARRAYIDX_1_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 1
	; CHECK-NEXT: [[TMP25:%.]] = load i16, i16 [[ARRAYIDX_1_3]], align 2
	; CHECK-NEXT: [[CONV_1_3:%.*]] = zext i16 [[TMP25]] to i32
	; CHECK-NEXT: [[ADD_1_3:%.*]] = add nuw nsw i32 [[ADD_349]], [[CONV_1_3]]
	; CHECK-NEXT: [[MUL_1_3:%.*]] = mul nuw nsw i32 [[CONV_1_3]], [[CONV_1_3]]
	; CHECK-NEXT: [[ADD11_1_3:%.*]] = add i32 [[MUL_1_3]], [[ADD11_351]]
	; CHECK-NEXT: [[ARRAYIDX_2_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 2
	; CHECK-NEXT: [[TMP26:%.]] = load i16, i16 [[ARRAYIDX_2_3]], align 2
	; CHECK-NEXT: [[CONV_2_3:%.*]] = zext i16 [[TMP26]] to i32
	; CHECK-NEXT: [[ADD_2_3:%.*]] = add nuw nsw i32 [[ADD_1_3]], [[CONV_2_3]]
	; CHECK-NEXT: [[MUL_2_3:%.*]] = mul nuw nsw i32 [[CONV_2_3]], [[CONV_2_3]]
	; CHECK-NEXT: [[ADD11_2_3:%.*]] = add i32 [[MUL_2_3]], [[ADD11_1_3]]
	; CHECK-NEXT: [[ARRAYIDX_3_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 3
	; CHECK-NEXT: [[TMP27:%.]] = load i16, i16 [[ARRAYIDX_3_3]], align 2
	; CHECK-NEXT: [[CONV_3_3:%.*]] = zext i16 [[TMP27]] to i32
	; CHECK-NEXT: [[ADD_3_3:%.*]] = add nuw nsw i32 [[ADD_2_3]], [[CONV_3_3]]
	; CHECK-NEXT: [[MUL_3_3:%.*]] = mul nuw nsw i32 [[CONV_3_3]], [[CONV_3_3]]
	; CHECK-NEXT: [[ADD11_3_3:%.*]] = add i32 [[MUL_3_3]], [[ADD11_2_3]]
	; CHECK-NEXT: [[ARRAYIDX_4_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 4
	; CHECK-NEXT: [[TMP28:%.]] = load i16, i16 [[ARRAYIDX_4_3]], align 2
	; CHECK-NEXT: [[CONV_4_3:%.*]] = zext i16 [[TMP28]] to i32
	; CHECK-NEXT: [[ADD_4_3:%.*]] = add nuw nsw i32 [[ADD_3_3]], [[CONV_4_3]]
	; CHECK-NEXT: [[MUL_4_3:%.*]] = mul nuw nsw i32 [[CONV_4_3]], [[CONV_4_3]]
	; CHECK-NEXT: [[ADD11_4_3:%.*]] = add i32 [[MUL_4_3]], [[ADD11_3_3]]
	; CHECK-NEXT: [[ARRAYIDX_5_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 5
	; CHECK-NEXT: [[TMP29:%.]] = load i16, i16 [[ARRAYIDX_5_3]], align 2
	; CHECK-NEXT: [[CONV_5_3:%.*]] = zext i16 [[TMP29]] to i32
	; CHECK-NEXT: [[ADD_5_3:%.*]] = add nuw nsw i32 [[ADD_4_3]], [[CONV_5_3]]
	; CHECK-NEXT: [[MUL_5_3:%.*]] = mul nuw nsw i32 [[CONV_5_3]], [[CONV_5_3]]
	; CHECK-NEXT: [[ADD11_5_3:%.*]] = add i32 [[MUL_5_3]], [[ADD11_4_3]]
	; CHECK-NEXT: [[ARRAYIDX_6_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 6
	; CHECK-NEXT: [[TMP30:%.]] = load i16, i16 [[ARRAYIDX_6_3]], align 2
	; CHECK-NEXT: [[CONV_6_3:%.*]] = zext i16 [[TMP30]] to i32
	; CHECK-NEXT: [[ADD_6_3:%.*]] = add nuw nsw i32 [[ADD_5_3]], [[CONV_6_3]]
	; CHECK-NEXT: [[MUL_6_3:%.*]] = mul nuw nsw i32 [[CONV_6_3]], [[CONV_6_3]]
	; CHECK-NEXT: [[ADD11_6_3:%.*]] = add i32 [[MUL_6_3]], [[ADD11_5_3]]
	; CHECK-NEXT: [[ARRAYIDX_7_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 7
	; CHECK-NEXT: [[TMP31:%.]] = load i16, i16 [[ARRAYIDX_7_3]], align 2
	; CHECK-NEXT: [[CONV_7_3:%.*]] = zext i16 [[TMP31]] to i32
	; CHECK-NEXT: [[ADD_7_3:%.*]] = add nuw nsw i32 [[ADD_6_3]], [[CONV_7_3]]
	; CHECK-NEXT: [[MUL_7_3:%.*]] = mul nuw nsw i32 [[CONV_7_3]], [[CONV_7_3]]
	; CHECK-NEXT: [[ADD11_7_3:%.*]] = add i32 [[MUL_7_3]], [[ADD11_6_3]]
	; CHECK-NEXT: [[ADD_PTR_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_3:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_2]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP32:%.]] = load i16, i16 [[ADD_PTR_3]], align 2
	; CHECK-NEXT: [[CONV_452:%.*]] = zext i16 [[TMP32]] to i32
	; CHECK-NEXT: [[ADD_453:%.*]] = add nuw nsw i32 [[ADD_7_3]], [[CONV_452]]
	; CHECK-NEXT: [[MUL_454:%.*]] = mul nuw nsw i32 [[CONV_452]], [[CONV_452]]
	; CHECK-NEXT: [[ADD11_455:%.*]] = add i32 [[MUL_454]], [[ADD11_7_3]]
	; CHECK-NEXT: [[ARRAYIDX_1_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 1
	; CHECK-NEXT: [[TMP33:%.]] = load i16, i16 [[ARRAYIDX_1_4]], align 2
	; CHECK-NEXT: [[CONV_1_4:%.*]] = zext i16 [[TMP33]] to i32
	; CHECK-NEXT: [[ADD_1_4:%.*]] = add nuw nsw i32 [[ADD_453]], [[CONV_1_4]]
	; CHECK-NEXT: [[MUL_1_4:%.*]] = mul nuw nsw i32 [[CONV_1_4]], [[CONV_1_4]]
	; CHECK-NEXT: [[ADD11_1_4:%.*]] = add i32 [[MUL_1_4]], [[ADD11_455]]
	; CHECK-NEXT: [[ARRAYIDX_2_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 2
	; CHECK-NEXT: [[TMP34:%.]] = load i16, i16 [[ARRAYIDX_2_4]], align 2
	; CHECK-NEXT: [[CONV_2_4:%.*]] = zext i16 [[TMP34]] to i32
	; CHECK-NEXT: [[ADD_2_4:%.*]] = add nuw nsw i32 [[ADD_1_4]], [[CONV_2_4]]
	; CHECK-NEXT: [[MUL_2_4:%.*]] = mul nuw nsw i32 [[CONV_2_4]], [[CONV_2_4]]
	; CHECK-NEXT: [[ADD11_2_4:%.*]] = add i32 [[MUL_2_4]], [[ADD11_1_4]]
	; CHECK-NEXT: [[ARRAYIDX_3_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 3
	; CHECK-NEXT: [[TMP35:%.]] = load i16, i16 [[ARRAYIDX_3_4]], align 2
	; CHECK-NEXT: [[CONV_3_4:%.*]] = zext i16 [[TMP35]] to i32
	; CHECK-NEXT: [[ADD_3_4:%.*]] = add nuw nsw i32 [[ADD_2_4]], [[CONV_3_4]]
	; CHECK-NEXT: [[MUL_3_4:%.*]] = mul nuw nsw i32 [[CONV_3_4]], [[CONV_3_4]]
	; CHECK-NEXT: [[ADD11_3_4:%.*]] = add i32 [[MUL_3_4]], [[ADD11_2_4]]
	; CHECK-NEXT: [[ARRAYIDX_4_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 4
	; CHECK-NEXT: [[TMP36:%.]] = load i16, i16 [[ARRAYIDX_4_4]], align 2
	; CHECK-NEXT: [[CONV_4_4:%.*]] = zext i16 [[TMP36]] to i32
	; CHECK-NEXT: [[ADD_4_4:%.*]] = add nuw nsw i32 [[ADD_3_4]], [[CONV_4_4]]
	; CHECK-NEXT: [[MUL_4_4:%.*]] = mul nuw nsw i32 [[CONV_4_4]], [[CONV_4_4]]
	; CHECK-NEXT: [[ADD11_4_4:%.*]] = add i32 [[MUL_4_4]], [[ADD11_3_4]]
	; CHECK-NEXT: [[ARRAYIDX_5_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 5
	; CHECK-NEXT: [[TMP37:%.]] = load i16, i16 [[ARRAYIDX_5_4]], align 2
	; CHECK-NEXT: [[CONV_5_4:%.*]] = zext i16 [[TMP37]] to i32
	; CHECK-NEXT: [[ADD_5_4:%.*]] = add nuw nsw i32 [[ADD_4_4]], [[CONV_5_4]]
	; CHECK-NEXT: [[MUL_5_4:%.*]] = mul nuw nsw i32 [[CONV_5_4]], [[CONV_5_4]]
	; CHECK-NEXT: [[ADD11_5_4:%.*]] = add i32 [[MUL_5_4]], [[ADD11_4_4]]
	; CHECK-NEXT: [[ARRAYIDX_6_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 6
	; CHECK-NEXT: [[TMP38:%.]] = load i16, i16 [[ARRAYIDX_6_4]], align 2
	; CHECK-NEXT: [[CONV_6_4:%.*]] = zext i16 [[TMP38]] to i32
	; CHECK-NEXT: [[ADD_6_4:%.*]] = add nuw nsw i32 [[ADD_5_4]], [[CONV_6_4]]
	; CHECK-NEXT: [[MUL_6_4:%.*]] = mul nuw nsw i32 [[CONV_6_4]], [[CONV_6_4]]
	; CHECK-NEXT: [[ADD11_6_4:%.*]] = add i32 [[MUL_6_4]], [[ADD11_5_4]]
	; CHECK-NEXT: [[ARRAYIDX_7_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 7
	; CHECK-NEXT: [[TMP39:%.]] = load i16, i16 [[ARRAYIDX_7_4]], align 2
	; CHECK-NEXT: [[CONV_7_4:%.*]] = zext i16 [[TMP39]] to i32
	; CHECK-NEXT: [[ADD_7_4:%.*]] = add nuw nsw i32 [[ADD_6_4]], [[CONV_7_4]]
	; CHECK-NEXT: [[MUL_7_4:%.*]] = mul nuw nsw i32 [[CONV_7_4]], [[CONV_7_4]]
	; CHECK-NEXT: [[ADD11_7_4:%.*]] = add i32 [[MUL_7_4]], [[ADD11_6_4]]
	; CHECK-NEXT: [[ADD_PTR_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_4:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_3]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP40:%.]] = load i16, i16 [[ADD_PTR_4]], align 2
	; CHECK-NEXT: [[CONV_556:%.*]] = zext i16 [[TMP40]] to i32
	; CHECK-NEXT: [[ADD_557:%.*]] = add nuw nsw i32 [[ADD_7_4]], [[CONV_556]]
	; CHECK-NEXT: [[MUL_558:%.*]] = mul nuw nsw i32 [[CONV_556]], [[CONV_556]]
	; CHECK-NEXT: [[ADD11_559:%.*]] = add i32 [[MUL_558]], [[ADD11_7_4]]
	; CHECK-NEXT: [[ARRAYIDX_1_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 1
	; CHECK-NEXT: [[TMP41:%.]] = load i16, i16 [[ARRAYIDX_1_5]], align 2
	; CHECK-NEXT: [[CONV_1_5:%.*]] = zext i16 [[TMP41]] to i32
	; CHECK-NEXT: [[ADD_1_5:%.*]] = add nuw nsw i32 [[ADD_557]], [[CONV_1_5]]
	; CHECK-NEXT: [[MUL_1_5:%.*]] = mul nuw nsw i32 [[CONV_1_5]], [[CONV_1_5]]
	; CHECK-NEXT: [[ADD11_1_5:%.*]] = add i32 [[MUL_1_5]], [[ADD11_559]]
	; CHECK-NEXT: [[ARRAYIDX_2_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 2
	; CHECK-NEXT: [[TMP42:%.]] = load i16, i16 [[ARRAYIDX_2_5]], align 2
	; CHECK-NEXT: [[CONV_2_5:%.*]] = zext i16 [[TMP42]] to i32
	; CHECK-NEXT: [[ADD_2_5:%.*]] = add nuw nsw i32 [[ADD_1_5]], [[CONV_2_5]]
	; CHECK-NEXT: [[MUL_2_5:%.*]] = mul nuw nsw i32 [[CONV_2_5]], [[CONV_2_5]]
	; CHECK-NEXT: [[ADD11_2_5:%.*]] = add i32 [[MUL_2_5]], [[ADD11_1_5]]
	; CHECK-NEXT: [[ARRAYIDX_3_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 3
	; CHECK-NEXT: [[TMP43:%.]] = load i16, i16 [[ARRAYIDX_3_5]], align 2
	; CHECK-NEXT: [[CONV_3_5:%.*]] = zext i16 [[TMP43]] to i32
	; CHECK-NEXT: [[ADD_3_5:%.*]] = add nuw nsw i32 [[ADD_2_5]], [[CONV_3_5]]
	; CHECK-NEXT: [[MUL_3_5:%.*]] = mul nuw nsw i32 [[CONV_3_5]], [[CONV_3_5]]
	; CHECK-NEXT: [[ADD11_3_5:%.*]] = add i32 [[MUL_3_5]], [[ADD11_2_5]]
	; CHECK-NEXT: [[ARRAYIDX_4_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 4
	; CHECK-NEXT: [[TMP44:%.]] = load i16, i16 [[ARRAYIDX_4_5]], align 2
	; CHECK-NEXT: [[CONV_4_5:%.*]] = zext i16 [[TMP44]] to i32
	; CHECK-NEXT: [[ADD_4_5:%.*]] = add nuw nsw i32 [[ADD_3_5]], [[CONV_4_5]]
	; CHECK-NEXT: [[MUL_4_5:%.*]] = mul nuw nsw i32 [[CONV_4_5]], [[CONV_4_5]]
	; CHECK-NEXT: [[ADD11_4_5:%.*]] = add i32 [[MUL_4_5]], [[ADD11_3_5]]
	; CHECK-NEXT: [[ARRAYIDX_5_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 5
	; CHECK-NEXT: [[TMP45:%.]] = load i16, i16 [[ARRAYIDX_5_5]], align 2
	; CHECK-NEXT: [[CONV_5_5:%.*]] = zext i16 [[TMP45]] to i32
	; CHECK-NEXT: [[ADD_5_5:%.*]] = add nuw nsw i32 [[ADD_4_5]], [[CONV_5_5]]
	; CHECK-NEXT: [[MUL_5_5:%.*]] = mul nuw nsw i32 [[CONV_5_5]], [[CONV_5_5]]
	; CHECK-NEXT: [[ADD11_5_5:%.*]] = add i32 [[MUL_5_5]], [[ADD11_4_5]]
	; CHECK-NEXT: [[ARRAYIDX_6_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 6
	; CHECK-NEXT: [[TMP46:%.]] = load i16, i16 [[ARRAYIDX_6_5]], align 2
	; CHECK-NEXT: [[CONV_6_5:%.*]] = zext i16 [[TMP46]] to i32
	; CHECK-NEXT: [[ADD_6_5:%.*]] = add nuw nsw i32 [[ADD_5_5]], [[CONV_6_5]]
	; CHECK-NEXT: [[MUL_6_5:%.*]] = mul nuw nsw i32 [[CONV_6_5]], [[CONV_6_5]]
	; CHECK-NEXT: [[ADD11_6_5:%.*]] = add i32 [[MUL_6_5]], [[ADD11_5_5]]
	; CHECK-NEXT: [[ARRAYIDX_7_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 7
	; CHECK-NEXT: [[TMP47:%.]] = load i16, i16 [[ARRAYIDX_7_5]], align 2
	; CHECK-NEXT: [[CONV_7_5:%.*]] = zext i16 [[TMP47]] to i32
	; CHECK-NEXT: [[ADD_7_5:%.*]] = add nuw nsw i32 [[ADD_6_5]], [[CONV_7_5]]
	; CHECK-NEXT: [[MUL_7_5:%.*]] = mul nuw nsw i32 [[CONV_7_5]], [[CONV_7_5]]
	; CHECK-NEXT: [[ADD11_7_5:%.*]] = add i32 [[MUL_7_5]], [[ADD11_6_5]]
	; CHECK-NEXT: [[ADD_PTR_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_5:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_4]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP48:%.]] = load i16, i16 [[ADD_PTR_5]], align 2
	; CHECK-NEXT: [[CONV_660:%.*]] = zext i16 [[TMP48]] to i32
	; CHECK-NEXT: [[ADD_661:%.*]] = add nuw nsw i32 [[ADD_7_5]], [[CONV_660]]
	; CHECK-NEXT: [[MUL_662:%.*]] = mul nuw nsw i32 [[CONV_660]], [[CONV_660]]
	; CHECK-NEXT: [[ADD11_663:%.*]] = add i32 [[MUL_662]], [[ADD11_7_5]]
	; CHECK-NEXT: [[ARRAYIDX_1_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 1
	; CHECK-NEXT: [[TMP49:%.]] = load i16, i16 [[ARRAYIDX_1_6]], align 2
	; CHECK-NEXT: [[CONV_1_6:%.*]] = zext i16 [[TMP49]] to i32
	; CHECK-NEXT: [[ADD_1_6:%.*]] = add nuw nsw i32 [[ADD_661]], [[CONV_1_6]]
	; CHECK-NEXT: [[MUL_1_6:%.*]] = mul nuw nsw i32 [[CONV_1_6]], [[CONV_1_6]]
	; CHECK-NEXT: [[ADD11_1_6:%.*]] = add i32 [[MUL_1_6]], [[ADD11_663]]
	; CHECK-NEXT: [[ARRAYIDX_2_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 2
	; CHECK-NEXT: [[TMP50:%.]] = load i16, i16 [[ARRAYIDX_2_6]], align 2
	; CHECK-NEXT: [[CONV_2_6:%.*]] = zext i16 [[TMP50]] to i32
	; CHECK-NEXT: [[ADD_2_6:%.*]] = add nuw nsw i32 [[ADD_1_6]], [[CONV_2_6]]
	; CHECK-NEXT: [[MUL_2_6:%.*]] = mul nuw nsw i32 [[CONV_2_6]], [[CONV_2_6]]
	; CHECK-NEXT: [[ADD11_2_6:%.*]] = add i32 [[MUL_2_6]], [[ADD11_1_6]]
	; CHECK-NEXT: [[ARRAYIDX_3_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 3
	; CHECK-NEXT: [[TMP51:%.]] = load i16, i16 [[ARRAYIDX_3_6]], align 2
	; CHECK-NEXT: [[CONV_3_6:%.*]] = zext i16 [[TMP51]] to i32
	; CHECK-NEXT: [[ADD_3_6:%.*]] = add nuw nsw i32 [[ADD_2_6]], [[CONV_3_6]]
	; CHECK-NEXT: [[MUL_3_6:%.*]] = mul nuw nsw i32 [[CONV_3_6]], [[CONV_3_6]]
	; CHECK-NEXT: [[ADD11_3_6:%.*]] = add i32 [[MUL_3_6]], [[ADD11_2_6]]
	; CHECK-NEXT: [[ARRAYIDX_4_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 4
	; CHECK-NEXT: [[TMP52:%.]] = load i16, i16 [[ARRAYIDX_4_6]], align 2
	; CHECK-NEXT: [[CONV_4_6:%.*]] = zext i16 [[TMP52]] to i32
	; CHECK-NEXT: [[ADD_4_6:%.*]] = add nuw nsw i32 [[ADD_3_6]], [[CONV_4_6]]
	; CHECK-NEXT: [[MUL_4_6:%.*]] = mul nuw nsw i32 [[CONV_4_6]], [[CONV_4_6]]
	; CHECK-NEXT: [[ADD11_4_6:%.*]] = add i32 [[MUL_4_6]], [[ADD11_3_6]]
	; CHECK-NEXT: [[ARRAYIDX_5_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 5
	; CHECK-NEXT: [[TMP53:%.]] = load i16, i16 [[ARRAYIDX_5_6]], align 2
	; CHECK-NEXT: [[CONV_5_6:%.*]] = zext i16 [[TMP53]] to i32
	; CHECK-NEXT: [[ADD_5_6:%.*]] = add nuw nsw i32 [[ADD_4_6]], [[CONV_5_6]]
	; CHECK-NEXT: [[MUL_5_6:%.*]] = mul nuw nsw i32 [[CONV_5_6]], [[CONV_5_6]]
	; CHECK-NEXT: [[ADD11_5_6:%.*]] = add i32 [[MUL_5_6]], [[ADD11_4_6]]
	; CHECK-NEXT: [[ARRAYIDX_6_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 6
	; CHECK-NEXT: [[TMP54:%.]] = load i16, i16 [[ARRAYIDX_6_6]], align 2
	; CHECK-NEXT: [[CONV_6_6:%.*]] = zext i16 [[TMP54]] to i32
	; CHECK-NEXT: [[ADD_6_6:%.*]] = add nuw nsw i32 [[ADD_5_6]], [[CONV_6_6]]
	; CHECK-NEXT: [[MUL_6_6:%.*]] = mul nuw nsw i32 [[CONV_6_6]], [[CONV_6_6]]
	; CHECK-NEXT: [[ADD11_6_6:%.*]] = add i32 [[MUL_6_6]], [[ADD11_5_6]]
	; CHECK-NEXT: [[ARRAYIDX_7_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 7
	; CHECK-NEXT: [[TMP55:%.]] = load i16, i16 [[ARRAYIDX_7_6]], align 2
	; CHECK-NEXT: [[CONV_7_6:%.*]] = zext i16 [[TMP55]] to i32
	; CHECK-NEXT: [[ADD_7_6:%.*]] = add nuw nsw i32 [[ADD_6_6]], [[CONV_7_6]]
	; CHECK-NEXT: [[MUL_7_6:%.*]] = mul nuw nsw i32 [[CONV_7_6]], [[CONV_7_6]]
	; CHECK-NEXT: [[ADD11_7_6:%.*]] = add i32 [[MUL_7_6]], [[ADD11_6_6]]
	; CHECK-NEXT: [[ADD_PTR_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 [[IDX_EXT]]			; CHECK-NEXT: [[ADD_PTR_6:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_5]], i64 [[IDX_EXT]]
	; CHECK-NEXT: [[TMP56:%.]] = load i16, i16 [[ADD_PTR_6]], align 2			; CHECK-NEXT: [[TMP0:%.]] = bitcast i16 [[P]] to <8 x i16>*
	; CHECK-NEXT: [[CONV_764:%.*]] = zext i16 [[TMP56]] to i32			; CHECK-NEXT: [[TMP1:%.]] = load <8 x i16>, <8 x i16> [[TMP0]], align 2
	; CHECK-NEXT: [[ADD_765:%.*]] = add nuw nsw i32 [[ADD_7_6]], [[CONV_764]]			; CHECK-NEXT: [[TMP2:%.]] = bitcast i16 [[ADD_PTR]] to <8 x i16>*
	; CHECK-NEXT: [[MUL_766:%.*]] = mul nuw nsw i32 [[CONV_764]], [[CONV_764]]			; CHECK-NEXT: [[TMP3:%.]] = load <8 x i16>, <8 x i16> [[TMP2]], align 2
	; CHECK-NEXT: [[ADD11_767:%.*]] = add i32 [[MUL_766]], [[ADD11_7_6]]			; CHECK-NEXT: [[TMP4:%.]] = bitcast i16 [[ADD_PTR_1]] to <8 x i16>*
	; CHECK-NEXT: [[ARRAYIDX_1_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 1			; CHECK-NEXT: [[TMP5:%.]] = load <8 x i16>, <8 x i16> [[TMP4]], align 2
	; CHECK-NEXT: [[TMP57:%.]] = load i16, i16 [[ARRAYIDX_1_7]], align 2			; CHECK-NEXT: [[TMP6:%.]] = bitcast i16 [[ADD_PTR_2]] to <8 x i16>*
	; CHECK-NEXT: [[CONV_1_7:%.*]] = zext i16 [[TMP57]] to i32			; CHECK-NEXT: [[TMP7:%.]] = load <8 x i16>, <8 x i16> [[TMP6]], align 2
	; CHECK-NEXT: [[ADD_1_7:%.*]] = add nuw nsw i32 [[ADD_765]], [[CONV_1_7]]			; CHECK-NEXT: [[TMP8:%.]] = bitcast i16 [[ADD_PTR_3]] to <8 x i16>*
	; CHECK-NEXT: [[MUL_1_7:%.*]] = mul nuw nsw i32 [[CONV_1_7]], [[CONV_1_7]]			; CHECK-NEXT: [[TMP9:%.]] = load <8 x i16>, <8 x i16> [[TMP8]], align 2
	; CHECK-NEXT: [[ADD11_1_7:%.*]] = add i32 [[MUL_1_7]], [[ADD11_767]]			; CHECK-NEXT: [[TMP10:%.]] = bitcast i16 [[ADD_PTR_4]] to <8 x i16>*
	; CHECK-NEXT: [[ARRAYIDX_2_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 2			; CHECK-NEXT: [[TMP11:%.]] = load <8 x i16>, <8 x i16> [[TMP10]], align 2
	; CHECK-NEXT: [[TMP58:%.]] = load i16, i16 [[ARRAYIDX_2_7]], align 2			; CHECK-NEXT: [[TMP12:%.]] = bitcast i16 [[ADD_PTR_5]] to <8 x i16>*
	; CHECK-NEXT: [[CONV_2_7:%.*]] = zext i16 [[TMP58]] to i32			; CHECK-NEXT: [[TMP13:%.]] = load <8 x i16>, <8 x i16> [[TMP12]], align 2
	; CHECK-NEXT: [[ADD_2_7:%.*]] = add nuw nsw i32 [[ADD_1_7]], [[CONV_2_7]]			; CHECK-NEXT: [[TMP14:%.]] = bitcast i16 [[ADD_PTR_6]] to <8 x i16>*
	; CHECK-NEXT: [[MUL_2_7:%.*]] = mul nuw nsw i32 [[CONV_2_7]], [[CONV_2_7]]			; CHECK-NEXT: [[TMP15:%.]] = load <8 x i16>, <8 x i16> [[TMP14]], align 2
	; CHECK-NEXT: [[ADD11_2_7:%.*]] = add i32 [[MUL_2_7]], [[ADD11_1_7]]			; CHECK-NEXT: [[TMP16:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[ARRAYIDX_3_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 3			; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <8 x i16> [[TMP3]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP59:%.]] = load i16, i16 [[ARRAYIDX_3_7]], align 2			; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <64 x i16> [[TMP16]], <64 x i16> [[TMP17]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[CONV_3_7:%.*]] = zext i16 [[TMP59]] to i32			; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <8 x i16> [[TMP5]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[ADD_3_7:%.*]] = add nuw nsw i32 [[ADD_2_7]], [[CONV_3_7]]			; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <64 x i16> [[TMP18]], <64 x i16> [[TMP19]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[MUL_3_7:%.*]] = mul nuw nsw i32 [[CONV_3_7]], [[CONV_3_7]]			; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <8 x i16> [[TMP7]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[ADD11_3_7:%.*]] = add i32 [[MUL_3_7]], [[ADD11_2_7]]			; CHECK-NEXT: [[TMP22:%.*]] = shufflevector <64 x i16> [[TMP20]], <64 x i16> [[TMP21]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[ARRAYIDX_4_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 4			; CHECK-NEXT: [[TMP23:%.*]] = shufflevector <8 x i16> [[TMP9]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP60:%.]] = load i16, i16 [[ARRAYIDX_4_7]], align 2			; CHECK-NEXT: [[TMP24:%.*]] = shufflevector <64 x i16> [[TMP22]], <64 x i16> [[TMP23]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[CONV_4_7:%.*]] = zext i16 [[TMP60]] to i32			; CHECK-NEXT: [[TMP25:%.*]] = shufflevector <8 x i16> [[TMP11]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[ADD_4_7:%.*]] = add nuw nsw i32 [[ADD_3_7]], [[CONV_4_7]]			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <64 x i16> [[TMP24]], <64 x i16> [[TMP25]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[MUL_4_7:%.*]] = mul nuw nsw i32 [[CONV_4_7]], [[CONV_4_7]]			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <8 x i16> [[TMP13]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[ADD11_4_7:%.*]] = add i32 [[MUL_4_7]], [[ADD11_3_7]]			; CHECK-NEXT: [[TMP28:%.*]] = shufflevector <64 x i16> [[TMP26]], <64 x i16> [[TMP27]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[ARRAYIDX_5_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 5			; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <8 x i16> [[TMP15]], <8 x i16> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP61:%.]] = load i16, i16 [[ARRAYIDX_5_7]], align 2			; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <64 x i16> [[TMP28]], <64 x i16> [[TMP29]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71>
	; CHECK-NEXT: [[CONV_5_7:%.*]] = zext i16 [[TMP61]] to i32			; CHECK-NEXT: [[TMP31:%.*]] = zext <64 x i16> [[TMP30]] to <64 x i32>
	; CHECK-NEXT: [[ADD_5_7:%.*]] = add nuw nsw i32 [[ADD_4_7]], [[CONV_5_7]]			; CHECK-NEXT: [[TMP32:%.*]] = mul nuw nsw <64 x i32> [[TMP31]], [[TMP31]]
	; CHECK-NEXT: [[MUL_5_7:%.*]] = mul nuw nsw i32 [[CONV_5_7]], [[CONV_5_7]]			; CHECK-NEXT: [[TMP33:%.*]] = call i32 @llvm.vector.reduce.add.v64i32(<64 x i32> [[TMP31]])
	; CHECK-NEXT: [[ADD11_5_7:%.*]] = add i32 [[MUL_5_7]], [[ADD11_4_7]]			; CHECK-NEXT: [[TMP34:%.*]] = call i32 @llvm.vector.reduce.add.v64i32(<64 x i32> [[TMP32]])
	; CHECK-NEXT: [[ARRAYIDX_6_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 6			; CHECK-NEXT: [[CONV15:%.*]] = zext i32 [[TMP33]] to i64
	; CHECK-NEXT: [[TMP62:%.]] = load i16, i16 [[ARRAYIDX_6_7]], align 2			; CHECK-NEXT: [[CONV16:%.*]] = zext i32 [[TMP34]] to i64
	; CHECK-NEXT: [[CONV_6_7:%.*]] = zext i16 [[TMP62]] to i32
	; CHECK-NEXT: [[ADD_6_7:%.*]] = add nuw nsw i32 [[ADD_5_7]], [[CONV_6_7]]
	; CHECK-NEXT: [[MUL_6_7:%.*]] = mul nuw nsw i32 [[CONV_6_7]], [[CONV_6_7]]
	; CHECK-NEXT: [[ADD11_6_7:%.*]] = add i32 [[MUL_6_7]], [[ADD11_5_7]]
	; CHECK-NEXT: [[ARRAYIDX_7_7:%.]] = getelementptr inbounds i16, i16 [[ADD_PTR_6]], i64 7
	; CHECK-NEXT: [[TMP63:%.]] = load i16, i16 [[ARRAYIDX_7_7]], align 2
	; CHECK-NEXT: [[CONV_7_7:%.*]] = zext i16 [[TMP63]] to i32
	; CHECK-NEXT: [[ADD_7_7:%.*]] = add nuw nsw i32 [[ADD_6_7]], [[CONV_7_7]]
	; CHECK-NEXT: [[MUL_7_7:%.*]] = mul nuw nsw i32 [[CONV_7_7]], [[CONV_7_7]]
	; CHECK-NEXT: [[ADD11_7_7:%.*]] = add i32 [[MUL_7_7]], [[ADD11_6_7]]
	; CHECK-NEXT: [[CONV15:%.*]] = zext i32 [[ADD_7_7]] to i64
	; CHECK-NEXT: [[CONV16:%.*]] = zext i32 [[ADD11_7_7]] to i64
	; CHECK-NEXT: [[SHL:%.*]] = shl nuw i64 [[CONV16]], 32			; CHECK-NEXT: [[SHL:%.*]] = shl nuw i64 [[CONV16]], 32
	; CHECK-NEXT: [[ADD17:%.*]] = or i64 [[SHL]], [[CONV15]]			; CHECK-NEXT: [[ADD17:%.*]] = or i64 [[SHL]], [[CONV15]]
	; CHECK-NEXT: ret i64 [[ADD17]]			; CHECK-NEXT: ret i64 [[ADD17]]
	;			;
	entry:			entry:
	%idx.ext = sext i32 %st to i64			%idx.ext = sext i32 %st to i64
	%0 = load i16, i16* %p, align 2			%0 = load i16, i16* %p, align 2
	%conv = zext i16 %0 to i32			%conv = zext i16 %0 to i32
	▲ Show 20 Lines • Show All 619 Lines • Show Last 20 Lines