This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
- llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll
- llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

Differential D60897

[SLP] Look-ahead operand reordering heuristic.
ClosedPublic

Authored by vporpo on Apr 18 2019, 4:36 PM.

Details

Reviewers

RKSimon
ABataev
dtemirbulatov
Ayal
hfinkel
rnk

Commits

rG6a18a9548761: [SLP] Look-ahead operand reordering heuristic.
rGcf47ff5ffb1a: [SLP] Recommit: Look-ahead operand reordering heuristic.
rL364964: [SLP] Recommit: Look-ahead operand reordering heuristic.
rG574cb0eb3a7a: [SLP] Look-ahead operand reordering heuristic.
rL364478: [SLP] Look-ahead operand reordering heuristic.
rG5698921be2d5: [SLP] Look-ahead operand reordering heuristic.
rL364084: [SLP] Look-ahead operand reordering heuristic.

Summary

This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).
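
To make the tie-breaking idea concrete, here is a minimal, self-contained sketch that does not use LLVM. The `Node` type, the `shallow` and `lookAhead` helpers, and the literal scores 3/2/0 are illustrative assumptions modeled on the G1..G4 example in the patch's doc comment. Unlike the patch, which greedily pairs each operand with its best match, this sketch simply tries both operand pairings of a two-operand add.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical toy expression node; this is not an LLVM class.
struct Node {
  char Op;         // '+' for a commutative add, 'L' for a load.
  int Base;        // Array id for loads (0 = A, 1 = B, 2 = C, 3 = D).
  int Index;       // Element index for loads.
  const Node *Lhs; // Operands of an add; null for loads.
  const Node *Rhs;
};

// Score of placing A and B in consecutive lanes, looking only at A and B.
int shallow(const Node &A, const Node &B) {
  if (A.Op == 'L' && B.Op == 'L')
    return (A.Base == B.Base && A.Index + 1 == B.Index) ? 3 : 0;
  return A.Op == B.Op ? 2 : 0;
}

// Adds the best operand-pairing score below A and B, down to MaxLevel.
int lookAhead(const Node &A, const Node &B, int Level, int MaxLevel) {
  int Score = shallow(A, B);
  // Stop at the depth limit, on a mismatch, or at leaves (loads).
  if (Level == MaxLevel || Score == 0 || A.Op == 'L')
    return Score;
  assert(A.Lhs && A.Rhs && B.Lhs && B.Rhs && "adds need two operands");
  int Straight = lookAhead(*A.Lhs, *B.Lhs, Level + 1, MaxLevel) +
                 lookAhead(*A.Rhs, *B.Rhs, Level + 1, MaxLevel);
  int Crossed = lookAhead(*A.Lhs, *B.Rhs, Level + 1, MaxLevel) +
                lookAhead(*A.Rhs, *B.Lhs, Level + 1, MaxLevel);
  return Score + std::max(Straight, Crossed);
}

// Loads A[0], B[0], A[1], B[1], C[0], D[0].
const Node A0{'L', 0, 0, nullptr, nullptr};
const Node B0{'L', 1, 0, nullptr, nullptr};
const Node A1{'L', 0, 1, nullptr, nullptr};
const Node B1{'L', 1, 1, nullptr, nullptr};
const Node C0{'L', 2, 0, nullptr, nullptr};
const Node D0{'L', 3, 0, nullptr, nullptr};

// G1 = A[0]+B[0], G2 = A[1]+B[1], G3 = C[0]+D[0], G4 = B[1]+A[1].
const Node G1{'+', 0, 0, &A0, &B0};
const Node G2{'+', 0, 0, &A1, &B1};
const Node G3{'+', 0, 0, &C0, &D0};
const Node G4{'+', 0, 0, &B1, &A1};
```

With a depth of 1, `lookAhead(G1, X, 1, 1)` returns 2 for each of G2, G3 and G4, so the candidates tie. With a depth of 2, G2 and G4 score 8 while G3 stays at 2, which is the kind of tie the heuristic is meant to break.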

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 40727
Build 40857: arc lint + arc unit

Event Timeline

vporpo created this revision. Apr 18 2019, 4:36 PM

Herald added a subscriber: llvm-commits. Apr 18 2019, 4:36 PM

RKSimon added inline comments. Apr 27 2019, 8:35 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
149 ↗	(On Diff #195824)	"slp-max-look-ahead-depth" might be better?
744 ↗	(On Diff #195824)	These kinds of hard coded costs scare me, especially without any context/description.

vporpo updated this revision to Diff 197190. Apr 29 2019, 2:31 PM

vporpo marked 2 inline comments as done.

vporpo added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
744 ↗	(On Diff #195824)	I agree, they look quite scary, but the values themselves are not very important. What we really care about is figuring out whether the values could potentially be vectorized or not. The relative differences between the scores are not very important either. They just show that matching loads is usually preferable to matching instructions with the same opcode, etc. We could still use the same score of 1 for all of them and we would still get decent reordering. Another alternative would be to check TTI for each potential candidate vector, but that looks like overkill.

Tests?

lib/Transforms/Vectorize/SLPVectorizer.cpp
756 ↗	(On Diff #197190)	Seems to me, the code is not formatted
757 ↗	(On Diff #197190)	`auto *`
758 ↗	(On Diff #197190)	`auto *`
764 ↗	(On Diff #197190)	`auto *`
765 ↗	(On Diff #197190)	`auto *`
769 ↗	(On Diff #197190)	`auto *`
770 ↗	(On Diff #197190)	`auto *`
815 ↗	(On Diff #197190)	`auto *`
816 ↗	(On Diff #197190)	`auto *`

vporpo updated this revision to Diff 197195. Apr 29 2019, 3:13 PM

vporpo marked 9 inline comments as done.

Better to commit the test itself at first, without this patch

Rebased.

RKSimon added inline comments. May 1 2019, 9:41 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
760 ↗	(On Diff #197366)	(style) remove outer brackets
773 ↗	(On Diff #197366)	What happens in the case where we have alt opcodes? Should we have a preference for all the same opcode vs with alt-opcode? Sometimes the alt-opcodes will fold away (shl + mul etc.) - other times it won't (shl + lshr).

Addressed comments and updated lit test.

lib/Transforms/Vectorize/SLPVectorizer.cpp
773 ↗	(On Diff #197366)	Hmm, good point. Well, currently `getScoreAtLevelRec()` will simply walk past the alt instructions and will assign them `ScoreSameOpcode`. This does not look very accurate, because alt opcodes usually require shuffles and should have a lower score. I introduced a new `ScoreAltOpcodes = 1` so that alt opcodes are not given the same score as identical opcodes (please see the new functions in the lit test). As for the alt opcodes that fold away, maybe that should be fixed in `getSameOpcode()` and in `struct InstructionsState`? If they get folded then maybe `isAltShuffle()` should return false?
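
The scoring distinction described in this comment can be sketched in isolation as follows. This is a simplified illustration, not the patch's code: the `ToyOpcode` enum and both helper functions are hypothetical, and the rule "two different binary operators form an alt pair" is an assumption standing in for what `getSameOpcode()` and `isAltShuffle()` decide in the real vectorizer. Only the constant names and values are taken from the diff.

```cpp
// Hypothetical opcode set for the sketch; not LLVM's instruction opcodes.
enum class ToyOpcode { Add, Sub, Mul, ZExt };

// Values as listed in the diff.
constexpr int ScoreSameOpcode = 2;
constexpr int ScoreAltOpcodes = 1;
constexpr int ScoreFail = 0;

// Assumption of the sketch: only binary operators can form an alt pair.
constexpr bool isToyBinaryOp(ToyOpcode Op) { return Op != ToyOpcode::ZExt; }

// Identical opcodes score highest, alt pairs (e.g. add + sub) score lower
// because they usually need a shuffle, and anything else fails to match.
constexpr int toyOpcodeScore(ToyOpcode Op1, ToyOpcode Op2) {
  if (Op1 == Op2)
    return ScoreSameOpcode;
  if (isToyBinaryOp(Op1) && isToyBinaryOp(Op2))
    return ScoreAltOpcodes;
  return ScoreFail;
}
```

For example, `toyOpcodeScore(ToyOpcode::Add, ToyOpcode::Sub)` yields `ScoreAltOpcodes`, so an add/add pairing is preferred over an add/sub pairing when both are available.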

Updated getBestOperand() to use getLookAheadScore() for Load and Constant, not just Opcode.

please can you rebase this?

Rebased.

dtemirbulatov added inline comments. May 22 2019, 12:26 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
932 ↗	(On Diff #200564)	one extra space here.

oh, I notice some regression with the change for AArch64/matmul.ll and AArch64/transpose.ll. Maybe there is a way to isolate it with heuristics?

Yes, I will take a look. Maybe it is worth using TTI for the scores after all.

rcorcs added a subscriber: rcorcs. May 28 2019, 4:57 AM

I investigated the two AArch64 failing tests. These tests feature the exact problem that we are trying to solve with this look-ahead heuristic. A commutative instruction had operands of the same opcode that the current heuristic has no way of reordering in an informed way. The current reordering was just lucky to pick the proper one, while the look-ahead heuristic was reordering the operands according to the score. However, the problem was that the score calculation was not considering external uses and was therefore favoring a sub-optimal operand ordering.

I updated the patch to factor in the cost of external uses, and both failures are now gone. I also updated the lit-test with a test that shows the problem with the external-uses.
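
A rough sketch of how an external-use penalty can lower a pairing's score is shown below. The `ToyUser` struct and both functions are hypothetical simplifications; the patch itself looks users up in the SLP tree and in the look-ahead set. Only the two cost constants and the "clamp at the fail score" step are taken from the diff.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one user of a candidate value.
struct ToyUser {
  bool InVectorizedCode; // True if the user is in the SLP tree or look-ahead set.
  int Lane;              // Lane of the user; ignored for external users.
};

// Values as listed in the diff.
constexpr int ExternalUseCost = 1;
constexpr int UserInDiffLaneCost = ExternalUseCost;

// Sums a penalty for every user that is external or sits in another lane.
int toyExternalUsesCost(const std::vector<ToyUser> &Users, int Lane) {
  int Cost = 0;
  for (const ToyUser &U : Users) {
    if (!U.InVectorizedCode)
      Cost += ExternalUseCost;
    else if (U.Lane != Lane)
      Cost += UserInDiffLaneCost;
  }
  return Cost;
}

// The penalty is subtracted from the shallow score, never going below 0.
int toyAdjustedScore(int ShallowScore, int UsesCost) {
  return std::max(0, ShallowScore - UsesCost);
}
```

A value whose users all sit in the same lane of the vectorized code keeps its full score, while one with several outside users can drop to the fail score and lose the tie.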

Rebased.

Hmm, I have another failure with this change on my setup; now it is PR39774.ll. It is probably a difference in the sort algorithm or something similar, since it is just a swap of two inserts. I don't have this failure if PR39774.ll is left intact.

Removed changes in PR39774.ll.

Hmm I am now getting the same failure as you @dtemirbulatov . I am not sure what was wrong before, but it seems that the change in PR39774.ll is no longer needed.

LGTM.

This revision is now accepted and ready to land. Jun 13 2019, 5:35 AM

@ABataev @RKSimon any comments?

RKSimon mentioned this in rG72186a24942f: [SLP][X86] Add lookahead reordering tests from D60897. Jun 20 2019, 5:52 AM

RKSimon mentioned this in rL363925: [SLP][X86] Add lookahead reordering tests from D60897.

@vporpo I've committed the lookahead.ll changes at rL363925 with current (trunk) codegen - please can you rebase?

RKSimon requested changes to this revision. Jun 20 2019, 6:15 AM

This revision now requires changes to proceed. Jun 20 2019, 6:15 AM

Rebased

LGTM - thanks!

This revision is now accepted and ready to land. Jun 21 2019, 3:59 AM

Thank you for the reviews. Please commit the patch.

Closed by commit rL364084: [SLP] Look-ahead operand reordering heuristic. (authored by RKSimon). Jun 21 2019, 10:57 AM

This revision was automatically updated to reflect the committed changes.

This was reverted in r364111 since it was causing a failure in Chromium reported by @rnk.

This revision is now accepted and ready to land. Jun 25 2019, 12:05 AM

@rnk do you have a repro yet please?

This revision now requires changes to proceed. Jun 25 2019, 12:18 AM

Yes, @RKSimon . It was posted in llvm-commits. I did reproduce it and I will update the patch with the fix + lit test.

Fixed crash in chromium reported by @rnk.

The crash was caused by two call instructions with different number of arguments (see lookahead_crash() function in lit test).

Harbormaster completed remote builds in B33863: Diff 206383. Jun 25 2019, 1:08 AM

vporpo marked an inline comment as done. Jun 25 2019, 1:15 AM

vporpo added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
911 ↗	(On Diff #206383)	This was the cause of the crash. There was no `std::min` here, so ToIdx could be `OpIdx1 + 1` even if `I2` had fewer operands than that.
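
The index clamp described in this comment can be illustrated with a small standalone function. The name `candidateOperands` and its parameters are hypothetical. The `std::min` on the end index mirrors the fix; clamping the start index as well is a choice of this sketch so that it returns an empty range when `I2` has too few operands, and is not something the patch does.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the operand indexes of a hypothetical instruction I2 that may be
// paired with operand OpIdx1 of I1. NumOperands2 is I2's operand count.
std::vector<unsigned> candidateOperands(unsigned OpIdx1, unsigned NumOperands2,
                                        bool I2IsCommutative) {
  // A commutative I2 may pair any of its operands with OpIdx1.
  // Otherwise only the operand at the same position is a candidate.
  unsigned FromIdx = I2IsCommutative ? 0 : std::min(NumOperands2, OpIdx1);
  // Without the clamp, ToIdx could be OpIdx1 + 1 even when I2 has fewer
  // operands than that, and the caller would read past I2's operand list.
  unsigned ToIdx =
      I2IsCommutative ? NumOperands2 : std::min(NumOperands2, OpIdx1 + 1);
  assert(FromIdx <= ToIdx && "Bad index");

  std::vector<unsigned> Result;
  for (unsigned OpIdx2 = FromIdx; OpIdx2 < ToIdx; ++OpIdx2)
    Result.push_back(OpIdx2);
  return Result;
}
```

For two calls with a different number of arguments, asking for the candidates of an operand index that does not exist in the shorter call, for instance `candidateOperands(2, 2, false)`, yields no candidates instead of an out-of-range index.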

LGTM

This revision is now accepted and ready to land. Jun 26 2019, 4:31 AM

Closed by commit rL364478: [SLP] Look-ahead operand reordering heuristic. (authored by vporpo). Jun 26 2019, 2:30 PM

This revision was automatically updated to reflect the committed changes.

This change has caused a massive increase in build times when using LTO. On my workstations, when building the Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we have seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

In D60897#1562427, @phosek wrote:

This change has caused a massive increase in build times when using LTO. On my workstations, when building the Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we have seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

Reversion makes sense to me - @phosek are you in a position to profile it to see where the time is going in SLP please?

This revision is now accepted and ready to land. Jun 28 2019, 9:54 AM

RKSimon requested changes to this revision. Jun 28 2019, 9:54 AM

This revision now requires changes to proceed. Jun 28 2019, 9:54 AM

I think there are two possible causes for the compilation time increase:

Line 901: We can restrict the number of operands to a max of 2.
Line 820: We can restrict the visited users to a max of 2.

I can either create a quick patch, or I can revert it. Either is fine.

In D60897#1562519, @vporpo wrote:

I think there are two possible causes for the compilation time increase:

Line 901: We can restrict the number of operands to a max of 2.

Line 820: We can restrict the visited users to a max of 2.

I can either create a quick patch, or I can revert it. Either is fine.

If you can create a quick patch, I'd be happy to try it locally to see if it addresses the problem.

Thanks, that would be really helpful @phosek . Let me create the quick patch.

vporpo mentioned this in D63948: [SLP] Limit compilation time of look-ahead operand reordering heuristic. Jun 28 2019, 11:42 AM

The fix for the compilation-time issue reported by @phosek is in D63948 .

Fixed the compilation-time issue with the introduction of the user budget limit from D63948. Also added lit test for it.
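
The budget idea can be sketched as a loop that stops after visiting a fixed number of users, regardless of how many users a value has. The function `toyBudgetedCost` and its `std::vector<bool>` input are hypothetical stand-ins for walking `V->users()`; the decrement-and-break shape follows the loop in the diff. The sketch assumes a positive budget.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Walks a list of users (true = external user) and returns the accumulated
// cost together with the number of users actually visited.
std::pair<int, unsigned> toyBudgetedCost(const std::vector<bool> &UserIsExternal,
                                         unsigned UsersBudget) {
  assert(UsersBudget > 0 && "the sketch assumes a positive budget");
  int Cost = 0;
  unsigned Visited = 0;
  for (bool IsExternal : UserIsExternal) {
    ++Visited;
    if (IsExternal)
      ++Cost;
    // Limit the number of visited users to cap compilation time.
    if (--UsersBudget == 0)
      break;
  }
  return {Cost, Visited};
}
```

With a budget of 2, a value that has a thousand users still costs only two iterations, which bounds the work done per candidate pair.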

Herald added a subscriber: hiraditya. Jul 1 2019, 3:06 PM

Harbormaster completed remote builds in B34163: Diff 207417. Jul 1 2019, 3:07 PM

LGTM

This revision is now accepted and ready to land. Jul 2 2019, 1:11 AM

Closed by commit rL364964: [SLP] Recommit: Look-ahead operand reordering heuristic. (authored by vporpo). Jul 2 2019, 1:21 PM

This revision was automatically updated to reflect the committed changes.

The following is a simplified test case that shows the performance regression in Eigen. The number of instructions in the inner loop increased from 85 to 93. I hope it is helpful.

#include "third_party/eigen3/Eigen/Core"

template <typename ValueType>
inline void CheckNonnegative(ValueType value) {
  if (!(value >= 0)) {
    fprintf(stderr, "Check failed.\n");
  }
}

template <typename ValueType>
inline decltype(std::norm(ValueType(0))) SquaredMagnitude(ValueType value) {
  typedef decltype(std::norm(ValueType(0))) ComponentType;
  ComponentType r = std::real(value);
  ComponentType i = std::imag(value);
  return r * r + i * i;
}

template <typename VectorType2>
class DotEigen {
 public:
  DotEigen(): first_vector_(16), second_vector_(16) {
  }

  inline void Execute() {
    result_ = first_vector_.dot(second_vector_);
    CheckNonnegative(SquaredMagnitude(result_));
  }

 public:
  Eigen::Matrix<float, 16, 1> first_vector_;
  VectorType2 second_vector_;
  std::complex<float> result_;
};

typedef DotEigen<Eigen::Matrix<std::complex<float>, 16, 1>> MOperation;
MOperation* operation;

void BM_Dot_RealComplex_EigenDotFixed(int num_iterations) {
  for (int iter = 0; iter < num_iterations; ++iter) {
    operation->Execute();
  }
}

Thanks @Carrot, I will investigate the issue.

This revision is now accepted and ready to land. Oct 17 2019, 6:55 PM

@Carrot What build settings are you using? I'm not seeing this here: https://godbolt.org/z/4jq0g8

@RKSimon, my command line options are:

-O3 -m64 -msse4.2 -mpclmul -maes -mprefer-vector-width=128 -fexperimental-new-pass-manager -fPIE

I compared r364964 and r364963.

This fixes the issue reported by @Carrot. It updates getShallowScore() to better cope with extracts from consecutive locations of the same vector (see last lit test). @Carrot, if you have the time please verify that this patch no longer causes a regression. Thanks!
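
The extract check mentioned here can be sketched as a comparison on plain data. `ToyExtract` and `toyExtractScore` are hypothetical; in the patch the same test is performed on `ExtractElementInst` operands, and a pair that fails it falls through to the generic instruction checks rather than returning the fail score directly as this sketch does.

```cpp
// Hypothetical stand-in for an extractelement: which vector, which index.
struct ToyExtract {
  int VectorId;
  unsigned Index;
};

// Values as listed in the diff.
constexpr int ScoreConsecutiveExtracts = 3;
constexpr int ScoreFail = 0;

// Extracts from consecutive indexes of the same vector get the high score,
// since such extracts could be optimized away after vectorization.
constexpr int toyExtractScore(const ToyExtract &E1, const ToyExtract &E2) {
  return (E1.VectorId == E2.VectorId && E1.Index + 1 == E2.Index)
             ? ScoreConsecutiveExtracts
             : ScoreFail;
}
```

Note that the check is ordered: extracting index 1 and then index 0 of the same vector does not count as consecutive.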

Harbormaster completed remote builds in B39873: Diff 225923. Oct 21 2019, 11:05 AM

@vporpo, it's verified that the eigen test doesn't regress now, actually the inner loop is the same as before.
Thanks

Please reland this patch again.

@RKSimon any comments about the latest changes?

Ping @RKSimon

@vporpo Please can you rebase this?

Rebased.

Harbormaster completed remote builds in B40727: Diff 228629. Nov 10 2019, 8:41 PM

LGTM - cheers

Closed by commit rG6a18a9548761: [SLP] Look-ahead operand reordering heuristic. (authored by vporpo). Nov 11 2019, 9:21 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	306 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll	99 lines
llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll	256 lines

Diff 228629

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 141 Lines • ▼ Show 20 Lines
static cl::opt<unsigned> RecursionMaxDepth(		static cl::opt<unsigned> RecursionMaxDepth(
"slp-recursion-max-depth", cl::init(12), cl::Hidden,		"slp-recursion-max-depth", cl::init(12), cl::Hidden,
cl::desc("Limit the recursion depth when building a vectorizable tree"));		cl::desc("Limit the recursion depth when building a vectorizable tree"));

static cl::opt<unsigned> MinTreeSize(		static cl::opt<unsigned> MinTreeSize(
"slp-min-tree-size", cl::init(3), cl::Hidden,		"slp-min-tree-size", cl::init(3), cl::Hidden,
cl::desc("Only vectorize small trees if they are fully vectorizable"));		cl::desc("Only vectorize small trees if they are fully vectorizable"));

		// The maximum depth that the look-ahead score heuristic will explore.
		// The higher this value, the higher the compilation time overhead.
		static cl::opt<int> LookAheadMaxDepth(
		"slp-max-look-ahead-depth", cl::init(2), cl::Hidden,
		cl::desc("The maximum look-ahead depth for operand reordering scores"));

		// The Look-ahead heuristic goes through the users of the bundle to calculate
		// the users cost in getExternalUsesCost(). To avoid compilation time increase
		// we limit the number of users visited to this value.
		static cl::opt<unsigned> LookAheadUsersBudget(
		"slp-look-ahead-users-budget", cl::init(2), cl::Hidden,
		cl::desc("The maximum number of users to visit while visiting the "
		"predecessors. This prevents compilation time increase."));

static cl::opt<bool>		static cl::opt<bool>
ViewSLPTree("view-slp-tree", cl::Hidden,		ViewSLPTree("view-slp-tree", cl::Hidden,
cl::desc("Display the SLP trees with Graphviz"));		cl::desc("Display the SLP trees with Graphviz"));

// Limit the number of alias checks. The limit is chosen so that		// Limit the number of alias checks. The limit is chosen so that
// it has no negative effect on the llvm benchmarks.		// it has no negative effect on the llvm benchmarks.
static const unsigned AliasedCheckLimit = 10;		static const unsigned AliasedCheckLimit = 10;

▲ Show 20 Lines • Show All 558 Lines • ▼ Show 20 Lines	class VLOperands {

using OperandDataVec = SmallVector<OperandData, 2>;		using OperandDataVec = SmallVector<OperandData, 2>;

/// A vector of operand vectors.		/// A vector of operand vectors.
SmallVector<OperandDataVec, 4> OpsVec;		SmallVector<OperandDataVec, 4> OpsVec;

const DataLayout &DL;		const DataLayout &DL;
ScalarEvolution &SE;		ScalarEvolution &SE;
		const BoUpSLP &R;

/// \returns the operand data at \p OpIdx and \p Lane.		/// \returns the operand data at \p OpIdx and \p Lane.
OperandData &getData(unsigned OpIdx, unsigned Lane) {		OperandData &getData(unsigned OpIdx, unsigned Lane) {
return OpsVec[OpIdx][Lane];		return OpsVec[OpIdx][Lane];
}		}

/// \returns the operand data at \p OpIdx and \p Lane. Const version.		/// \returns the operand data at \p OpIdx and \p Lane. Const version.
const OperandData &getData(unsigned OpIdx, unsigned Lane) const {		const OperandData &getData(unsigned OpIdx, unsigned Lane) const {
Show All 9 Lines	void clearUsed() {
OpsVec[OpIdx][Lane].IsUsed = false;		OpsVec[OpIdx][Lane].IsUsed = false;
}		}

/// Swap the operand at \p OpIdx1 with that one at \p OpIdx2.		/// Swap the operand at \p OpIdx1 with that one at \p OpIdx2.
void swap(unsigned OpIdx1, unsigned OpIdx2, unsigned Lane) {		void swap(unsigned OpIdx1, unsigned OpIdx2, unsigned Lane) {
std::swap(OpsVec[OpIdx1][Lane], OpsVec[OpIdx2][Lane]);		std::swap(OpsVec[OpIdx1][Lane], OpsVec[OpIdx2][Lane]);
}		}

		// The hard-coded scores listed here are not very important. When computing
		// the scores of matching one sub-tree with another, we are basically
		// counting the number of values that are matching. So even if all scores
		// are set to 1, we would still get a decent matching result.
		// However, sometimes we have to break ties. For example we may have to
		// choose between matching loads vs matching opcodes. This is what these
		// scores are helping us with: they provide the order of preference.

		/// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]).
		static const int ScoreConsecutiveLoads = 3;
		/// ExtractElementInst from same vector and consecutive indexes.
		static const int ScoreConsecutiveExtracts = 3;
		/// Constants.
		static const int ScoreConstants = 2;
		/// Instructions with the same opcode.
		static const int ScoreSameOpcode = 2;
		/// Instructions with alt opcodes (e.g, add + sub).
		static const int ScoreAltOpcodes = 1;
		/// Identical instructions (a.k.a. splat or broadcast).
		static const int ScoreSplat = 1;
		/// Matching with an undef is preferable to failing.
		static const int ScoreUndef = 1;
		/// Score for failing to find a decent match.
		static const int ScoreFail = 0;
		/// User external to the vectorized code.
		static const int ExternalUseCost = 1;
		/// The user is internal but in a different lane.
		static const int UserInDiffLaneCost = ExternalUseCost;

		/// \returns the score of placing \p V1 and \p V2 in consecutive lanes.
		static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,
		ScalarEvolution &SE) {
		auto *LI1 = dyn_cast<LoadInst>(V1);
		auto *LI2 = dyn_cast<LoadInst>(V2);
		if (LI1 && LI2)
		return isConsecutiveAccess(LI1, LI2, DL, SE)
		? VLOperands::ScoreConsecutiveLoads
		: VLOperands::ScoreFail;

		auto *C1 = dyn_cast<Constant>(V1);
		auto *C2 = dyn_cast<Constant>(V2);
		if (C1 && C2)
		return VLOperands::ScoreConstants;

		// Extracts from consecutive indexes of the same vector get a better score as
		// the extracts could be optimized away.
		auto *Ex1 = dyn_cast<ExtractElementInst>(V1);
		auto *Ex2 = dyn_cast<ExtractElementInst>(V2);
		if (Ex1 && Ex2 && Ex1->getVectorOperand() == Ex2->getVectorOperand() &&
		cast<ConstantInt>(Ex1->getIndexOperand())->getZExtValue() + 1 ==
		cast<ConstantInt>(Ex2->getIndexOperand())->getZExtValue()) {
		return VLOperands::ScoreConsecutiveExtracts;
		}

		auto *I1 = dyn_cast<Instruction>(V1);
		auto *I2 = dyn_cast<Instruction>(V2);
		if (I1 && I2) {
		if (I1 == I2)
		return VLOperands::ScoreSplat;
		InstructionsState S = getSameOpcode({I1, I2});
		// Note: Only consider instructions with <= 2 operands to avoid
		// complexity explosion.
		if (S.getOpcode() && S.MainOp->getNumOperands() <= 2)
		return S.isAltShuffle() ? VLOperands::ScoreAltOpcodes
		: VLOperands::ScoreSameOpcode;
		}

		if (isa<UndefValue>(V2))
		return VLOperands::ScoreUndef;

		return VLOperands::ScoreFail;
		}

		/// Holds the values and their lane that are taking part in the look-ahead
		/// score calculation. This is used in the external uses cost calculation.
		SmallDenseMap<Value *, int> InLookAheadValues;

		/// \Returns the additional cost due to uses of \p LHS and \p RHS that are
		/// either external to the vectorized code, or require shuffling.
		int getExternalUsesCost(const std::pair<Value *, int> &LHS,
		const std::pair<Value *, int> &RHS) {
		int Cost = 0;
		SmallVector<std::pair<Value *, int>, 2> Values = {LHS, RHS};
		for (int Idx = 0, IdxE = Values.size(); Idx != IdxE; ++Idx) {
		Value *V = Values[Idx].first;
		// Calculate the absolute lane, using the minimum relative lane of LHS
		// and RHS as base and Idx as the offset.
		int Ln = std::min(LHS.second, RHS.second) + Idx;
		assert(Ln >= 0 && "Bad lane calculation");
		unsigned UsersBudget = LookAheadUsersBudget;
		for (User *U : V->users()) {
		if (const TreeEntry *UserTE = R.getTreeEntry(U)) {
		// The user is in the VectorizableTree. Check if we need to insert.
		auto It = llvm::find(UserTE->Scalars, U);
		assert(It != UserTE->Scalars.end() && "U is in UserTE");
		int UserLn = std::distance(UserTE->Scalars.begin(), It);
		assert(UserLn >= 0 && "Bad lane");
		if (UserLn != Ln)
		Cost += UserInDiffLaneCost;
		} else {
		// Check if the user is in the look-ahead code.
		auto It2 = InLookAheadValues.find(U);
		if (It2 != InLookAheadValues.end()) {
		// The user is in the look-ahead code. Check the lane.
		if (It2->second != Ln)
		Cost += UserInDiffLaneCost;
		} else {
		// The user is neither in SLP tree nor in the look-ahead code.
		Cost += ExternalUseCost;
		}
		}
		// Limit the number of visited uses to cap compilation time.
		if (--UsersBudget == 0)
		break;
		}
		}
		return Cost;
		}

		/// Go through the operands of \p LHS and \p RHS recursively until \p
		/// MaxLevel, and return the cumulative score. For example:
		/// \verbatim
		///  A[0]  B[0]  A[1]  B[1]  C[0] D[0]  B[1] A[1]
		///     \ /         \ /         \ /        \ /
		///      +           +           +          +
		///     G1          G2          G3         G4
		/// \endverbatim
		/// The getScoreAtLevelRec(G1, G2) function will try to match the nodes at
		/// each level recursively, accumulating the score. It starts from matching
		/// the additions at level 0, then moves on to the loads (level 1). The
		/// score of G1 and G2 is higher than G1 and G3, because {A[0],A[1]} and
		/// {B[0],B[1]} match with VLOperands::ScoreConsecutiveLoads, while
		/// {A[0],C[0]} has a score of VLOperands::ScoreFail.
		/// Please note that the order of the operands does not matter, as we
		/// evaluate the score of all profitable combinations of operands. In
		/// other words the score of G1 and G4 is the same as G1 and G2. This
		/// heuristic is based on ideas described in:
		/// Look-ahead SLP: Auto-vectorization in the presence of commutative
		/// operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
		/// Luís F. W. Góes
		int getScoreAtLevelRec(const std::pair<Value *, int> &LHS,
		const std::pair<Value *, int> &RHS, int CurrLevel,
		int MaxLevel) {

		Value *V1 = LHS.first;
		Value *V2 = RHS.first;
		// Get the shallow score of V1 and V2.
		int ShallowScoreAtThisLevel =
		std::max((int)ScoreFail, getShallowScore(V1, V2, DL, SE) -
		getExternalUsesCost(LHS, RHS));
		int Lane1 = LHS.second;
		int Lane2 = RHS.second;

		// If reached MaxLevel,
		// or if V1 and V2 are not instructions,
		// or if they are SPLAT,
		// or if they are not consecutive, early return the current cost.
		auto *I1 = dyn_cast<Instruction>(V1);
		auto *I2 = dyn_cast<Instruction>(V2);
		if (CurrLevel == MaxLevel || !(I1 && I2) || I1 == I2 ||
		ShallowScoreAtThisLevel == VLOperands::ScoreFail ||
		(isa<LoadInst>(I1) && isa<LoadInst>(I2) && ShallowScoreAtThisLevel))
		return ShallowScoreAtThisLevel;
		assert(I1 && I2 && "Should have early exited.");

		// Keep track of in-tree values for determining the external-use cost.
		InLookAheadValues[V1] = Lane1;
		InLookAheadValues[V2] = Lane2;

		// Contains the I2 operand indexes that got matched with I1 operands.
		SmallSet<unsigned, 4> Op2Used;

		// Recursion towards the operands of I1 and I2. We are trying all possible
		// operand pairs, and keeping track of the best score.
		for (unsigned OpIdx1 = 0, NumOperands1 = I1->getNumOperands();
		OpIdx1 != NumOperands1; ++OpIdx1) {
		// Try to pair op1I with the best operand of I2.
		int MaxTmpScore = 0;
		unsigned MaxOpIdx2 = 0;
		bool FoundBest = false;
		// If I2 is commutative try all combinations.
		unsigned FromIdx = isCommutative(I2) ? 0 : OpIdx1;
		unsigned ToIdx = isCommutative(I2)
		? I2->getNumOperands()
		: std::min(I2->getNumOperands(), OpIdx1 + 1);
		assert(FromIdx <= ToIdx && "Bad index");
		for (unsigned OpIdx2 = FromIdx; OpIdx2 != ToIdx; ++OpIdx2) {
		// Skip operands already paired with OpIdx1.
		if (Op2Used.count(OpIdx2))
		continue;
		// Recursively calculate the cost at each level
		int TmpScore = getScoreAtLevelRec({I1->getOperand(OpIdx1), Lane1},
		{I2->getOperand(OpIdx2), Lane2},
		CurrLevel + 1, MaxLevel);
		// Look for the best score.
		if (TmpScore > VLOperands::ScoreFail && TmpScore > MaxTmpScore) {
		MaxTmpScore = TmpScore;
		MaxOpIdx2 = OpIdx2;
		FoundBest = true;
		}
		}
		if (FoundBest) {
		// Pair {OpIdx1, MaxOpIdx2} was found to be best. Never revisit it.
		Op2Used.insert(MaxOpIdx2);
		ShallowScoreAtThisLevel += MaxTmpScore;
		}
		}
		return ShallowScoreAtThisLevel;
		}

		/// \Returns the look-ahead score, which tells us how much the sub-trees
		/// rooted at \p LHS and \p RHS match, the more they match the higher the
		/// score. This helps break ties in an informed way when we cannot decide on
		/// the order of the operands by just considering the immediate
		/// predecessors.
		int getLookAheadScore(const std::pair<Value *, int> &LHS,
		const std::pair<Value *, int> &RHS) {
		InLookAheadValues.clear();
		return getScoreAtLevelRec(LHS, RHS, 1, LookAheadMaxDepth);
		}

// Search all operands in Ops[*][Lane] for the one that matches best		// Search all operands in Ops[*][Lane] for the one that matches best
// Ops[OpIdx][LastLane] and return its operand index.		// Ops[OpIdx][LastLane] and return its operand index.
// If no good match can be found, return None.		// If no good match can be found, return None.
Optional<unsigned>		Optional<unsigned>
getBestOperand(unsigned OpIdx, int Lane, int LastLane,		getBestOperand(unsigned OpIdx, int Lane, int LastLane,
ArrayRef<ReorderingMode> ReorderingModes) {		ArrayRef<ReorderingMode> ReorderingModes) {
unsigned NumOperands = getNumOperands();		unsigned NumOperands = getNumOperands();

// The operand of the previous lane at OpIdx.		// The operand of the previous lane at OpIdx.
Value *OpLastLane = getData(OpIdx, LastLane).V;		Value *OpLastLane = getData(OpIdx, LastLane).V;

// Our strategy mode for OpIdx.		// Our strategy mode for OpIdx.
ReorderingMode RMode = ReorderingModes[OpIdx];		ReorderingMode RMode = ReorderingModes[OpIdx];

// The linearized opcode of the operand at OpIdx, Lane.		// The linearized opcode of the operand at OpIdx, Lane.
bool OpIdxAPO = getData(OpIdx, Lane).APO;		bool OpIdxAPO = getData(OpIdx, Lane).APO;

const unsigned BestScore = 2;
const unsigned GoodScore = 1;

// The best operand index and its score.		// The best operand index and its score.
// Sometimes we have more than one option (e.g., Opcode and Undefs), so we		// Sometimes we have more than one option (e.g., Opcode and Undefs), so we
// are using the score to differentiate between the two.		// are using the score to differentiate between the two.
struct BestOpData {		struct BestOpData {
Optional<unsigned> Idx = None;		Optional<unsigned> Idx = None;
unsigned Score = 0;		unsigned Score = 0;
} BestOp;		} BestOp;

Show All 12 Lines	getBestOperand(unsigned OpIdx, int Lane, int LastLane,
// different opcode in the linearized tree form. This would break the		// different opcode in the linearized tree form. This would break the
// semantics.		// semantics.
if (OpAPO != OpIdxAPO)		if (OpAPO != OpIdxAPO)
continue;		continue;

// Look for an operand that matches the current mode.		// Look for an operand that matches the current mode.
switch (RMode) {		switch (RMode) {
case ReorderingMode::Load:		case ReorderingMode::Load:
if (isa<LoadInst>(Op)) {		case ReorderingMode::Constant:
// Figure out which is left and right, so that we can check for		case ReorderingMode::Opcode: {
// consecutive loads
bool LeftToRight = Lane > LastLane;		bool LeftToRight = Lane > LastLane;
Value *OpLeft = (LeftToRight) ? OpLastLane : Op;		Value *OpLeft = (LeftToRight) ? OpLastLane : Op;
Value *OpRight = (LeftToRight) ? Op : OpLastLane;		Value *OpRight = (LeftToRight) ? Op : OpLastLane;
if (isConsecutiveAccess(cast<LoadInst>(OpLeft),		unsigned Score =
cast<LoadInst>(OpRight), DL, SE))		getLookAheadScore({OpLeft, LastLane}, {OpRight, Lane});
BestOp.Idx = Idx;
}
break;
case ReorderingMode::Opcode:
// We accept both Instructions and Undefs, but with different scores.
if ((isa<Instruction>(Op) && isa<Instruction>(OpLastLane) &&
cast<Instruction>(Op)->getOpcode() ==
cast<Instruction>(OpLastLane)->getOpcode()) ||
(isa<UndefValue>(OpLastLane) && isa<Instruction>(Op)) ||
isa<UndefValue>(Op)) {
// An instruction has a higher score than an undef.
unsigned Score = (isa<UndefValue>(Op)) ? GoodScore : BestScore;
if (Score > BestOp.Score) {		if (Score > BestOp.Score) {
BestOp.Idx = Idx;		BestOp.Idx = Idx;
BestOp.Score = Score;		BestOp.Score = Score;
}		}
}
break;		break;
case ReorderingMode::Constant:
if (isa<Constant>(Op)) {
unsigned Score = (isa<UndefValue>(Op)) ? GoodScore : BestScore;
if (Score > BestOp.Score) {
BestOp.Idx = Idx;
BestOp.Score = Score;
}		}
}
break;
case ReorderingMode::Splat:		case ReorderingMode::Splat:
if (Op == OpLastLane)		if (Op == OpLastLane)
BestOp.Idx = Idx;		BestOp.Idx = Idx;
break;		break;
case ReorderingMode::Failed:		case ReorderingMode::Failed:
return None;		return None;
}		}
}		}
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines	bool shouldBroadcast(Value *Op, unsigned OpIdx, unsigned Lane) {
return false;		return false;
}		}
return true;		return true;
}		}

public:		public:
/// Initialize with all the operands of the instruction vector \p RootVL.		/// Initialize with all the operands of the instruction vector \p RootVL.
VLOperands(ArrayRef<Value *> RootVL, const DataLayout &DL,		VLOperands(ArrayRef<Value *> RootVL, const DataLayout &DL,
ScalarEvolution &SE)		ScalarEvolution &SE, const BoUpSLP &R)
: DL(DL), SE(SE) {		: DL(DL), SE(SE), R(R) {
// Append all the operands of RootVL.		// Append all the operands of RootVL.
appendOperandsOfVL(RootVL);		appendOperandsOfVL(RootVL);
}		}

/// \Returns a value vector with the operands across all lanes for the		/// \Returns a value vector with the operands across all lanes for the
/// operand at \p OpIdx.		/// operand at \p OpIdx.
ValueList getVL(unsigned OpIdx) const {		ValueList getVL(unsigned OpIdx) const {
ValueList OpVL(OpsVec[OpIdx].size());		ValueList OpVL(OpsVec[OpIdx].size());
▲ Show 20 Lines • Show All 212 Lines • ▼ Show 20 Lines	private:
bool isFullyVectorizableTinyTree() const;		bool isFullyVectorizableTinyTree() const;

/// Reorder commutative or alt operands to get better probability of		/// Reorder commutative or alt operands to get better probability of
/// generating vectorized code.		/// generating vectorized code.
static void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,		static void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right,		SmallVectorImpl<Value *> &Right,
const DataLayout &DL,		const DataLayout &DL,
ScalarEvolution &SE);		ScalarEvolution &SE,
		const BoUpSLP &R);
struct TreeEntry {		struct TreeEntry {
using VecTreeTy = SmallVector<std::unique_ptr<TreeEntry>, 8>;		using VecTreeTy = SmallVector<std::unique_ptr<TreeEntry>, 8>;
TreeEntry(VecTreeTy &Container) : Container(Container) {}		TreeEntry(VecTreeTy &Container) : Container(Container) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
if (VL.size() == Scalars.size())		if (VL.size() == Scalars.size())
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
▲ Show 20 Lines • Show All 1,344 Lines • ▼ Show 20 Lines	case Instruction::FCmp: {
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: added a vector of compares.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of compares.\n");

ValueList Left, Right;		ValueList Left, Right;
if (cast<CmpInst>(VL0)->isCommutative()) {		if (cast<CmpInst>(VL0)->isCommutative()) {
// Commutative predicate - collect + sort operands of the instructions		// Commutative predicate - collect + sort operands of the instructions
// so that each side is more likely to have the same opcode.		// so that each side is more likely to have the same opcode.
assert(P0 == SwapP0 && "Commutative Predicate mismatch");		assert(P0 == SwapP0 && "Commutative Predicate mismatch");
reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE);		reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE, *this);
} else {		} else {
// Collect operands - commute if it uses the swapped predicate.		// Collect operands - commute if it uses the swapped predicate.
for (Value *V : VL) {		for (Value *V : VL) {
auto *Cmp = cast<CmpInst>(V);		auto *Cmp = cast<CmpInst>(V);
Value *LHS = Cmp->getOperand(0);		Value *LHS = Cmp->getOperand(0);
Value *RHS = Cmp->getOperand(1);		Value *RHS = Cmp->getOperand(1);
if (Cmp->getPredicate() != P0)		if (Cmp->getPredicate() != P0)
std::swap(LHS, RHS);		std::swap(LHS, RHS);
Show All 30 Lines	case Instruction::Xor: {
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: added a vector of un/bin op.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of un/bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE);		reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE, *this);
TE->setOperand(0, Left);		TE->setOperand(0, Left);
TE->setOperand(1, Right);		TE->setOperand(1, Right);
buildTree_rec(Left, Depth + 1, {TE, 0});		buildTree_rec(Left, Depth + 1, {TE, 0});
buildTree_rec(Right, Depth + 1, {TE, 1});		buildTree_rec(Right, Depth + 1, {TE, 1});
return;		return;
}		}

TE->setOperandsInOrder();		TE->setOperandsInOrder();
▲ Show 20 Lines • Show All 175 Lines • ▼ Show 20 Lines	case Instruction::ShuffleVector: {
}		}
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		LLVM_DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE);		reorderInputsAccordingToOpcode(VL, Left, Right, DL, SE, *this);
TE->setOperand(0, Left);		TE->setOperand(0, Left);
TE->setOperand(1, Right);		TE->setOperand(1, Right);
buildTree_rec(Left, Depth + 1, {TE, 0});		buildTree_rec(Left, Depth + 1, {TE, 0});
buildTree_rec(Right, Depth + 1, {TE, 1});		buildTree_rec(Right, Depth + 1, {TE, 1});
return;		return;
}		}

TE->setOperandsInOrder();		TE->setOperandsInOrder();
▲ Show 20 Lines • Show All 744 Lines • ▼ Show 20 Lines	for (unsigned I = VL.size(); I > 0; --I) {
if (!UniqueElements.insert(VL[Idx]).second)		if (!UniqueElements.insert(VL[Idx]).second)
ShuffledElements.insert(Idx);		ShuffledElements.insert(Idx);
}		}
return getGatherCost(VecTy, ShuffledElements);		return getGatherCost(VecTy, ShuffledElements);
}		}

// Perform operand reordering on the instructions in VL and return the reordered		// Perform operand reordering on the instructions in VL and return the reordered
// operands in Left and Right.		// operands in Left and Right.
void BoUpSLP::reorderInputsAccordingToOpcode(		void BoUpSLP::reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
ArrayRef<Value *> VL, SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right, const DataLayout &DL,		SmallVectorImpl<Value *> &Right,
ScalarEvolution &SE) {		const DataLayout &DL,
		ScalarEvolution &SE,
		const BoUpSLP &R) {
if (VL.empty())		if (VL.empty())
return;		return;
VLOperands Ops(VL, DL, SE);		VLOperands Ops(VL, DL, SE, R);
// Reorder the operands in place.		// Reorder the operands in place.
Ops.reorder();		Ops.reorder();
Left = Ops.getVL(0);		Left = Ops.getVL(0);
Right = Ops.getVL(1);		Right = Ops.getVL(1);
}		}

void BoUpSLP::setInsertPointAfterBundle(TreeEntry *E) {		void BoUpSLP::setInsertPointAfterBundle(TreeEntry *E) {
// Get the basic block this bundle is in. All instructions in the bundle		// Get the basic block this bundle is in. All instructions in the bundle
▲ Show 20 Lines • Show All 3,628 Lines • Show Last 20 Lines
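
For orientation, the tie-breaking idea from the summary of this revision can be sketched as a toy model. This is not the code from the patch: `Node`, `shallowScore`, `lookAheadScore`, `exampleNeedsSwap` and the three score constants are hypothetical names and values chosen for illustration, and operands are only compared position by position. The example mirrors the first function in lookahead.ll, where both operands of each add are fsub instructions, so the opcodes alone cannot decide the order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy expression node (hypothetical; not an LLVM class).
struct Node {
  std::string Opcode;            // e.g. "load", "fsub", "fadd"
  std::string Base;              // array name, meaningful for loads only
  int Index;                     // element index, meaningful for loads only
  std::vector<const Node *> Ops; // operands, empty for loads
};

// Illustrative scores (assumed values); only their relative order matters.
const int ScoreConsecutiveLoads = 3;
const int ScoreSameOpcode = 2;
const int ScoreFail = 0;

// Score a pair of values by looking at the two nodes only.
int shallowScore(const Node &L, const Node &R) {
  if (L.Opcode == "load" && R.Opcode == "load")
    return (L.Base == R.Base && R.Index == L.Index + 1)
               ? ScoreConsecutiveLoads
               : ScoreFail;
  return L.Opcode == R.Opcode ? ScoreSameOpcode : ScoreFail;
}

// Add the scores of the operands, position by position, up to Depth levels.
int lookAheadScore(const Node &L, const Node &R, int Depth) {
  int Score = shallowScore(L, R);
  if (Score == ScoreFail || Depth == 0 || L.Ops.size() != R.Ops.size())
    return Score;
  for (std::size_t I = 0; I < L.Ops.size(); ++I)
    Score += lookAheadScore(*L.Ops[I], *R.Ops[I], Depth - 1);
  return Score;
}

// Lane 0: (A[0] - B[0]) + (C[0] - D[0])
// Lane 1: (C[1] - D[1]) + (A[1] - B[1])
// Returns true if swapping the operands of the lane-1 add scores strictly
// higher than keeping them in their original order.
bool exampleNeedsSwap(int Depth) {
  Node A0{"load", "A", 0, {}}, A1{"load", "A", 1, {}};
  Node B0{"load", "B", 0, {}}, B1{"load", "B", 1, {}};
  Node C0{"load", "C", 0, {}}, C1{"load", "C", 1, {}};
  Node D0{"load", "D", 0, {}}, D1{"load", "D", 1, {}};
  Node SubAB0{"fsub", "", 0, {&A0, &B0}}, SubCD0{"fsub", "", 0, {&C0, &D0}};
  Node SubAB1{"fsub", "", 0, {&A1, &B1}}, SubCD1{"fsub", "", 0, {&C1, &D1}};

  int Keep = lookAheadScore(SubAB0, SubCD1, Depth) +
             lookAheadScore(SubCD0, SubAB1, Depth);
  int Swap = lookAheadScore(SubAB0, SubAB1, Depth) +
             lookAheadScore(SubCD0, SubCD1, Depth);
  return Swap > Keep;
}
```

With `Depth` set to 0 both orders score the same, so the sketch reports no reason to swap. With `Depth` set to 1 the consecutive loads under the fsub nodes break the tie in favour of swapping.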

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i64> [[V0:%.*]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i64> [[V0]], i32 1		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i64> [[V1:%.*]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i64> [[V1]], i32 1		; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP0_0:%.*]] = add i64 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP0_1:%.*]] = add i64 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i64 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i64 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP2_0:%.*]] = add i64 [[TMP0_0]], [[TMP0_1]]		; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP2_1:%.*]] = add i64 [[TMP1_0]], [[TMP1_1]]		; CHECK-NEXT: ret <2 x i64> [[TMP9]]
; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <2 x i64> undef, i64 [[TMP2_0]], i32 0
; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <2 x i64> [[TMP3_0]], i64 [[TMP2_1]], i32 1
; CHECK-NEXT: ret <2 x i64> [[TMP3_1]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	;
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[V0]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>		; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[V1]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[SHUFFLE3:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[SHUFFLE2]], [[SHUFFLE3]]		; CHECK-NEXT: ret <4 x i32> [[TMP9]]
; CHECK-NEXT: [[TMP9:%.*]] = sub <4 x i32> [[SHUFFLE2]], [[SHUFFLE3]]
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP5]], [[TMP10]]
; CHECK-NEXT: ret <4 x i32> [[TMP11]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
Show All 14 Lines	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[V0]], <2 x i32> undef, <2 x i32> <i32 1, i32 1>		; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x i32> [[V1]], <2 x i32> undef, <2 x i32> <i32 1, i32 1>		; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i32> [[TMP6]], [[TMP7]]		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP9:%.*]] = sub <2 x i32> [[TMP6]], [[TMP7]]		; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <2 x i32> [[TMP8]], <2 x i32> [[TMP9]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP3_3:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP11:%.*]] = add <2 x i32> [[TMP5]], [[TMP10]]
; CHECK-NEXT: [[TMP3_3:%.*]] = shufflevector <2 x i32> [[TMP11]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]		; CHECK-NEXT: ret <4 x i32> [[TMP3_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
▲ Show 20 Lines • Show All 89 Lines • ▼ Show 20 Lines	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[V0]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>		; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[V1]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[SHUFFLE3:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP8:%.*]] = sub <4 x i32> [[SHUFFLE2]], [[SHUFFLE3]]		; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[SHUFFLE2]], [[SHUFFLE3]]		; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP5]], [[TMP10]]		; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]
; CHECK-NEXT: [[TMP12:%.*]] = lshr <4 x i32> [[TMP11]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]
; CHECK-NEXT: [[TMP13:%.*]] = and <4 x i32> [[TMP12]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: [[TMP14:%.*]] = mul nuw <4 x i32> [[TMP13]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: ret i32 [[TMP15]]
; CHECK-NEXT: [[TMP15:%.*]] = add <4 x i32> [[TMP14]], [[TMP11]]
; CHECK-NEXT: [[TMP16:%.*]] = xor <4 x i32> [[TMP15]], [[TMP14]]
; CHECK-NEXT: [[TMP17:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP16]])
; CHECK-NEXT: ret i32 [[TMP17]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
Show All 38 Lines

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

Show All 21 Lines
; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2		; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2
; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3		; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3
; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4		; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4
; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5		; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5
; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6		; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6
; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7		; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7
; CHECK-NEXT: [[A_0:%.*]] = load double, double* [[IDX0]], align 8		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
; CHECK-NEXT: [[A_1:%.*]] = load double, double* [[IDX1]], align 8		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
; CHECK-NEXT: [[B_0:%.*]] = load double, double* [[IDX2]], align 8		; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[IDX2]] to <2 x double>*
; CHECK-NEXT: [[B_1:%.*]] = load double, double* [[IDX3]], align 8		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8
; CHECK-NEXT: [[C_0:%.*]] = load double, double* [[IDX4]], align 8		; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[IDX4]] to <2 x double>*
; CHECK-NEXT: [[C_1:%.*]] = load double, double* [[IDX5]], align 8		; CHECK-NEXT: [[TMP5:%.*]] = load <2 x double>, <2 x double>* [[TMP4]], align 8
; CHECK-NEXT: [[D_0:%.*]] = load double, double* [[IDX6]], align 8		; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[IDX6]] to <2 x double>*
; CHECK-NEXT: [[D_1:%.*]] = load double, double* [[IDX7]], align 8		; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8
; CHECK-NEXT: [[SUBAB_0:%.*]] = fsub fast double [[A_0]], [[B_0]]		; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[SUBCD_0:%.*]] = fsub fast double [[C_0]], [[D_0]]		; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]]
; CHECK-NEXT: [[SUBAB_1:%.*]] = fsub fast double [[A_1]], [[B_1]]		; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP8]], [[TMP9]]
; CHECK-NEXT: [[SUBCD_1:%.*]] = fsub fast double [[C_1]], [[D_1]]		; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
; CHECK-NEXT: [[ADDABCD_0:%.*]] = fadd fast double [[SUBAB_0]], [[SUBCD_0]]		; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8
; CHECK-NEXT: [[ADDCDAB_1:%.*]] = fadd fast double [[SUBCD_1]], [[SUBAB_1]]
; CHECK-NEXT: store double [[ADDABCD_0]], double* [[IDX0]], align 8
; CHECK-NEXT: store double [[ADDCDAB_1]], double* [[IDX1]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%idx0 = getelementptr inbounds double, double* %array, i64 0		%idx0 = getelementptr inbounds double, double* %array, i64 0
%idx1 = getelementptr inbounds double, double* %array, i64 1		%idx1 = getelementptr inbounds double, double* %array, i64 1
%idx2 = getelementptr inbounds double, double* %array, i64 2		%idx2 = getelementptr inbounds double, double* %array, i64 2
%idx3 = getelementptr inbounds double, double* %array, i64 3		%idx3 = getelementptr inbounds double, double* %array, i64 3
%idx4 = getelementptr inbounds double, double* %array, i64 4		%idx4 = getelementptr inbounds double, double* %array, i64 4
▲ Show 20 Lines • Show All 105 Lines • ▼ Show 20 Lines
; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2		; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2
; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3		; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3
; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4		; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4
; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5		; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5
; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6		; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6
; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7		; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7
; CHECK-NEXT: [[A_0:%.*]] = load double, double* [[IDX0]], align 8		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
; CHECK-NEXT: [[A_1:%.*]] = load double, double* [[IDX1]], align 8		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
; CHECK-NEXT: [[B_0:%.*]] = load double, double* [[IDX2]], align 8		; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[IDX2]] to <2 x double>*
; CHECK-NEXT: [[B_1:%.*]] = load double, double* [[IDX3]], align 8		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8
; CHECK-NEXT: [[C_0:%.*]] = load double, double* [[IDX4]], align 8		; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[IDX4]] to <2 x double>*
; CHECK-NEXT: [[C_1:%.*]] = load double, double* [[IDX5]], align 8		; CHECK-NEXT: [[TMP5:%.*]] = load <2 x double>, <2 x double>* [[TMP4]], align 8
; CHECK-NEXT: [[D_0:%.*]] = load double, double* [[IDX6]], align 8		; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[IDX6]] to <2 x double>*
; CHECK-NEXT: [[D_1:%.*]] = load double, double* [[IDX7]], align 8		; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8
; CHECK-NEXT: [[ADDAB_0:%.*]] = fadd fast double [[A_0]], [[B_0]]		; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]]
; CHECK-NEXT: [[SUBCD_0:%.*]] = fsub fast double [[C_0]], [[D_0]]		; CHECK-NEXT: [[TMP9:%.*]] = fadd fast <2 x double> [[TMP5]], [[TMP7]]
; CHECK-NEXT: [[ADDCD_1:%.*]] = fadd fast double [[C_1]], [[D_1]]		; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <2 x double> [[TMP8]], <2 x double> [[TMP9]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[SUBAB_1:%.*]] = fsub fast double [[A_1]], [[B_1]]		; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[ADDABCD_0:%.*]] = fadd fast double [[ADDAB_0]], [[SUBCD_0]]		; CHECK-NEXT: [[TMP12:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[ADDCDAB_1:%.*]] = fadd fast double [[ADDCD_1]], [[SUBAB_1]]		; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x double> [[TMP11]], <2 x double> [[TMP12]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: store double [[ADDABCD_0]], double* [[IDX0]], align 8		; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP13]], [[TMP10]]
; CHECK-NEXT: store double [[ADDCDAB_1]], double* [[IDX1]], align 8		; CHECK-NEXT: [[TMP15:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP14]], <2 x double>* [[TMP15]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%idx0 = getelementptr inbounds double, double* %array, i64 0		%idx0 = getelementptr inbounds double, double* %array, i64 0
%idx1 = getelementptr inbounds double, double* %array, i64 1		%idx1 = getelementptr inbounds double, double* %array, i64 1
%idx2 = getelementptr inbounds double, double* %array, i64 2		%idx2 = getelementptr inbounds double, double* %array, i64 2
%idx3 = getelementptr inbounds double, double* %array, i64 3		%idx3 = getelementptr inbounds double, double* %array, i64 3
%idx4 = getelementptr inbounds double, double* %array, i64 4		%idx4 = getelementptr inbounds double, double* %array, i64 4
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines
; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0		; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0
; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0		; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0
; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0		; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0
; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0		; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0
; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1		; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2		; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2
; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2		; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2
; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1		; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
		; CHECK-NEXT: [[A0:%.*]] = load double, double* [[IDXA0]], align 8
		; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8
		; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8
		; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8
		; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8
		; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8
		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXB0]] to <2 x double>*
		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[A1]], i32 1
		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> undef, double [[D0]], i32 0
		; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[B2]], i32 1
		; CHECK-NEXT: [[TMP6:%.*]] = fsub fast <2 x double> [[TMP3]], [[TMP5]]
		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> undef, double [[A0]], i32 0
		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[A2]], i32 1
		; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP8]], [[TMP1]]
		; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], [[TMP6]]
		; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0
		; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1
		; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8
		; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8
		; CHECK-NEXT: ret void
		;
		entry:
		%IdxA0 = getelementptr inbounds double, double* %A, i64 0
		%IdxB0 = getelementptr inbounds double, double* %B, i64 0
		%IdxC0 = getelementptr inbounds double, double* %C, i64 0
		%IdxD0 = getelementptr inbounds double, double* %D, i64 0

		%IdxA1 = getelementptr inbounds double, double* %A, i64 1
		%IdxB2 = getelementptr inbounds double, double* %B, i64 2
		%IdxA2 = getelementptr inbounds double, double* %A, i64 2
		%IdxB1 = getelementptr inbounds double, double* %B, i64 1

		%A0 = load double, double *%IdxA0, align 8
		%B0 = load double, double *%IdxB0, align 8
		%C0 = load double, double *%IdxC0, align 8
		%D0 = load double, double *%IdxD0, align 8

		%A1 = load double, double *%IdxA1, align 8
		%B2 = load double, double *%IdxB2, align 8
		%A2 = load double, double *%IdxA2, align 8
		%B1 = load double, double *%IdxB1, align 8

		%subA0B0 = fsub fast double %A0, %B0
		%subC0D0 = fsub fast double %C0, %D0

		%subA1B2 = fsub fast double %A1, %B2
		%subA2B1 = fsub fast double %A2, %B1

		%add0 = fadd fast double %subA0B0, %subC0D0
		%add1 = fadd fast double %subA1B2, %subA2B1

		%IdxS0 = getelementptr inbounds double, double* %S, i64 0
		%IdxS1 = getelementptr inbounds double, double* %S, i64 1

		store double %add0, double *%IdxS0, align 8
		store double %add1, double *%IdxS1, align 8

		; External use
		store double %A1, double *%Ext1, align 8
		ret void
		}

		;  A[0] B[0] C[0] D[0]      A[1] B[2] A[2] B[1]
		;     \  /     \  /         /  \  /     \  / \
		;       -       -   U1,U2,U3     -       -    U4,U5
		;        \     /                  \     /
		;           +                        +
		;           |                        |
		;          S[0]                     S[1]
		;
		;
		; If we limit the users budget for the look-ahead heuristic to 2, then the
		; look-ahead heuristic has no way of choosing B[1] (with 2 external users)
		; over A[1] (with 3 external users).
		; The result is that the operands of the Add are not reordered and the loads
		; from A get vectorized instead of the loads from B.
		;
		define void @lookahead_limit_users_budget(double* %A, double *%B, double *%C, double *%D, double *%S, double *%Ext1, double *%Ext2, double *%Ext3, double *%Ext4, double *%Ext5) {
		; CHECK-LABEL: @lookahead_limit_users_budget(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0
		; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0
		; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0
		; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0
		; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
		; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2
		; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2
		; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
		; CHECK-NEXT: [[B0:%.*]] = load double, double* [[IDXB0]], align 8
		; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8
		; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8
		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXA0]] to <2 x double>*
		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
		; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8
		; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8
		; CHECK-NEXT: [[B1:%.*]] = load double, double* [[IDXB1]], align 8
		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[B0]], i32 0
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B2]], i32 1
		; CHECK-NEXT: [[TMP4:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
		; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0
		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A2]], i32 1
		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> undef, double [[D0]], i32 0
		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[B1]], i32 1
		; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP6]], [[TMP8]]
		; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP9]]
		; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0
		; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1
		; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8
		; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[TMP1]], i32 1
		; CHECK-NEXT: store double [[TMP12]], double* [[EXT1:%.*]], align 8
		; CHECK-NEXT: store double [[TMP12]], double* [[EXT2:%.*]], align 8
		; CHECK-NEXT: store double [[TMP12]], double* [[EXT3:%.*]], align 8
		; CHECK-NEXT: store double [[B1]], double* [[EXT4:%.*]], align 8
		; CHECK-NEXT: store double [[B1]], double* [[EXT5:%.*]], align 8
		; CHECK-NEXT: ret void
		;
		entry:
		%IdxA0 = getelementptr inbounds double, double* %A, i64 0
		%IdxB0 = getelementptr inbounds double, double* %B, i64 0
		%IdxC0 = getelementptr inbounds double, double* %C, i64 0
		%IdxD0 = getelementptr inbounds double, double* %D, i64 0

		; [22 lines collapsed in the archived diff view]
		%add1 = fadd fast double %subA1B2, %subA2B1

		%IdxS0 = getelementptr inbounds double, double* %S, i64 0
		%IdxS1 = getelementptr inbounds double, double* %S, i64 1

		store double %add0, double *%IdxS0, align 8
		store double %add1, double *%IdxS1, align 8

		; External uses of A1
		store double %A1, double *%Ext1, align 8
		store double %A1, double *%Ext2, align 8
		store double %A1, double *%Ext3, align 8

		; External uses of B1
		store double %B1, double *%Ext4, align 8
		store double %B1, double *%Ext5, align 8

		ret void
		}
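
The comment above `@lookahead_limit_users_budget` says that a users budget of 2 leaves the heuristic unable to tell a value with 2 external users from one with 3. The following is a toy C++ model of that effect, not the code in SLPVectorizer.cpp: the names `visibleExternalUsers` and `canPreferFewerUsers` are hypothetical, and the sketch assumes the budget simply caps how many users are counted.

```cpp
#include <algorithm>

// Toy model (an assumption, not the SLPVectorizer implementation): the
// heuristic inspects at most `budget` users of a value, so any count at or
// above the budget looks the same.
int visibleExternalUsers(int externalUsers, int budget) {
  return std::min(externalUsers, budget);
}

// Returns true when the capped counts still differ, i.e. when the heuristic
// could still prefer the candidate with fewer external users.
bool canPreferFewerUsers(int usersA, int usersB, int budget) {
  return visibleExternalUsers(usersA, budget) !=
         visibleExternalUsers(usersB, budget);
}
```

Under this model, `canPreferFewerUsers(3, 2, 2)` is false (A[1] and B[1] tie), while raising the budget to 3 makes the two candidates distinguishable again.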

		; This checks that the lookahead code does not crash when instructions with the same opcodes have different numbers of operands (in this case the calls).

		%Class = type { i8 }
		declare double @_ZN1i2ayEv(%Class*)
		declare double @_ZN1i2axEv()

		define void @lookahead_crash(double* %A, double *%S, %Class *%Arg0) {
		; CHECK-LABEL: @lookahead_crash(
		; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0
		; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[IDXA0]] to <2 x double>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 8
		; CHECK-NEXT: [[C0:%.*]] = call double @_ZN1i2ayEv(%Class* [[ARG0:%.*]])
		; CHECK-NEXT: [[C1:%.*]] = call double @_ZN1i2axEv()
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0
		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[C1]], i32 1
		; CHECK-NEXT: [[TMP5:%.*]] = fadd fast <2 x double> [[TMP2]], [[TMP4]]
		; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0
		; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1
		; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 8
		; CHECK-NEXT: ret void
		;
		%IdxA0 = getelementptr inbounds double, double* %A, i64 0
		%IdxA1 = getelementptr inbounds double, double* %A, i64 1

		%A0 = load double, double *%IdxA0, align 8
		%A1 = load double, double *%IdxA1, align 8

		%C0 = call double @_ZN1i2ayEv(%Class *%Arg0)
		%C1 = call double @_ZN1i2axEv()

		%add0 = fadd fast double %A0, %C0
		%add1 = fadd fast double %A1, %C1

		%IdxS0 = getelementptr inbounds double, double* %S, i64 0
		%IdxS1 = getelementptr inbounds double, double* %S, i64 1
		store double %add0, double *%IdxS0, align 8
		store double %add1, double *%IdxS1, align 8
		ret void
		}
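
`@lookahead_crash` pairs two calls that share an opcode but take different numbers of operands. One way to guard a pairwise operand comparison against that mismatch is sketched below. This is an illustration only: `ToyInst` and `matchingOperands` are hypothetical names, and the patch itself may handle the case differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction: an opcode name plus operand names.
struct ToyInst {
  std::string opcode;
  std::vector<std::string> operands;
};

// Counts positions where both instructions hold the same operand. Only
// indices valid for BOTH operand lists are visited, so a shorter list can
// never be read out of bounds.
int matchingOperands(const ToyInst &a, const ToyInst &b) {
  if (a.opcode != b.opcode)
    return 0;
  std::size_t n = std::min(a.operands.size(), b.operands.size());
  int matches = 0;
  for (std::size_t i = 0; i < n; ++i)
    if (a.operands[i] == b.operands[i])
      ++matches;
  return matches;
}
```

Bounding the loop by the smaller operand count keeps the comparison well defined even when the two instructions only look alike at the opcode level.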

		; This checks that we choose to group consecutive extracts from the same vectors.
		define void @ChecksExtractScores(double* %storeArray, double* %array, <2 x double>* %vecPtr1, <2 x double>* %vecPtr2) {
		; CHECK-LABEL: @ChecksExtractScores(
		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
		; CHECK-NEXT: [[LOADA0:%.*]] = load double, double* [[IDX0]], align 4
		; CHECK-NEXT: [[LOADA1:%.*]] = load double, double* [[IDX1]], align 4
		; CHECK-NEXT: [[LOADVEC:%.*]] = load <2 x double>, <2 x double>* [[VECPTR1:%.*]], align 4
		; CHECK-NEXT: [[LOADVEC2:%.*]] = load <2 x double>, <2 x double>* [[VECPTR2:%.*]], align 4
		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[LOADA0]], i32 0
		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[LOADA0]], i32 1
		; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x double> [[LOADVEC]], [[TMP2]]
		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> undef, double [[LOADA1]], i32 0
		; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[LOADA1]], i32 1
		; CHECK-NEXT: [[TMP6:%.*]] = fmul <2 x double> [[LOADVEC2]], [[TMP5]]
		; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x double> [[TMP3]], [[TMP6]]
		; CHECK-NEXT: [[SIDX0:%.*]] = getelementptr inbounds double, double* [[STOREARRAY:%.*]], i64 0
		; CHECK-NEXT: [[SIDX1:%.*]] = getelementptr inbounds double, double* [[STOREARRAY]], i64 1
		; CHECK-NEXT: [[TMP8:%.*]] = bitcast double* [[SIDX0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP7]], <2 x double>* [[TMP8]], align 8
		; CHECK-NEXT: ret void
		;
		%idx0 = getelementptr inbounds double, double* %array, i64 0
		%idx1 = getelementptr inbounds double, double* %array, i64 1
		%loadA0 = load double, double* %idx0, align 4
		%loadA1 = load double, double* %idx1, align 4

		%loadVec = load <2 x double>, <2 x double>* %vecPtr1, align 4
		%extrA0 = extractelement <2 x double> %loadVec, i32 0
		%extrA1 = extractelement <2 x double> %loadVec, i32 1
		%loadVec2 = load <2 x double>, <2 x double>* %vecPtr2, align 4
		%extrB0 = extractelement <2 x double> %loadVec2, i32 0
		%extrB1 = extractelement <2 x double> %loadVec2, i32 1

		%mul0 = fmul double %extrA0, %loadA0
		%mul1 = fmul double %extrA1, %loadA0
		%mul3 = fmul double %extrB0, %loadA1
		%mul4 = fmul double %extrB1, %loadA1
		%add0 = fadd double %mul0, %mul3
		%add1 = fadd double %mul1, %mul4

		%sidx0 = getelementptr inbounds double, double* %storeArray, i64 0
		%sidx1 = getelementptr inbounds double, double* %storeArray, i64 1
		store double %add0, double *%sidx0, align 8
		store double %add1, double *%sidx1, align 8
		ret void
		}
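
`@ChecksExtractScores` expects extracts of consecutive lanes from the same vector to be grouped together. A toy scoring rule with that preference might look like the sketch below. The struct `ToyExtract`, the function `extractPairScore`, and the score values 0/1/2 are all assumptions made for illustration; as noted in the review discussion, the relative order of the scores matters more than their values.

```cpp
// Hypothetical description of an extractelement: which source vector it
// reads from and which lane it extracts.
struct ToyExtract {
  int vectorId;
  int lane;
};

// Toy score for placing `first` and `second` in adjacent lanes of one bundle.
// Consecutive lanes of the same vector score highest, other extracts from
// the same vector score lower, and extracts from different vectors score 0.
int extractPairScore(ToyExtract first, ToyExtract second) {
  if (first.vectorId != second.vectorId)
    return 0;
  return second.lane == first.lane + 1 ? 2 : 1;
}
```

With this rule, pairing `%extrA0` with `%extrA1` (same vector, lanes 0 and 1) outscores pairing `%extrA0` with `%extrB1`, which matches the grouping the CHECK lines expect.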

Revision Contents

Diff 228629

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll
