This is an archive of the discontinued LLVM Phabricator instance.

Differential D60897

[SLP] Look-ahead operand reordering heuristic.
ClosedPublic

Authored by vporpo on Apr 18 2019, 4:36 PM.


Details

Reviewers

RKSimon
ABataev
dtemirbulatov
Ayal
hfinkel
rnk

Commits

rG6a18a9548761: [SLP] Look-ahead operand reordering heuristic.
rGcf47ff5ffb1a: [SLP] Recommit: Look-ahead operand reordering heuristic.
rL364964: [SLP] Recommit: Look-ahead operand reordering heuristic.
rG574cb0eb3a7a: [SLP] Look-ahead operand reordering heuristic.
rL364478: [SLP] Look-ahead operand reordering heuristic.
rG5698921be2d5: [SLP] Look-ahead operand reordering heuristic.
rL364084: [SLP] Look-ahead operand reordering heuristic.

Summary

This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).
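The tie-breaking idea in the summary can be illustrated with a toy model. The sketch below is not the patch's code: `Node`, `makeLoad`, `makeOp`, `shallowScore` and `lookAheadScore` are hypothetical names, and the scores 3 (consecutive loads) and 2 (same opcode) are only borrowed from the constants shown in the diff further down. It assumes a tiny expression tree in which loads carry an array name and an index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy expression node; this is not an LLVM type.
struct Node {
  std::string Opcode;            // e.g. "load", "fsub", "fadd"
  std::string Base;              // array name (loads only)
  int Index = 0;                 // element index (loads only)
  std::vector<const Node *> Ops; // operands (empty for loads)
};

Node makeLoad(const std::string &Base, int Index) {
  Node N;
  N.Opcode = "load";
  N.Base = Base;
  N.Index = Index;
  return N;
}

Node makeOp(const std::string &Opcode, std::vector<const Node *> Ops) {
  Node N;
  N.Opcode = Opcode;
  N.Ops = std::move(Ops);
  return N;
}

// Score of placing A and B in consecutive lanes, ignoring their operands.
int shallowScore(const Node &A, const Node &B) {
  if (A.Opcode == "load" && B.Opcode == "load")
    return (A.Base == B.Base && B.Index == A.Index + 1) ? 3 : 0;
  return A.Opcode == B.Opcode ? 2 : 0;
}

// Adds the best operand-pair scores up to Depth levels below A and B.
// Each operand of B is paired with at most one operand of A.
int lookAheadScore(const Node &A, const Node &B, int Depth) {
  int Score = shallowScore(A, B);
  if (Depth == 0 || Score == 0 || A.Ops.empty() || B.Ops.empty())
    return Score;
  std::vector<bool> Used(B.Ops.size(), false);
  for (const Node *OpA : A.Ops) {
    int Best = 0;
    int BestIdx = -1;
    for (std::size_t J = 0; J < B.Ops.size(); ++J) {
      if (Used[J])
        continue;
      int S = lookAheadScore(*OpA, *B.Ops[J], Depth - 1);
      if (S > Best) {
        Best = S;
        BestIdx = static_cast<int>(J);
      }
    }
    if (BestIdx >= 0) {
      Used[BestIdx] = true;
      Score += Best;
    }
  }
  return Score;
}
```

With depth 0, the pairs (A[0]-B[0], A[1]-B[1]) and (A[0]-B[0], C[1]-D[1]) both score as "same opcode" and tie; with depth 1 the first pair also collects two consecutive-load matches and wins.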

Diff Detail

Event Timeline

vporpo created this revision. Apr 18 2019, 4:36 PM

Herald added a subscriber: llvm-commits. Apr 18 2019, 4:36 PM

RKSimon added inline comments. Apr 27 2019, 8:35 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
153	"slp-max-look-ahead-depth" might be better?
753	These kinds of hard coded costs scare me, especially without any context/description.

vporpo updated this revision to Diff 197190. Apr 29 2019, 2:31 PM

vporpo marked 2 inline comments as done.

vporpo added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
753	I agree, they look quite scary, but the values themselves are not very important. What we really care about is figuring out whether the values could potentially be vectorized or not. The relative differences between the scores are not very important either. They just show that matching loads is usually preferable to matching instructions with the same opcode, etc. We could use the same score of 1 for all of them and we would still get decent reordering. Another alternative would be to check TTI for each potential candidate vector, but that looks like overkill.
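To make the "order of preference" point concrete, here is a minimal sketch. The score values are the ones listed in the diff on this page; everything else (`MatchKind`, `ScoreTable`, `pickBest`) is a hypothetical illustration, not code from the patch. It picks the best candidate once with the graded scores and once with every successful match scored 1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of how well two values match.
enum MatchKind {
  Fail = 0,
  Undef,
  Splat,
  AltOpcodes,
  SameOpcode,
  Constants,
  ConsecutiveLoads,
  NumKinds
};

using ScoreTable = std::array<int, NumKinds>;

// Scores as listed in the patch: loads 3, constants and same opcode 2,
// alt opcodes, splat and undef 1, failure 0.
const ScoreTable GradedScores = {{0, 1, 1, 1, 2, 2, 3}};
// Alternative mentioned in the review: every successful match scores 1.
const ScoreTable UniformScores = {{0, 1, 1, 1, 1, 1, 1}};

// Returns the index of the first candidate with the highest non-zero score,
// or -1 if every candidate fails to match.
int pickBest(const std::vector<MatchKind> &Candidates,
             const ScoreTable &Scores) {
  int BestIdx = -1;
  int BestScore = 0;
  for (std::size_t I = 0; I < Candidates.size(); ++I) {
    int S = Scores[Candidates[I]];
    if (S > BestScore) {
      BestScore = S;
      BestIdx = static_cast<int>(I);
    }
  }
  return BestIdx;
}
```

Both tables reject failed matches, so reordering still works with uniform scores; the graded table only changes the outcome when two successful matches compete, where it prefers the consecutive loads.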

Tests?

lib/Transforms/Vectorize/SLPVectorizer.cpp
765	Seems to me, the code is not formatted
766	`auto *`
767	`auto *`
773	`auto *`
774	`auto *`
778	`auto *`
779	`auto *`
824	`auto *`
825	`auto *`

vporpo updated this revision to Diff 197195. Apr 29 2019, 3:13 PM

vporpo marked 9 inline comments as done.

Better to commit the test itself at first, without this patch

Rebased.

RKSimon added inline comments. May 1 2019, 9:41 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
769	(style) remove outer brackets
782	What happens in the case where we have alt opcodes? Should we have a preference for all the same opcode vs with alt-opcode? Sometimes the alt-opcodes will fold away (shl + mul etc.) - other times it won't (shl + lshr).

Addressed comments and updated lit test.

lib/Transforms/Vectorize/SLPVectorizer.cpp
782	Hmm, good point. Well, currently `getScoreAtLevelRec()` will simply walk past the alt instructions and will assign them `ScoreSameOpcode`. This does not look very accurate because alt opcodes usually require shuffles and should have a lower score. I introduced a new `ScoreAltOpcodes = 1` so that alt opcodes are not given the same score as identical opcodes (please see the new functions in the lit test). As for the alt opcodes that fold away, maybe that should be fixed in `getSameOpcode()` and in `struct InstructionState`? If they get folded, then maybe `isAltShuffle()` should return false?

Updated getBestOperand() to use getLookAheadScore() for Load and Constant, not just Opcode.

please can you rebase this?

Rebased.

dtemirbulatov added inline comments. May 22 2019, 12:26 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
932	one extra space here.

Oh, I noticed some regressions with this change in AArch64/matmul.ll and AArch64/transpose.ll. Maybe there is a way to isolate them with heuristics?

Yes, I will take a look. Maybe it is worth using TTI for the scores after all.

rcorcs added a subscriber: rcorcs. May 28 2019, 4:57 AM

I investigated the two AArch64 failing tests. These tests feature the exact problem that we are trying to solve with this look-ahead heuristic. A commutative instruction had operands of the same opcode that the current heuristic has no way of reordering in an informed way. The current reordering was just lucky to pick the proper one, while the look-ahead heuristic was reordering the operands according to the score. However, the problem was that the score calculation was not considering external uses and was therefore favoring a sub-optimal operand ordering.

I updated the patch to factor in the cost of external uses, and both failures are now gone. I also updated the lit-test with a test that shows the problem with the external-uses.

Rebased.

Hmm, I have another failure with this change on my setup; now it is PR39774.ll. It might be a difference in sort algorithms or something similar, since it is just a swap of two inserts. I don't have this failure if PR39774.ll is left intact.

Removed changes in PR39774.ll.

Hmm, I am now getting the same failure as you, @dtemirbulatov. I am not sure what was wrong before, but it seems that the change in PR39774.ll is no longer needed.

LGTM.

This revision is now accepted and ready to land. Jun 13 2019, 5:35 AM

@ABataev @RKSimon any comments?

RKSimon mentioned this in rG72186a24942f: [SLP][X86] Add lookahead reordering tests from D60897. Jun 20 2019, 5:52 AM

RKSimon mentioned this in rL363925: [SLP][X86] Add lookahead reordering tests from D60897.

@vporpo I've committed the lookahead.ll changes at rL363925 with current (trunk) codegen - please can you rebase?

RKSimon requested changes to this revision. Jun 20 2019, 6:15 AM

This revision now requires changes to proceed. Jun 20 2019, 6:15 AM

Rebased

LGTM - thanks!

This revision is now accepted and ready to land. Jun 21 2019, 3:59 AM

Thank you for the reviews. Please commit the patch.

Closed by commit rL364084: [SLP] Look-ahead operand reordering heuristic. (authored by RKSimon). Jun 21 2019, 10:57 AM

This revision was automatically updated to reflect the committed changes.

This was reverted in r364111 since it was causing a failure in Chromium reported by @rnk.

This revision is now accepted and ready to land. Jun 25 2019, 12:05 AM

@rnk do you have a repro yet please?

This revision now requires changes to proceed. Jun 25 2019, 12:18 AM

Yes, @RKSimon. It was posted on llvm-commits. I did reproduce it and I will update the patch with the fix + lit test.

Fixed crash in chromium reported by @rnk.

The crash was caused by two call instructions with different numbers of arguments (see the lookahead_crash() function in the lit test).

Harbormaster completed remote builds in B33863: Diff 206383. Jun 25 2019, 1:08 AM

vporpo marked an inline comment as done. Jun 25 2019, 1:15 AM

vporpo added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
910	This was the cause of the crash. There was no `std::min` here, so ToIdx could be `OpIdx1 + 1` even if `I2` had fewer operands than that.
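The line in question is not visible in the diff shown on this page, so the following is only a sketch of the kind of clamping the comment describes. The helper name `operandRange`, its parameters, and the commutative/non-commutative split are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper: returns the half-open range [FromIdx, ToIdx) of I2's
// operand indexes that may be paired with operand OpIdx1 of I1.
// Assumption: commutative instructions try every operand, others only the
// operand at the same position.
std::pair<int, int> operandRange(int OpIdx1, int NumOperands2,
                                 bool Commutative) {
  int FromIdx = Commutative ? 0 : OpIdx1;
  // Without std::min, ToIdx would be OpIdx1 + 1 even when I2 has fewer
  // operands than that, and the caller would read past I2's last operand.
  int ToIdx = Commutative ? NumOperands2 : std::min(NumOperands2, OpIdx1 + 1);
  return {FromIdx, ToIdx};
}
```

For two calls with three and two operands, asking for the partners of operand 2 of the first call now yields the empty range [2, 2) instead of [2, 3).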

LGTM

This revision is now accepted and ready to land. Jun 26 2019, 4:31 AM

Closed by commit rL364478: [SLP] Look-ahead operand reordering heuristic. (authored by vporpo). Jun 26 2019, 2:30 PM

This revision was automatically updated to reflect the committed changes.

This change has caused a massive increase in build times when using LTO. On my workstation, when building the Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we have seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

In D60897#1562427, @phosek wrote:

This change has caused a massive increase in build times when using LTO. On my workstation, when building the Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we have seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

Reverting makes sense to me. @phosek, are you in a position to profile it to see where the time is going in SLP, please?

This revision is now accepted and ready to land. Jun 28 2019, 9:54 AM

RKSimon requested changes to this revision. Jun 28 2019, 9:54 AM

This revision now requires changes to proceed. Jun 28 2019, 9:54 AM

I think there are two possible causes for the compilation time increase:

Line 901: We can restrict the number of operands to a max of 2.
Line 820: We can restrict the visited users to a max of 2.

I can either create a quick patch, or I can revert it. Either is fine.

In D60897#1562519, @vporpo wrote:

I think there are two possible causes for the compilation time increase:

Line 901: We can restrict the number of operands to a max of 2.

Line 820: We can restrict the visited users to a max of 2.

I can either create a quick patch, or I can revert it. Either is fine.

If you can create a quick patch, I'd be happy to try it locally to see if it addresses the problem.

Thanks, that would be really helpful, @phosek. Let me create the quick patch.

vporpo mentioned this in D63948: [SLP] Limit compilation time of look-ahead operand reordering heuristic. Jun 28 2019, 11:42 AM

The fix for the compilation-time issue reported by @phosek is in D63948.

Fixed the compilation-time issue with the introduction of the user budget limit from D63948. Also added lit test for it.
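The budget code itself lives in D63948 and is not shown on this page. As a rough illustration of the general idea only, the sketch below caps how many users of a value are inspected; `UserScan`, `scanUsers` and the `Budget` parameter are hypothetical names, not the patch's API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result of inspecting the users of one value.
struct UserScan {
  int ExternalUses;     // external users seen before the budget ran out
  bool BudgetExhausted; // true if some users were left unvisited
};

// Visits at most Budget users. Each entry of IsExternalUser says whether the
// corresponding user lies outside the candidate vectorization tree.
UserScan scanUsers(const std::vector<bool> &IsExternalUser, int Budget) {
  UserScan Result{0, false};
  for (bool External : IsExternalUser) {
    if (Budget-- <= 0) {
      // Stop early so the cost stays bounded for values with many users.
      Result.BudgetExhausted = true;
      break;
    }
    if (External)
      ++Result.ExternalUses;
  }
  return Result;
}
```

A caller would need to decide how to treat an exhausted budget, for example by scoring the candidate conservatively; that policy is left open here.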

Herald added a subscriber: hiraditya. Jul 1 2019, 3:06 PM

Harbormaster completed remote builds in B34163: Diff 207417. Jul 1 2019, 3:07 PM

LGTM

This revision is now accepted and ready to land. Jul 2 2019, 1:11 AM

Closed by commit rL364964: [SLP] Recommit: Look-ahead operand reordering heuristic. (authored by vporpo). Jul 2 2019, 1:21 PM

This revision was automatically updated to reflect the committed changes.

The following is a simplified test case that shows the performance regression in Eigen. The number of instructions in the inner loop increased from 85 to 93. I hope it is helpful.

#include <complex>
#include <cstdio>

#include "third_party/eigen3/Eigen/Core"

template <typename ValueType>
inline void CheckNonnegative(ValueType value) {
  if (!(value >= 0)) {
    fprintf(stderr, "Check failed.\n");
  }
}

template <typename ValueType>
inline decltype(std::norm(ValueType(0))) SquaredMagnitude(ValueType value) {
  typedef decltype(std::norm(ValueType(0))) ComponentType;
  ComponentType r = std::real(value);
  ComponentType i = std::imag(value);
  return r * r + i * i;
}

template <typename VectorType2>
class DotEigen {
 public:
  DotEigen() : first_vector_(16), second_vector_(16) {
  }

  inline void Execute() {
    result_ = first_vector_.dot(second_vector_);
    CheckNonnegative(SquaredMagnitude(result_));
  }

 public:
  Eigen::Matrix<float, 16, 1> first_vector_;
  VectorType2 second_vector_;
  std::complex<float> result_;
};

typedef DotEigen<Eigen::Matrix<std::complex<float>, 16, 1>> MOperation;
MOperation* operation;

void BM_Dot_RealComplex_EigenDotFixed(int num_iterations) {
  for (int iter = 0; iter < num_iterations; ++iter) {
    operation->Execute();
  }
}

Thanks @Carrot, I will investigate the issue.

This revision is now accepted and ready to land. Oct 17 2019, 6:55 PM

@Carrot What build settings are you using? I'm not seeing this here: https://godbolt.org/z/4jq0g8

@RKSimon, my command line options are:

-O3 -m64 -msse4.2 -mpclmul -maes -mprefer-vector-width=128 -fexperimental-new-pass-manager -fPIE

I compared r364964 and r364963.

This fixes the issue reported by @Carrot. It updates getShallowScore() to better cope with extracts from consecutive locations of the same vector (see last lit test). @Carrot, if you have the time please verify that this patch no longer causes a regression. Thanks!

Harbormaster completed remote builds in B39873: Diff 225923. Oct 21 2019, 11:05 AM

@vporpo, it's verified that the eigen test doesn't regress now, actually the inner loop is the same as before.
Thanks

Please reland this patch again.

@RKSimon any comments about the latest changes?

Ping @RKSimon

@vporpo Please can you rebase this?

Rebased.

Harbormaster completed remote builds in B40727: Diff 228629. Nov 10 2019, 8:41 PM

LGTM - cheers

Closed by commit rG6a18a9548761: [SLP] Look-ahead operand reordering heuristic. (authored by vporpo). Nov 11 2019, 9:21 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                             Size
lib/Transforms/Vectorize/SLPVectorizer.cpp       190 lines
test/Transforms/SLPVectorizer/X86/lookahead.ll   178 lines

Diff 200564

lib/Transforms/Vectorize/SLPVectorizer.cpp


[... 141 lines not shown ...]
 static cl::opt<unsigned> RecursionMaxDepth(
     "slp-recursion-max-depth", cl::init(12), cl::Hidden,
     cl::desc("Limit the recursion depth when building a vectorizable tree"));

 static cl::opt<unsigned> MinTreeSize(
     "slp-min-tree-size", cl::init(3), cl::Hidden,
     cl::desc("Only vectorize small trees if they are fully vectorizable"));

+// The maximum depth that the look-ahead score heuristic will explore.
+// The higher this value, the higher the compilation time overhead.
+static cl::opt<int> LookAheadMaxDepth(
+    "slp-max-look-ahead-depth", cl::init(2), cl::Hidden,
+    cl::desc("The maximum look-ahead depth for operand reordering scores"));
+
 static cl::opt<bool>
     ViewSLPTree("view-slp-tree", cl::Hidden,
                 cl::desc("Display the SLP trees with Graphviz"));

 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
 static const unsigned AliasedCheckLimit = 10;

[... 570 lines not shown; enclosing context: void clearUsed() ...]
       OpsVec[OpIdx][Lane].IsUsed = false;
     }

     /// Swap the operand at \p OpIdx1 with that one at \p OpIdx2.
     void swap(unsigned OpIdx1, unsigned OpIdx2, unsigned Lane) {
       std::swap(OpsVec[OpIdx1][Lane], OpsVec[OpIdx2][Lane]);
     }

+    // The hard-coded scores listed here are not very important. When computing
+    // the scores of matching one sub-tree with another, we are basically
+    // counting the number of values that are matching. So even if all scores
+    // are set to 1, we would still get a decent matching result.
+    // However, sometimes we have to break ties. For example we may have to
+    // choose between matching loads vs matching opcodes. This is what these
+    // scores are helping us with: they provide the order of preference.
+
+    /// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]).
+    static const int ScoreConsecutiveLoads = 3;
+    /// Constants.
+    static const int ScoreConstants = 2;
+    /// Instructions with the same opcode.
+    static const int ScoreSameOpcode = 2;
+    /// Instructions with alt opcodes (e.g, add + sub).
+    static const int ScoreAltOpcodes = 1;
+    /// Identical instructions (a.k.a. splat or broadcast).
+    static const int ScoreSplat = 1;
+    /// Matching with an undef is preferable to failing.
+    static const int ScoreUndef = 1;
+    /// Score for failing to find a decent match.
+    static const int ScoreFail = 0;
+
+    /// \returns the score of placing \p V1 and \p V2 in consecutive lanes.
+    static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,
+                               ScalarEvolution &SE) {
+      auto *LI1 = dyn_cast<LoadInst>(V1);
+      auto *LI2 = dyn_cast<LoadInst>(V2);
+      if (LI1 && LI2)
+        return isConsecutiveAccess(LI1, LI2, DL, SE)
+                   ? VLOperands::ScoreConsecutiveLoads
+                   : VLOperands::ScoreFail;
+
+      auto *C1 = dyn_cast<Constant>(V1);
+      auto *C2 = dyn_cast<Constant>(V2);
+      if (C1 && C2)
+        return VLOperands::ScoreConstants;
+
+      auto *I1 = dyn_cast<Instruction>(V1);
+      auto *I2 = dyn_cast<Instruction>(V2);
+      if (I1 && I2) {
+        if (I1 == I2)
+          return VLOperands::ScoreSplat;
+        InstructionsState S = getSameOpcode({I1, I2});
+        if (S.getOpcode())
+          return S.isAltShuffle() ? VLOperands::ScoreAltOpcodes
+                                  : VLOperands::ScoreSameOpcode;
+      }
+
+      if (isa<UndefValue>(V2))
+        return VLOperands::ScoreUndef;
+
+      return VLOperands::ScoreFail;
+    }

+    /// Go through the operands of \p V1 and \p V2 recursively until \p
+    /// MaxLevel, and return the cumulative score. For example:
+    /// \verbatim
+    ///  A[0]  B[0]  A[1]  B[1]  C[0] D[0]  B[1] A[1]
+    ///     \ /         \ /         \ /        \ /
+    ///      +           +           +          +
+    ///     G1          G2          G3         G4
+    /// \endverbatim
+    /// The getScoreAtLevelRec(G1, G2) function will try to match the nodes at
+    /// each level recursively, accumulating the score. It starts from matching
+    /// the additions at level 0, then moves on to the loads (level 1). The
+    /// score of G1 and G2 is higher than G1 and G3, because {A[0],A[1]} and
+    /// {B[0],B[1]} match with VLOperands::ScoreConsecutiveLoads, while
+    /// {A[0],C[0]} has a score of VLOperands::ScoreFail.
+    /// Please note that the order of the operands does not matter, as we
+    /// evaluate the score of all profitable combinations of operands. In
+    /// other words the score of G1 and G4 is the same as G1 and G2. This
+    /// heuristic is based on ideas described in:
+    ///   Look-ahead SLP: Auto-vectorization in the presence of commutative
+    ///   operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
+    ///   Luís F. W. Góes
+    static int getScoreAtLevelRec(Value *V1, Value *V2, int CurrLevel,
+                                  int MaxLevel, const DataLayout &DL,
+                                  ScalarEvolution &SE) {
+      // Get the shallow score of V1 and V2.
+      int ShallowScoreAtThisLevel = getShallowScore(V1, V2, DL, SE);
+
+      // If reached MaxLevel,
+      //  or if V1 and V2 are not instructions,
+      //  or if they are SPLAT,
+      //  or if they are not consecutive, early return the current cost.
+      auto *I1 = dyn_cast<Instruction>(V1);
+      auto *I2 = dyn_cast<Instruction>(V2);
+      if (CurrLevel == MaxLevel || !(I1 && I2) || I1 == I2 ||
+          ShallowScoreAtThisLevel == VLOperands::ScoreFail ||
+          (isa<LoadInst>(I1) && isa<LoadInst>(I2) && ShallowScoreAtThisLevel))
+        return ShallowScoreAtThisLevel;
+      assert(I1 && I2 && "Should have early exited.");
+
+      // Contains the I2 operand indexes that got matched with I1 operands.
+      SmallSet<int, 4> Op2Used;
+
+      // Recursion towards the operands of I1 and I2. We are trying all possible
+      // operand pairs, and keeping track of the best score.
+      for (int OpIdx1 = 0, NumOperands1 = I1->getNumOperands();
+           OpIdx1 != NumOperands1; ++OpIdx1) {
+        // Try to pair op1I with the best operand of I2.
+        int MaxTmpScore = 0;
+        int MaxOpIdx2 = -1;
+        for (int OpIdx2 = 0, NumOperands2 = I2->getNumOperands();
+             OpIdx2 != NumOperands2; ++OpIdx2) {
+          // Skip operands already paired with OpIdx1.
+          if (Op2Used.count(OpIdx2))
+            continue;
+          // Recursively calculate the cost at each level
+          int TmpScore =
+              getScoreAtLevelRec(I1->getOperand(OpIdx1), I2->getOperand(OpIdx2),
+                                 CurrLevel + 1, MaxLevel, DL, SE);
+          // Look for the best score.
+          if (TmpScore > VLOperands::ScoreFail && TmpScore > MaxTmpScore) {
+            MaxTmpScore = TmpScore;
+            MaxOpIdx2 = OpIdx2;
+          }
+        }
+        if (MaxOpIdx2 >= 0) {
+          // Pair {OpIdx1, MaxOpIdx2} was found to be best. Never revisit it.
+          Op2Used.insert(MaxOpIdx2);
+          ShallowScoreAtThisLevel += MaxTmpScore;
+        }
+      }
+      return ShallowScoreAtThisLevel;
+    }
+
+    /// \returns the look-ahead score, which tells us how much the sub-trees
+    /// rooted at \p LHS and \p RHS match, the more they match the higher the
+    /// score.
+    static int getLookAheadScore(Value *LHS, Value *RHS, const DataLayout &DL,
+                                 ScalarEvolution &SE) {
+      return getScoreAtLevelRec(LHS, RHS, 1, LookAheadMaxDepth, DL, SE);
+    }

     // Search all operands in Ops[*][Lane] for the one that matches best
     // Ops[OpIdx][LastLane] and return its operand index.
     // If no good match can be found, return None.
     Optional<unsigned>
     getBestOperand(unsigned OpIdx, int Lane, int LastLane,
                    ArrayRef<ReorderingMode> ReorderingModes) {
       unsigned NumOperands = getNumOperands();

       // The operand of the previous lane at OpIdx.
       Value *OpLastLane = getData(OpIdx, LastLane).V;

       // Our strategy mode for OpIdx.
       ReorderingMode RMode = ReorderingModes[OpIdx];

       // The linearized opcode of the operand at OpIdx, Lane.
       bool OpIdxAPO = getData(OpIdx, Lane).APO;

-      const unsigned BestScore = 2;
-      const unsigned GoodScore = 1;
-
       // The best operand index and its score.
       // Sometimes we have more than one option (e.g., Opcode and Undefs), so we
       // are using the score to differentiate between the two.
       struct BestOpData {
         Optional<unsigned> Idx = None;
         unsigned Score = 0;
       } BestOp;

       // Iterate through all unused operands and look for the best.
       for (unsigned Idx = 0; Idx != NumOperands; ++Idx) {
         // Get the operand at Idx and Lane.
         OperandData &OpData = getData(Idx, Lane);
         Value *Op = OpData.V;
         bool OpAPO = OpData.APO;

         // Skip already selected operands.
         if (OpData.IsUsed)
           continue;

         // Skip if we are trying to move the operand to a position with a
         // different opcode in the linearized tree form. This would break the
         // semantics.
         if (OpAPO != OpIdxAPO)
           continue;

         // Look for an operand that matches the current mode.
         switch (RMode) {
         case ReorderingMode::Load:
-          if (isa<LoadInst>(Op)) {
-            // Figure out which is left and right, so that we can check for
-            // consecutive loads
-            bool LeftToRight = Lane > LastLane;
-            Value *OpLeft = (LeftToRight) ? OpLastLane : Op;
-            Value *OpRight = (LeftToRight) ? Op : OpLastLane;
-            if (isConsecutiveAccess(cast<LoadInst>(OpLeft),
-                                    cast<LoadInst>(OpRight), DL, SE))
-              BestOp.Idx = Idx;
-          }
-          break;
-        case ReorderingMode::Opcode:
-          // We accept both Instructions and Undefs, but with different scores.
-          if ((isa<Instruction>(Op) && isa<Instruction>(OpLastLane) &&
-               cast<Instruction>(Op)->getOpcode() ==
-                   cast<Instruction>(OpLastLane)->getOpcode()) ||
-              (isa<UndefValue>(OpLastLane) && isa<Instruction>(Op)) ||
-              isa<UndefValue>(Op)) {
-            // An instruction has a higher score than an undef.
-            unsigned Score = (isa<UndefValue>(Op)) ? GoodScore : BestScore;
-            if (Score > BestOp.Score) {
-              BestOp.Idx = Idx;
-              BestOp.Score = Score;
-            }
-          }
-          break;
-        case ReorderingMode::Constant:
-          if (isa<Constant>(Op)) {
-            unsigned Score = (isa<UndefValue>(Op)) ? GoodScore : BestScore;
-            if (Score > BestOp.Score) {
-              BestOp.Idx = Idx;
-              BestOp.Score = Score;
-            }
-          }
-          break;
+        case ReorderingMode::Constant:
+        case ReorderingMode::Opcode: {
+          bool LeftToRight = Lane > LastLane;
+          Value *OpLeft = (LeftToRight) ? OpLastLane : Op;
+          Value *OpRight = (LeftToRight) ? Op : OpLastLane;
+          unsigned Score = getLookAheadScore(OpLeft, OpRight, DL, SE);
+          if (Score > BestOp.Score) {
+            BestOp.Idx = Idx;
+            BestOp.Score = Score;
+          }
+          break;
+        }
         case ReorderingMode::Splat:
           if (Op == OpLastLane)
             BestOp.Idx = Idx;
           break;
         case ReorderingMode::Failed:
           return None;
         }
       }
[... 6,029 lines not shown ...]

test/Transforms/SLPVectorizer/X86/lookahead.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -mcpu=corei7-avx \| FileCheck %s		; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -mcpu=corei7-avx \| FileCheck %s
;		;
; This checks the look-ahead operand reordering heuristic		; This file tests the look-ahead operand reordering heuristic.
		;
		;
		; This checks that operand reordering will reorder the operands of the adds
		; by taking into consideration the instructions beyond the immediate
		; predecessors.
;		;
; A[0] B[0] C[0] D[0] C[1] D[1] A[1] B[1]		; A[0] B[0] C[0] D[0] C[1] D[1] A[1] B[1]
; \ / \ / \ / \ /		; \ / \ / \ / \ /
; - - - -		; - - - -
; \ / \ /		; \ / \ /
; + +		; + +
; \| \|		; \| \|
; S[0] S[1]		; S[0] S[1]
;		;
define void @test(double* %array) {		define void @lookahead_basic(double* %array) {
; CHECK-LABEL: @test(		; CHECK-LABEL: @lookahead_basic(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[IDX0:%.]] = getelementptr inbounds double, double [[ARRAY:%.*]], i64 0		; CHECK-NEXT: [[IDX0:%.]] = getelementptr inbounds double, double [[ARRAY:%.*]], i64 0
; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2		; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2
; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3		; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3
; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4		; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4
; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5		; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5
; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6		; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6
; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7		; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7
; CHECK-NEXT: [[A_0:%.*]] = load double, double* [[IDX0]], align 8		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
; CHECK-NEXT: [[A_1:%.*]] = load double, double* [[IDX1]], align 8		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
; CHECK-NEXT: [[B_0:%.*]] = load double, double* [[IDX2]], align 8		; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[IDX2]] to <2 x double>*
; CHECK-NEXT: [[B_1:%.*]] = load double, double* [[IDX3]], align 8		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8
; CHECK-NEXT: [[C_0:%.*]] = load double, double* [[IDX4]], align 8		; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[IDX4]] to <2 x double>*
; CHECK-NEXT: [[C_1:%.*]] = load double, double* [[IDX5]], align 8		; CHECK-NEXT: [[TMP5:%.*]] = load <2 x double>, <2 x double>* [[TMP4]], align 8
; CHECK-NEXT: [[D_0:%.*]] = load double, double* [[IDX6]], align 8		; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[IDX6]] to <2 x double>*
; CHECK-NEXT: [[D_1:%.*]] = load double, double* [[IDX7]], align 8		; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8
; CHECK-NEXT: [[SUBAB_0:%.*]] = fsub fast double [[A_0]], [[B_0]]		; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[SUBCD_0:%.*]] = fsub fast double [[C_0]], [[D_0]]		; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]]
; CHECK-NEXT: [[SUBAB_1:%.*]] = fsub fast double [[A_1]], [[B_1]]		; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP8]], [[TMP9]]
; CHECK-NEXT: [[SUBCD_1:%.*]] = fsub fast double [[C_1]], [[D_1]]		; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
; CHECK-NEXT: [[ADDABCD_0:%.*]] = fadd fast double [[SUBAB_0]], [[SUBCD_0]]		; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8
; CHECK-NEXT: [[ADDCDAB_1:%.*]] = fadd fast double [[SUBCD_1]], [[SUBAB_1]]
; CHECK-NEXT: store double [[ADDABCD_0]], double* [[IDX0]], align 8
; CHECK-NEXT: store double [[ADDCDAB_1]], double* [[IDX1]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%idx0 = getelementptr inbounds double, double* %array, i64 0		%idx0 = getelementptr inbounds double, double* %array, i64 0
%idx1 = getelementptr inbounds double, double* %array, i64 1		%idx1 = getelementptr inbounds double, double* %array, i64 1
%idx2 = getelementptr inbounds double, double* %array, i64 2		%idx2 = getelementptr inbounds double, double* %array, i64 2
%idx3 = getelementptr inbounds double, double* %array, i64 3		%idx3 = getelementptr inbounds double, double* %array, i64 3
%idx4 = getelementptr inbounds double, double* %array, i64 4		%idx4 = getelementptr inbounds double, double* %array, i64 4
[18 unchanged lines not shown]

%addABCD_0 = fadd fast double %subAB_0, %subCD_0		%addABCD_0 = fadd fast double %subAB_0, %subCD_0
%addCDAB_1 = fadd fast double %subCD_1, %subAB_1		%addCDAB_1 = fadd fast double %subCD_1, %subAB_1

store double %addABCD_0, double *%idx0, align 8		store double %addABCD_0, double *%idx0, align 8
store double %addCDAB_1, double *%idx1, align 8		store double %addCDAB_1, double *%idx1, align 8
ret void		ret void
}		}


		; Check whether the look-ahead operand reordering heuristic will avoid
		; bundling the alt opcodes. The vectorized code should have no shuffles.
		;
		; A[0] B[0] A[0] B[0]  A[1] B[1] A[1] B[1]
		;    \ /       \ /        \ /       \ /
		;     +         -          -         +
		;      \       /            \       /
		;          +                    +
		;          |                    |
		;        S[0]                 S[1]
		;
		define void @lookahead_alt1(double* %array) {
		; CHECK-LABEL: @lookahead_alt1(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
		; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2
		; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3
		; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4
		; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5
		; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6
		; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7
		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
		; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[IDX2]] to <2 x double>*
		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8
		; CHECK-NEXT: [[TMP4:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
		; CHECK-NEXT: [[TMP5:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]]
		; CHECK-NEXT: [[TMP6:%.*]] = fadd fast <2 x double> [[TMP5]], [[TMP4]]
		; CHECK-NEXT: [[TMP7:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP6]], <2 x double>* [[TMP7]], align 8
		; CHECK-NEXT: ret void
		;
		entry:
		%idx0 = getelementptr inbounds double, double* %array, i64 0
		%idx1 = getelementptr inbounds double, double* %array, i64 1
		%idx2 = getelementptr inbounds double, double* %array, i64 2
		%idx3 = getelementptr inbounds double, double* %array, i64 3
		%idx4 = getelementptr inbounds double, double* %array, i64 4
		%idx5 = getelementptr inbounds double, double* %array, i64 5
		%idx6 = getelementptr inbounds double, double* %array, i64 6
		%idx7 = getelementptr inbounds double, double* %array, i64 7

		%A_0 = load double, double *%idx0, align 8
		%A_1 = load double, double *%idx1, align 8
		%B_0 = load double, double *%idx2, align 8
		%B_1 = load double, double *%idx3, align 8

		%addAB_0_L = fadd fast double %A_0, %B_0
		%subAB_0_R = fsub fast double %A_0, %B_0

		%subAB_1_L = fsub fast double %A_1, %B_1
		%addAB_1_R = fadd fast double %A_1, %B_1

		%addABCD_0 = fadd fast double %addAB_0_L, %subAB_0_R
		%addCDAB_1 = fadd fast double %subAB_1_L, %addAB_1_R

		store double %addABCD_0, double *%idx0, align 8
		store double %addCDAB_1, double *%idx1, align 8
		ret void
		}


		; This code should get vectorized all the way to the loads with shuffles for
		; the alt opcodes.
		;
		; A[0] B[0] C[0] D[0]  C[1] D[1] A[1] B[1]
		;    \ /       \ /        \ /       \ /
		;     +         -          +         -
		;      \       /            \       /
		;          +                    +
		;          |                    |
		;        S[0]                 S[1]
		;
		define void @lookahead_alt2(double* %array) {
		; CHECK-LABEL: @lookahead_alt2(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
		; CHECK-NEXT: [[IDX2:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 2
		; CHECK-NEXT: [[IDX3:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 3
		; CHECK-NEXT: [[IDX4:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 4
		; CHECK-NEXT: [[IDX5:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 5
		; CHECK-NEXT: [[IDX6:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 6
		; CHECK-NEXT: [[IDX7:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 7
		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
		; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[IDX2]] to <2 x double>*
		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[IDX4]] to <2 x double>*
		; CHECK-NEXT: [[TMP5:%.*]] = load <2 x double>, <2 x double>* [[TMP4]], align 8
		; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[IDX6]] to <2 x double>*
		; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8
		; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]]
		; CHECK-NEXT: [[TMP9:%.*]] = fadd fast <2 x double> [[TMP5]], [[TMP7]]
		; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <2 x double> [[TMP8]], <2 x double> [[TMP9]], <2 x i32> <i32 0, i32 3>
		; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]]
		; CHECK-NEXT: [[TMP12:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]
		; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x double> [[TMP11]], <2 x double> [[TMP12]], <2 x i32> <i32 0, i32 3>
		; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP13]], [[TMP10]]
		; CHECK-NEXT: [[TMP15:%.*]] = bitcast double* [[IDX0]] to <2 x double>*
		; CHECK-NEXT: store <2 x double> [[TMP14]], <2 x double>* [[TMP15]], align 8
		; CHECK-NEXT: ret void
		;
		entry:
		%idx0 = getelementptr inbounds double, double* %array, i64 0
		%idx1 = getelementptr inbounds double, double* %array, i64 1
		%idx2 = getelementptr inbounds double, double* %array, i64 2
		%idx3 = getelementptr inbounds double, double* %array, i64 3
		%idx4 = getelementptr inbounds double, double* %array, i64 4
		%idx5 = getelementptr inbounds double, double* %array, i64 5
		%idx6 = getelementptr inbounds double, double* %array, i64 6
		%idx7 = getelementptr inbounds double, double* %array, i64 7

		%A_0 = load double, double *%idx0, align 8
		%A_1 = load double, double *%idx1, align 8
		%B_0 = load double, double *%idx2, align 8
		%B_1 = load double, double *%idx3, align 8
		%C_0 = load double, double *%idx4, align 8
		%C_1 = load double, double *%idx5, align 8
		%D_0 = load double, double *%idx6, align 8
		%D_1 = load double, double *%idx7, align 8

		%addAB_0 = fadd fast double %A_0, %B_0
		%subCD_0 = fsub fast double %C_0, %D_0

		%addCD_1 = fadd fast double %C_1, %D_1
		%subAB_1 = fsub fast double %A_1, %B_1

		%addABCD_0 = fadd fast double %addAB_0, %subCD_0
		%addCDAB_1 = fadd fast double %addCD_1, %subAB_1

		store double %addABCD_0, double *%idx0, align 8
		store double %addCDAB_1, double *%idx1, align 8
		ret void
		}
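
The tests above rely on operand reordering being able to tell apart candidates that have the same opcode. The following is a rough C++ sketch of that idea, not the code from this patch: `Node`, `makeLoad`, `makeBinOp`, `shallowScore`, `lookAheadScore` and the score constants are all hypothetical names and values chosen for illustration, and only the relative ordering of the scores is meant to matter. It assumes a toy IR with scalar loads from a single array and `fadd`/`fsub` operators.

```cpp
#include <algorithm>
#include <cassert>

// Toy stand-in for an IR value (hypothetical, not an LLVM class): either a
// scalar load of array element `index`, or a binary operator on two operands.
enum class Kind { Load, FAdd, FSub };

struct Node {
  Kind kind;
  int index;        // Load only: element index within the array
  const Node *lhs;  // FAdd/FSub only
  const Node *rhs;  // FAdd/FSub only
};

Node makeLoad(int index) { return Node{Kind::Load, index, nullptr, nullptr}; }

// The operands must outlive the returned node; only their addresses are kept.
Node makeBinOp(Kind kind, const Node &lhs, const Node &rhs) {
  return Node{kind, 0, &lhs, &rhs};
}

// Illustrative scores (assumed values): higher means a better lane pairing.
constexpr int ScoreFail = 0;
constexpr int ScoreAltOpcodes = 1;
constexpr int ScoreSameOpcode = 2;
constexpr int ScoreConsecutiveLoads = 3;

// Score for putting `a` in lane N and `b` in lane N+1, looking only at the
// two values themselves and not at their operands.
int shallowScore(const Node &a, const Node &b) {
  if (a.kind == Kind::Load && b.kind == Kind::Load)
    return b.index == a.index + 1 ? ScoreConsecutiveLoads : ScoreFail;
  if (a.kind == Kind::Load || b.kind == Kind::Load)
    return ScoreFail;
  return a.kind == b.kind ? ScoreSameOpcode : ScoreAltOpcodes;
}

// Shallow score plus the best operand pairing, recursing `depth` more levels.
int lookAheadScore(const Node &a, const Node &b, int depth) {
  assert(depth >= 0 && "depth must be non-negative");
  const int score = shallowScore(a, b);
  // Stop on a mismatch, at the depth limit, or at leaves (a non-failing score
  // with a load in `a` means both values are loads).
  if (score == ScoreFail || depth == 0 || a.kind == Kind::Load)
    return score;
  int operands = lookAheadScore(*a.lhs, *b.lhs, depth - 1) +
                 lookAheadScore(*a.rhs, *b.rhs, depth - 1);
  // Only a pair of commutative operators may also match operands crosswise.
  if (a.kind == Kind::FAdd && b.kind == Kind::FAdd) {
    const int swapped = lookAheadScore(*a.lhs, *b.rhs, depth - 1) +
                        lookAheadScore(*a.rhs, *b.lhs, depth - 1);
    operands = std::max(operands, swapped);
  }
  return score + operands;
}
```

With nodes modelled on the first test (`subAB_0 = A[0] - B[0]`, `subAB_1 = A[1] - B[1]`, `subCD_1 = C[1] - D[1]`, loads at indices 0 through 7), a depth of 0 gives both candidate partners for `subAB_0` the same score of 2, since all three are `fsub`. At depth 1 the pairing with `subAB_1` scores 8 because both operand pairs are consecutive loads, while the pairing with `subCD_1` stays at 2, so the tie is broken in favour of the vectorizable order.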

Revision Contents

Diff 200564

lib/Transforms/Vectorize/SLPVectorizer.cpp

test/Transforms/SLPVectorizer/X86/lookahead.ll
