This is an archive of the discontinued LLVM Phabricator instance.

Differential D116688

[SLP]Excluded external uses from the reordering estimation.
ClosedPublic

Authored by ABataev on Jan 5 2022, 12:16 PM.


Details

Reviewers

vporpo
RKSimon
anton-afanasyev
dtemirbulatov

Commits

rG802ceb8343a2: [SLP]Excluded external uses from the reordering estimation.

Summary

The compiler adds an estimate for external uses during the operand
reordering analysis, which makes it tend to prefer duplicates in the
lanes over a diamond/shuffled match in the graph. This changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimate in the analysis phase, because we only need to
choose the most compatible instruction, and it does not matter whether it has
an external user or is used in a non-matching lane. Instead, we count the
number of unique instructions in the lane and check whether the reassociation
makes the number of unique scalars a power of 2. Having a power-of-2
number of unique scalars in the lane is considered more profitable
than having a non-power-of-2 number of unique scalars.
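
The power-of-2 preference can be sketched in isolation. The following is a simplified, standalone illustration, not the patch's code: scalars are modelled as plain `int` IDs, `powerOf2Ceil` stands in for LLVM's `PowerOf2Ceil`, and the helper name `splatCost` and its parameters are hypothetical. It follows the shape of `getSplatCost` in Diff 397673: the result is the padding needed with the candidate minus the padding needed with the current value, so a negative result favors the candidate.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Smallest power of two that is >= N (for N > 0).
// Stand-in for llvm::PowerOf2Ceil so the sketch has no LLVM dependency.
uint64_t powerOf2Ceil(uint64_t N) {
  uint64_t P = 1;
  while (P < N)
    P <<= 1;
  return P;
}

// Number of extra slots needed to round Count up to a power of two.
int paddingToPowerOf2(int Count) {
  assert(Count > 0 && "expected at least one unique scalar");
  return static_cast<int>(powerOf2Ceil(Count)) - Count;
}

// Hypothetical helper. OtherLanes holds the scalars already placed in the
// other lanes of this operand, Current is the scalar now in this lane, and
// Candidate is the scalar we consider swapping in.
int splatCost(const std::vector<int> &OtherLanes, int Current, int Candidate) {
  if (Candidate == Current)
    return 0;
  std::set<int> Uniques(OtherLanes.begin(), OtherLanes.end());
  int UniquesCount = static_cast<int>(Uniques.size());
  int WithCandidate = Uniques.count(Candidate) ? UniquesCount : UniquesCount + 1;
  int WithCurrent = Uniques.count(Current) ? UniquesCount : UniquesCount + 1;
  if (WithCandidate == WithCurrent)
    return 0;
  return paddingToPowerOf2(WithCandidate) - paddingToPowerOf2(WithCurrent);
}
```

For example, if the other lanes hold {1, 2, 3}, keeping a fourth distinct scalar gives 4 unique values (no padding), while swapping in a duplicate of 1 gives 3 unique values (one slot of padding), so the swap has a cost of 1.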

Metric: SLP.NumVectorInstructions

                                                                       Program  results results0     diff
           test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test    70.00    86.00    22.9%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  4527.00  4630.00     2.3%
              test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   346.00   353.00     2.0%
             test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   346.00   353.00     2.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9100.00  9275.00     1.9%
   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   235.00   239.00     1.7%
          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   235.00   239.00     1.7%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  8737.00  8859.00     1.4%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  1051.00  1064.00     1.2%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1628.00  1646.00     1.1%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1628.00  1646.00     1.1%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3565.00  3577.00     0.3%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3565.00  3577.00     0.3%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  4240.00  4250.00     0.2%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1996.00  1998.00     0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1671.00  1672.00     0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   783.00   782.00    -0.1%
                      test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    69.00    68.00    -1.4%
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   207.00   192.00    -7.2%
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   207.00   192.00    -7.2%
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    89.00    80.00   -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    89.00    80.00   -10.1%
       test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   260.00   215.00   -17.3%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   256.00   211.00   -17.6%

MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty much the same.
SingleSource/Benchmarks/Misc/oourafft.test - two <2 x> loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - a function gets vectorized and
is not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same.
External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in a
multi-block tree, the result is pretty much the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
70 ms	x64 debian > LLVM.Bindings/Go::go.test

Event Timeline

ABataev created this revision. Jan 5 2022, 12:16 PM

Herald added subscribers: dmgreen, hiraditya. Jan 5 2022, 12:16 PM

ABataev requested review of this revision. Jan 5 2022, 12:16 PM

Herald added a project: Restricted Project. Jan 5 2022, 12:16 PM

Harbormaster completed remote builds in B141745: Diff 397673. Jan 5 2022, 1:34 PM

I don't fully understand why you are completely removing the cost of external uses. This models the additional extract instructions that will be needed if we decide to proceed with that specific operand order. It is useful when you have to break ties, for example if we have to choose between instructions of the same opcode, one with external uses and the other without. In that case we would prefer to vectorize the instructions without external uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) rather than being used only as a tie breaker when the costs are the same.
If I understand correctly, the splat cost is orthogonal to the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144	Please explain in the comments why it is more profitable.
1144	Could you also add to the comment what `OpIdx` and `Idx` are.
1146	This needs a better name because it gets a bit confusing later on when you introduce `OpV`. Perhaps call this `OpIdxV` and the other one `IdxV` ?
1150	Could you rename `I` to `Ln` ?
1159–1161	Perhaps replace these lines with: `int UniquesCountWithV = Uniques.contains(V) ? UniquesCount : UniquesCount + 1;`
1163–1165	same
1270	Could you add to the comment about the what each `unsigned` is in the pair? I think it is operand index and lane.
1327	Why skip if score is 0 ?
1328	Is it Score or Cost? Score usually suggests the higher the better (and Cost the lower the better).
3047	Why is it better to do this bottom-up ? Could this be a separate patch?

In D116688#3224141, @vporpo wrote:

I don't fully understand why you are completely removing the cost of external uses. This models the additional extract instructions that will be needed if we decide to proceed with that specific operand order. It is useful when you have to break ties, for example if we have to choose between instructions of the same opcode, one with external uses and the other without. In that case we would prefer to vectorize the instructions without external uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) rather than being used only as a tie breaker when the costs are the same.
If I understand correctly, the splat cost is orthogonal to the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

Sure, I will split the patch (already did it for the first part).

There are several reasons:

1. External uses. They affect the cost in any case (it does not matter in which lane the scalar is used), so we can just ignore them.
2. In-tree uses (uses in the graph). The current estimate is too pessimistic. We subtract 1 for each such use, so the total cost of the lane becomes NumLanes less, though in the worst case it should be just 1-2 less because we end up with a single shuffle, not with a gather of extracts. It is better to ignore all these uses and just count the number of unique scalars to see if we can vectorize the lane after removing reused scalars.

ABataev mentioned this in D116740: [SLP]Improve reordering for the nodes beeing used in alternate vectorization. Jan 6 2022, 6:43 AM

ABataev added inline comments. Jan 6 2022, 7:18 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144	Do you suggest adding the explanation here?
1146	Ok
1150	Ok
1159–1161	Ok
1270	Sure, will do
1327	Score 0 means failed match, it won't be vectorized for sure so no need to check for the reused scalars.
1328	It is cost currently, will rework it to score.
3047	Yes, see D116740

ABataev mentioned this in rGd130df544d6c: [SLP]Improve reordering for the nodes beeing used in alternate vectorization. Jan 6 2022, 11:20 AM

I agree with your second point, it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:

%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2

While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144	Yes, please add the explanation in the comments.

In D116688#3225955, @vporpo wrote:
I agree with your second point, it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:
%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2
While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

These 2 loads from A[1] will be combined by instcombine before SLP.

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

In D116688#3226015, @vporpo wrote:

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

If the instructions are the same, they will be combined. If the instructions are different, it is still better to choose the instruction with the higher score, even if it is externally used (we may have a deeper graph, which is still better for vectorization). There is only one relevant case: the instructions are very similar (their scores are equal). If the instruction is externally used, it still might be vectorized as part of another tree. The only preference here, I believe, is between an instruction with a single use (or with all users vectorized) and an instruction with many uses. We can check for something like this and consider an instruction with a single use (or with all users vectorized) a better choice than an instruction whose users are not all vectorized.

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?
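
The tie-breaking rule agreed on here (compare scores first, look at uses only when the scores are equal) can be illustrated with a small standalone sketch. The `Candidate` struct and the `isBetterCandidate` function are hypothetical names introduced for illustration and are not part of SLPVectorizer.cpp.

```cpp
#include <cassert>

// Hypothetical summary of one operand candidate.
struct Candidate {
  int Score;               // look-ahead score: higher is better, 0 = failed match
  bool AllUsersVectorized; // true if every user of the value is vectorized
};

// Returns true if A should be preferred over B.
// The score decides first; the use information only breaks exact ties.
bool isBetterCandidate(const Candidate &A, const Candidate &B) {
  assert(A.Score >= 0 && B.Score >= 0 && "scores are non-negative here");
  if (A.Score != B.Score)
    return A.Score > B.Score;
  return A.AllUsersVectorized && !B.AllUsersVectorized;
}
```

With this ordering, a candidate with a higher score wins even if it has external users, and among equally scored candidates the one whose users are all vectorized is preferred.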

In D116688#3226339, @vporpo wrote:

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?

I was going to add the analysis of the usage to the patch. I have a quick prototype, but it has some regressions; I will work on the improvements tomorrow.

Added score for all vectorized users.

Harbormaster completed remote builds in B142153: Diff 398233. Jan 7 2022, 2:44 PM

vporpo added inline comments. Jan 7 2022, 3:13 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1180	Could you add a comment explaining why we return `ScoreAllUserVectorized` for this type of instructions?
1326–1336	Since the score calculation is a bit more complicated now, I think it makes sense to move all the score calculation logic into a separate function like `getScore()` which will help hide all the calls to the separate score functions `getLookAheadScore()`, `getSplatScore()`, `getExternalScore()` and the score scaling. What do you think?
1340	nit: use a static constexpr for the scaling factor
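
One way the suggested `getScore()` wrapper could combine the pieces is sketched below. This is an assumption-laden illustration, not the code that landed: the component scores are passed in as plain integers instead of being computed by `getLookAheadScore()`, `getSplatScore()` and `getExternalScore()`, the combination formula is a guess, and the value of `ScoreScaleFactor` is hypothetical. The only points taken from the discussion are that a score of 0 means a failed match, that the scaling factor should be a `static constexpr`, and that uses should act only as a tie-breaker.

```cpp
#include <cassert>

constexpr int ScoreFail = 0;
// Hypothetical value. It must be larger than any external-use score so that
// the external-use term can only break ties between equal primary scores.
constexpr int ScoreScaleFactor = 10;

// Hypothetical signature: the three inputs stand for the results of the
// look-ahead, splat and external-use scoring functions.
int getScore(int LookAheadScore, int SplatScore, int ExternalScore) {
  assert(LookAheadScore >= 0 && SplatScore >= 0);
  assert(ExternalScore >= 0 && ExternalScore < ScoreScaleFactor);
  // A failed match will not be vectorized, so skip the remaining analysis.
  if (LookAheadScore == ScoreFail)
    return ScoreFail;
  return (LookAheadScore + SplatScore) * ScoreScaleFactor + ExternalScore;
}
```

Scaling the primary score before adding the external-use term keeps a single integer comparison at the call site while preserving the "score first, uses second" ordering.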

RKSimon retitled this revision from [SLP]Excluded external uses from the reprdering estimation. to [SLP]Excluded external uses from the reordering estimation. Jan 12 2022, 2:58 AM

ABataev marked an inline comment as done. Feb 1 2022, 10:17 AM

ABataev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1326–1336	I would keep all these smaller functions but will create `getScore()` for a final score.
1340	Yep, will fix it.

Rebase + address comments

vporpo accepted this revision. Feb 2 2022, 11:22 AM

This revision is now accepted and ready to land. Feb 2 2022, 11:22 AM

Harbormaster completed remote builds in B147208: Diff 405365. Feb 2 2022, 2:51 PM

This revision was landed with ongoing or failed builds. Feb 3 2022, 6:55 AM

Closed by commit rG802ceb8343a2: [SLP]Excluded external uses from the reordering estimation. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG802ceb8343a2: [SLP]Excluded external uses from the reordering estimation..

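Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	238 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll	19 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll	19 lines
llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll	8 lines
llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll	27 lines
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias-inseltpoison.ll	4 lines
llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias.ll	4 lines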

Diff 397673

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ 158 lines not shown @@ static cl::opt<unsigned> MinTreeSize(
     cl::desc("Only vectorize small trees if they are fully vectorizable"));

 // The maximum depth that the look-ahead score heuristic will explore.
 // The higher this value, the higher the compilation time overhead.
 static cl::opt<int> LookAheadMaxDepth(
     "slp-max-look-ahead-depth", cl::init(2), cl::Hidden,
     cl::desc("The maximum look-ahead depth for operand reordering scores"));

-// The Look-ahead heuristic goes through the users of the bundle to calculate
-// the users cost in getExternalUsesCost(). To avoid compilation time increase
-// we limit the number of users visited to this value.
-static cl::opt<unsigned> LookAheadUsersBudget(
-    "slp-look-ahead-users-budget", cl::init(2), cl::Hidden,
-    cl::desc("The maximum number of users to visit while visiting the "
-             "predecessors. This prevents compilation time increase."));
-
 static cl::opt<bool>
     ViewSLPTree("view-slp-tree", cl::Hidden,
                 cl::desc("Display the SLP trees with Graphviz"));

 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
 static const unsigned AliasedCheckLimit = 10;

@@ 802 lines not shown @@ class VLOperands {

   using OperandDataVec = SmallVector<OperandData, 2>;

   /// A vector of operand vectors.
   SmallVector<OperandDataVec, 4> OpsVec;

   const DataLayout &DL;
   ScalarEvolution &SE;
-  const BoUpSLP &R;

   /// \returns the operand data at \p OpIdx and \p Lane.
   OperandData &getData(unsigned OpIdx, unsigned Lane) {
     return OpsVec[OpIdx][Lane];
   }

   /// \returns the operand data at \p OpIdx and \p Lane. Const version.
   const OperandData &getData(unsigned OpIdx, unsigned Lane) const {
@@ 40 lines not shown @@ class VLOperands {
   /// Instructions with alt opcodes (e.g, add + sub).
   static const int ScoreAltOpcodes = 1;
   /// Identical instructions (a.k.a. splat or broadcast).
   static const int ScoreSplat = 1;
   /// Matching with an undef is preferable to failing.
   static const int ScoreUndef = 1;
   /// Score for failing to find a decent match.
   static const int ScoreFail = 0;
-  /// User exteranl to the vectorized code.
-  static const int ExternalUseCost = 1;
-  /// The user is internal but in a different lane.
-  static const int UserInDiffLaneCost = ExternalUseCost;

   /// \returns the score of placing \p V1 and \p V2 in consecutive lanes.
+  /// Also, checks if \p V1 and \p V2 are compatible with instructions in \p
+  /// MainAltOps.
   static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,
-                             ScalarEvolution &SE, int NumLanes) {
+                             ScalarEvolution &SE, int NumLanes,
+                             ArrayRef<Value *> MainAltOps) {
     if (V1 == V2)
       return VLOperands::ScoreSplat;

     auto *LI1 = dyn_cast<LoadInst>(V1);
     auto *LI2 = dyn_cast<LoadInst>(V2);
     if (LI1 && LI2) {
       if (LI1->getParent() != LI2->getParent())
         return VLOperands::ScoreFail;
@@ 38 lines not shown @@ static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,
         if (isUndefVector(EV2) && EV2->getType() == EV1->getType())
           return VLOperands::ScoreConsecutiveExtracts;
         if (EV2 == EV1) {
           int Idx1 = Ex1Idx->getZExtValue();
           int Idx2 = Ex2Idx->getZExtValue();
           int Dist = Idx2 - Idx1;
           // The distance is too large - still may be profitable to use
           // shuffles.
+          if (std::abs(Dist) == 0)
+            return VLOperands::ScoreSplat;
           if (std::abs(Dist) > NumLanes / 2)
-            return VLOperands::ScoreAltOpcodes;
+            return VLOperands::ScoreSameOpcode;
           return (Dist > 0) ? VLOperands::ScoreConsecutiveExtracts
                             : VLOperands::ScoreReversedExtracts;
         }
+        return VLOperands::ScoreAltOpcodes;
       }
+      return VLOperands::ScoreFail;
     }

     auto *I1 = dyn_cast<Instruction>(V1);
     auto *I2 = dyn_cast<Instruction>(V2);
     if (I1 && I2) {
       if (I1->getParent() != I2->getParent())
         return VLOperands::ScoreFail;
-      InstructionsState S = getSameOpcode({I1, I2});
+      SmallVector<Value *, 4> Ops(MainAltOps.begin(), MainAltOps.end());
+      Ops.push_back(I1);
+      Ops.push_back(I2);
+      InstructionsState S = getSameOpcode(Ops);
       // Note: Only consider instructions with <= 2 operands to avoid
       // complexity explosion.
-      if (S.getOpcode() && S.MainOp->getNumOperands() <= 2)
+      if (S.getOpcode() &&
+          (S.MainOp->getNumOperands() <= 2 || !MainAltOps.empty() ||
+           !S.isAltShuffle()) &&
+          all_of(Ops, [&S](Value *V) {
+            return cast<Instruction>(V)->getNumOperands() ==
+                   S.MainOp->getNumOperands();
+          }))
         return S.isAltShuffle() ? VLOperands::ScoreAltOpcodes
                                 : VLOperands::ScoreSameOpcode;
     }

     if (isa<UndefValue>(V2))
       return VLOperands::ScoreUndef;

     return VLOperands::ScoreFail;
   }

-  /// Holds the values and their lanes that are taking part in the look-ahead
-  /// score calculation. This is used in the external uses cost calculation.
-  /// Need to hold all the lanes in case of splat/broadcast at least to
-  /// correctly check for the use in the different lane.
-  SmallDenseMap<Value *, SmallSet<int, 4>> InLookAheadValues;
-
-  /// \returns the additional cost due to uses of \p LHS and \p RHS that are
-  /// either external to the vectorized code, or require shuffling.
-  int getExternalUsesCost(const std::pair<Value *, int> &LHS,
-                          const std::pair<Value *, int> &RHS) {
-    int Cost = 0;
-    std::array<std::pair<Value *, int>, 2> Values = {{LHS, RHS}};
-    for (int Idx = 0, IdxE = Values.size(); Idx != IdxE; ++Idx) {
-      Value *V = Values[Idx].first;
-      if (isa<Constant>(V)) {
-        // Since this is a function pass, it doesn't make semantic sense to
-        // walk the users of a subclass of Constant. The users could be in
-        // another function, or even another module that happens to be in
-        // the same LLVMContext.
-        continue;
-      }
-
-      // Calculate the absolute lane, using the minimum relative lane of LHS
-      // and RHS as base and Idx as the offset.
-      int Ln = std::min(LHS.second, RHS.second) + Idx;
-      assert(Ln >= 0 && "Bad lane calculation");
-      unsigned UsersBudget = LookAheadUsersBudget;
-      for (User *U : V->users()) {
-        if (const TreeEntry *UserTE = R.getTreeEntry(U)) {
-          // The user is in the VectorizableTree. Check if we need to insert.
-          int UserLn = UserTE->findLaneForValue(U);
-          assert(UserLn >= 0 && "Bad lane");
-          // If the values are different, check just the line of the current
-          // value. If the values are the same, need to add UserInDiffLaneCost
-          // only if UserLn does not match both line numbers.
-          if ((LHS.first != RHS.first && UserLn != Ln) ||
-              (LHS.first == RHS.first && UserLn != LHS.second &&
-               UserLn != RHS.second)) {
-            Cost += UserInDiffLaneCost;
-            break;
-          }
-        } else {
-          // Check if the user is in the look-ahead code.
-          auto It2 = InLookAheadValues.find(U);
-          if (It2 != InLookAheadValues.end()) {
-            // The user is in the look-ahead code. Check the lane.
-            if (!It2->getSecond().contains(Ln)) {
-              Cost += UserInDiffLaneCost;
-              break;
-            }
-          } else {
-            // The user is neither in SLP tree nor in the look-ahead code.
-            Cost += ExternalUseCost;
-            break;
-          }
-        }
-        // Limit the number of visited uses to cap compilation time.
-        if (--UsersBudget == 0)
-          break;
-      }
-    }
-    return Cost;
-  }
+  /// \returns The additional cost due to possible broadcasting of the
+  /// elements in the lane. It is more profitable to have power-of-2 unique
+  /// elements in the lane, it will be vectorized with higher probability.
+  int getSplatCost(unsigned Lane, unsigned OpIdx, unsigned Idx) const {
+    Value *V = getData(Idx, Lane).V;
+    if (!isa<Instruction>(V) || V == getData(OpIdx, Lane).V)
+      return 0;
+    SmallPtrSet<Value *, 4> Uniques;
+    for (unsigned I = 0, E = getNumLanes(); I < E; ++I) {
+      if (I == Lane)
+        continue;
+      Value *OpV = getData(OpIdx, I).V;
+      if (!isa<Instruction>(OpV))
+        return 0;
+      Uniques.insert(OpV);
+    }
+    int UniquesCount = Uniques.size();
+    int UniquesCountWithV = UniquesCount;
+    if (!Uniques.contains(V))
+      ++UniquesCountWithV;
+    Value *OpV = getData(OpIdx, Lane).V;
+    int UniquesCountWithOpV = UniquesCount;
+    if (!Uniques.contains(OpV))
+      ++UniquesCountWithOpV;
+    if (UniquesCountWithV == UniquesCountWithOpV)
+      return 0;
+    return (PowerOf2Ceil(UniquesCountWithV) - UniquesCountWithV) -
+           (PowerOf2Ceil(UniquesCountWithOpV) - UniquesCountWithOpV);
+  }

   /// Go through the operands of \p LHS and \p RHS recursively until \p
   /// MaxLevel, and return the cummulative score. For example:
   /// \verbatim
   ///  A[0]  B[0]  A[1]  B[1]  C[0] D[0]  B[1] A[1]
   ///     \ /         \ /         \ /        \ /
   ///      +           +           +          +
   ///     G1          G2          G3         G4
   /// \endverbatim
   /// The getScoreAtLevelRec(G1, G2) function will try to match the nodes at
   /// each level recursively, accumulating the score. It starts from matching
   /// the additions at level 0, then moves on to the loads (level 1). The
   /// score of G1 and G2 is higher than G1 and G3, because {A[0],A[1]} and
   /// {B[0],B[1]} match with VLOperands::ScoreConsecutiveLoads, while
   /// {A[0],C[0]} has a score of VLOperands::ScoreFail.
   /// Please note that the order of the operands does not matter, as we
   /// evaluate the score of all profitable combinations of operands. In
   /// other words the score of G1 and G4 is the same as G1 and G2. This
   /// heuristic is based on ideas described in:
   ///   Look-ahead SLP: Auto-vectorization in the presence of commutative
   ///   operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
   ///   Luís F. W. Góes
-  int getScoreAtLevelRec(const std::pair<Value *, int> &LHS,
-                         const std::pair<Value *, int> &RHS, int CurrLevel,
-                         int MaxLevel) {
+  int getScoreAtLevelRec(Value *LHS, Value *RHS, int CurrLevel, int MaxLevel,
+                         ArrayRef<Value *> MainAltOps) {

-    Value *V1 = LHS.first;
-    Value *V2 = RHS.first;
     // Get the shallow score of V1 and V2.
-    int ShallowScoreAtThisLevel = std::max(
-        (int)ScoreFail, getShallowScore(V1, V2, DL, SE, getNumLanes()) -
-                            getExternalUsesCost(LHS, RHS));
-    int Lane1 = LHS.second;
-    int Lane2 = RHS.second;
+    int ShallowScoreAtThisLevel =
+        getShallowScore(LHS, RHS, DL, SE, getNumLanes(), MainAltOps);

     // If reached MaxLevel,
     // or if V1 and V2 are not instructions,
     // or if they are SPLAT,
     // or if they are not consecutive,
     // or if profitable to vectorize loads or extractelements, early return
     // the current cost.
-    auto *I1 = dyn_cast<Instruction>(V1);
-    auto *I2 = dyn_cast<Instruction>(V2);
+    auto *I1 = dyn_cast<Instruction>(LHS);
+    auto *I2 = dyn_cast<Instruction>(RHS);
     if (CurrLevel == MaxLevel || !(I1 && I2) || I1 == I2 ||
         ShallowScoreAtThisLevel == VLOperands::ScoreFail ||
         (((isa<LoadInst>(I1) && isa<LoadInst>(I2)) ||
+          (I1->getNumOperands() > 2 && I2->getNumOperands() > 2) ||
           (isa<ExtractElementInst>(I1) && isa<ExtractElementInst>(I2))) &&
          ShallowScoreAtThisLevel))
       return ShallowScoreAtThisLevel;
     assert(I1 && I2 && "Should have early exited.");

-    // Keep track of in-tree values for determining the external-use cost.
-    InLookAheadValues[V1].insert(Lane1);
-    InLookAheadValues[V2].insert(Lane2);
-
     // Contains the I2 operand indexes that got matched with I1 operands.
     SmallSet<unsigned, 4> Op2Used;

     // Recursion towards the operands of I1 and I2. We are trying all possible
     // operand pairs, and keeping track of the best score.
     for (unsigned OpIdx1 = 0, NumOperands1 = I1->getNumOperands();
          OpIdx1 != NumOperands1; ++OpIdx1) {
       // Try to pair op1I with the best operand of I2.
       int MaxTmpScore = 0;
       unsigned MaxOpIdx2 = 0;
       bool FoundBest = false;
       // If I2 is commutative try all combinations.
       unsigned FromIdx = isCommutative(I2) ? 0 : OpIdx1;
       unsigned ToIdx = isCommutative(I2)
                            ? I2->getNumOperands()
                            : std::min(I2->getNumOperands(), OpIdx1 + 1);
       assert(FromIdx <= ToIdx && "Bad index");
       for (unsigned OpIdx2 = FromIdx; OpIdx2 != ToIdx; ++OpIdx2) {
         // Skip operands already paired with OpIdx1.
         if (Op2Used.count(OpIdx2))
           continue;
         // Recursively calculate the cost at each level
-        int TmpScore = getScoreAtLevelRec({I1->getOperand(OpIdx1), Lane1},
-                                          {I2->getOperand(OpIdx2), Lane2},
-                                          CurrLevel + 1, MaxLevel);
+        int TmpScore =
+            getScoreAtLevelRec(I1->getOperand(OpIdx1), I2->getOperand(OpIdx2),
+                               CurrLevel + 1, MaxLevel, None);
         // Look for the best score.
         if (TmpScore > VLOperands::ScoreFail && TmpScore > MaxTmpScore) {
MaxTmpScore = TmpScore;		MaxTmpScore = TmpScore;
MaxOpIdx2 = OpIdx2;		MaxOpIdx2 = OpIdx2;
FoundBest = true;		FoundBest = true;
}		}
}		}
if (FoundBest) {		if (FoundBest) {
// Pair {OpIdx1, MaxOpIdx2} was found to be best. Never revisit it.		// Pair {OpIdx1, MaxOpIdx2} was found to be best. Never revisit it.
Op2Used.insert(MaxOpIdx2);		Op2Used.insert(MaxOpIdx2);
ShallowScoreAtThisLevel += MaxTmpScore;		ShallowScoreAtThisLevel += MaxTmpScore;
}		}
}		}
return ShallowScoreAtThisLevel;		return ShallowScoreAtThisLevel;
}		}
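
A minimal, self-contained sketch of the greedy operand pairing done by the recursion above. This is not the LLVM implementation: `ToyNode`, `toyShallowScore` and `toyScoreAtLevelRec` are hypothetical names, the shallow score is assumed to be 1 for equal opcodes and 0 otherwise, and the load/extractelement early exits are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction: an opcode, a commutativity
// flag, and operand pointers. Leaves have an empty operand list.
struct ToyNode {
  std::string Opcode;
  bool Commutative;
  std::vector<const ToyNode *> Ops;
};

// Assumed shallow score: 1 when the opcodes match, 0 (failure) otherwise.
int toyShallowScore(const ToyNode *A, const ToyNode *B) {
  return A->Opcode == B->Opcode ? 1 : 0;
}

// Recursively score how well the sub-trees rooted at A and B match.
int toyScoreAtLevelRec(const ToyNode *A, const ToyNode *B, int CurrLevel,
                       int MaxLevel) {
  int Score = toyShallowScore(A, B);
  // Early exit: depth limit, same node, failed match, or leaf nodes.
  if (CurrLevel == MaxLevel || A == B || Score == 0 || A->Ops.empty() ||
      B->Ops.empty())
    return Score;

  const unsigned NumOps1 = static_cast<unsigned>(A->Ops.size());
  const unsigned NumOps2 = static_cast<unsigned>(B->Ops.size());
  std::set<unsigned> Op2Used; // Operands of B that are already paired.
  for (unsigned I = 0; I != NumOps1; ++I) {
    // For a commutative B try every operand, otherwise only position I.
    unsigned From = B->Commutative ? 0 : I;
    unsigned To = B->Commutative ? NumOps2 : std::min(NumOps2, I + 1);
    int Best = 0;
    unsigned BestIdx = 0;
    bool Found = false;
    for (unsigned J = From; J < To; ++J) {
      if (Op2Used.count(J))
        continue;
      int Tmp = toyScoreAtLevelRec(A->Ops[I], B->Ops[J], CurrLevel + 1,
                                   MaxLevel);
      if (Tmp > Best) {
        Best = Tmp;
        BestIdx = J;
        Found = true;
      }
    }
    if (Found) {
      Op2Used.insert(BestIdx); // Never revisit the chosen operand of B.
      Score += Best;
    }
  }
  return Score;
}
```

With two `add` nodes whose operands are a load and a constant in opposite order, the commutative case pairs load with load and constant with constant, while the non-commutative case only credits the matching root opcodes.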

/// \Returns the look-ahead score, which tells us how much the sub-trees		/// \Returns the look-ahead score, which tells us how much the sub-trees
/// rooted at \p LHS and \p RHS match, the more they match the higher the		/// rooted at \p LHS and \p RHS match, the more they match the higher the
/// score. This helps break ties in an informed way when we cannot decide on		/// score. This helps break ties in an informed way when we cannot decide on
/// the order of the operands by just considering the immediate		/// the order of the operands by just considering the immediate
/// predecessors.		/// predecessors.
int getLookAheadScore(const std::pair<Value *, int> &LHS,		int getLookAheadScore(Value *LHS, Value *RHS,
const std::pair<Value *, int> &RHS) {		ArrayRef<Value *> MainAltOps) {
InLookAheadValues.clear();		return getScoreAtLevelRec(LHS, RHS, 1, LookAheadMaxDepth, MainAltOps);
return getScoreAtLevelRec(LHS, RHS, 1, LookAheadMaxDepth);
}		}

		/// Best defined scores per lanes between the passes. Used to choose the
		/// best operand (with the highest score) between the passes.
		SmallDenseMap<std::pair<unsigned, unsigned>, unsigned, 8>
		vporpo: Could you add to the comment what each `unsigned` in the pair is? I think it is operand index and lane.
		ABataev (author): Sure, will do.
		BestScoresPerLanes;

// Search all operands in Ops[*][Lane] for the one that matches best		// Search all operands in Ops[*][Lane] for the one that matches best
// Ops[OpIdx][LastLane] and return its operand index.		// Ops[OpIdx][LastLane] and return its operand index.
// If no good match can be found, return None.		// If no good match can be found, return None.
Optional<unsigned>		Optional<unsigned> getBestOperand(unsigned OpIdx, int Lane, int LastLane,
getBestOperand(unsigned OpIdx, int Lane, int LastLane,		ArrayRef<ReorderingMode> ReorderingModes,
ArrayRef<ReorderingMode> ReorderingModes) {		ArrayRef<Value *> MainAltOps) {
unsigned NumOperands = getNumOperands();		unsigned NumOperands = getNumOperands();

// The operand of the previous lane at OpIdx.		// The operand of the previous lane at OpIdx.
Value *OpLastLane = getData(OpIdx, LastLane).V;		Value *OpLastLane = getData(OpIdx, LastLane).V;

// Our strategy mode for OpIdx.		// Our strategy mode for OpIdx.
ReorderingMode RMode = ReorderingModes[OpIdx];		ReorderingMode RMode = ReorderingModes[OpIdx];

// The linearized opcode of the operand at OpIdx, Lane.		// The linearized opcode of the operand at OpIdx, Lane.
bool OpIdxAPO = getData(OpIdx, Lane).APO;		bool OpIdxAPO = getData(OpIdx, Lane).APO;

// The best operand index and its score.		// The best operand index and its score.
// Sometimes we have more than one option (e.g., Opcode and Undefs), so we		// Sometimes we have more than one option (e.g., Opcode and Undefs), so we
// are using the score to differentiate between the two.		// are using the score to differentiate between the two.
struct BestOpData {		struct BestOpData {
Optional<unsigned> Idx = None;		Optional<unsigned> Idx = None;
unsigned Score = 0;		unsigned Score = 0;
} BestOp;		} BestOp;
		BestOp.Score =
		BestScoresPerLanes.try_emplace(std::make_pair(OpIdx, Lane), 0)
		.first->second;

// Iterate through all unused operands and look for the best.		// Iterate through all unused operands and look for the best.
for (unsigned Idx = 0; Idx != NumOperands; ++Idx) {		for (unsigned Idx = 0; Idx != NumOperands; ++Idx) {
// Get the operand at Idx and Lane.		// Get the operand at Idx and Lane.
OperandData &OpData = getData(Idx, Lane);		OperandData &OpData = getData(Idx, Lane);
Value *Op = OpData.V;		Value *Op = OpData.V;
bool OpAPO = OpData.APO;		bool OpAPO = OpData.APO;

[10 lines elided]	Optional<unsigned> getBestOperand(unsigned OpIdx, int Lane, int LastLane,
// Look for an operand that matches the current mode.		// Look for an operand that matches the current mode.
switch (RMode) {		switch (RMode) {
case ReorderingMode::Load:		case ReorderingMode::Load:
case ReorderingMode::Constant:		case ReorderingMode::Constant:
case ReorderingMode::Opcode: {		case ReorderingMode::Opcode: {
bool LeftToRight = Lane > LastLane;		bool LeftToRight = Lane > LastLane;
Value *OpLeft = (LeftToRight) ? OpLastLane : Op;		Value *OpLeft = (LeftToRight) ? OpLastLane : Op;
Value *OpRight = (LeftToRight) ? Op : OpLastLane;		Value *OpRight = (LeftToRight) ? Op : OpLastLane;
unsigned Score =		int Score = getLookAheadScore(OpLeft, OpRight, MainAltOps);
getLookAheadScore({OpLeft, LastLane}, {OpRight, Lane});		if (Score) {
		vporpo: Why skip if the score is 0?
		ABataev (author): A score of 0 means a failed match. It won't be vectorized for sure, so there is no need to check for the reused scalars.
if (Score > BestOp.Score) {		int SplatScore = getSplatCost(Lane, OpIdx, Idx);
		vporpo: Is it Score or Cost? Score usually suggests the higher the better (and Cost the lower the better).
		ABataev (author): It is a cost currently; I will rework it to a score.
		if (Score <= SplatScore)
		// Set the minimum score for splat-like sequence to avoid setting
		// failed state.
		Score = 1;
		else
		Score -= SplatScore;
		}
		if (Score > static_cast<int>(BestOp.Score)) {
		vporpo: Since the score calculation is a bit more complicated now, I think it makes sense to move all the score calculation logic into a separate function like `getScore()`, which will help hide all the calls to the separate score functions `getLookAheadScore()`, `getSplatScore()`, `getExternalScore()` and the score scaling. What do you think?
		ABataev (author): I would keep all these smaller functions but will create `getScore()` for a final score.
BestOp.Idx = Idx;		BestOp.Idx = Idx;
BestOp.Score = Score;		BestOp.Score = Score;
		BestScoresPerLanes[std::make_pair(OpIdx, Lane)] = Score;
}		}
		vporpo: nit: use a static constexpr for the scaling factor.
		ABataev (author): Yep, will fix it.
break;		break;
}		}
case ReorderingMode::Splat:		case ReorderingMode::Splat:
if (Op == OpLastLane)		if (Op == OpLastLane)
BestOp.Idx = Idx;		BestOp.Idx = Idx;
break;		break;
case ReorderingMode::Failed:		case ReorderingMode::Failed:
return None;		return None;
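
The splat adjustment in the right-hand column of the `Opcode` case above can be read as a small pure function. The sketch below is an illustration only: `adjustForSplat` is a hypothetical name, and the two inputs stand for the results of `getLookAheadScore()` and `getSplatCost()` in this revision of the patch.

```cpp
#include <cassert>

// Hypothetical helper mirroring the adjustment in the diff:
//  - a look-ahead score of 0 is a failed match and is left untouched;
//  - if the splat cost would consume the whole score, clamp to 1 so that
//    the candidate is not mistaken for a failed match;
//  - otherwise subtract the splat cost from the look-ahead score.
int adjustForSplat(int LookAheadScore, int SplatCost) {
  if (LookAheadScore == 0)
    return 0;
  if (LookAheadScore <= SplatCost)
    return 1;
  return LookAheadScore - SplatCost;
}
```

The clamp to 1 keeps a splat-like candidate distinguishable from a failed match, which is the point of the "avoid setting failed state" comment in the diff.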
[39 lines elided]	unsigned getBestLaneToStartReordering() const {
} else if (NumFreeOpsHash.NumOfAPOs == Min &&		} else if (NumFreeOpsHash.NumOfAPOs == Min &&
NumFreeOpsHash.NumOpsWithSameOpcodeParent < SameOpNumber) {		NumFreeOpsHash.NumOpsWithSameOpcodeParent < SameOpNumber) {
// Select the most optimal lane in terms of number of operands that		// Select the most optimal lane in terms of number of operands that
// should be moved around.		// should be moved around.
SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent;		SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent;
HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane);		HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane);
} else if (NumFreeOpsHash.NumOfAPOs == Min &&		} else if (NumFreeOpsHash.NumOfAPOs == Min &&
NumFreeOpsHash.NumOpsWithSameOpcodeParent == SameOpNumber) {		NumFreeOpsHash.NumOpsWithSameOpcodeParent == SameOpNumber) {
++HashMap[NumFreeOpsHash.Hash].first;		auto It = HashMap.find(NumFreeOpsHash.Hash);
		if (It == HashMap.end())
		HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane);
		else
		++It->second.first;
}		}
}		}
// Select the lane with the minimum counter.		// Select the lane with the minimum counter.
unsigned BestLane = 0;		unsigned BestLane = 0;
unsigned CntMin = UINT_MAX;		unsigned CntMin = UINT_MAX;
for (const auto &Data : reverse(HashMap)) {		for (const auto &Data : reverse(HashMap)) {
if (Data.second.first < CntMin) {		if (Data.second.first < CntMin) {
CntMin = Data.second.first;		CntMin = Data.second.first;
[146 lines elided]	bool shouldBroadcast(Value *Op, unsigned OpIdx, unsigned Lane) {
return false;		return false;
}		}
return true;		return true;
}		}
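
The `HashMap` change in `getBestLaneToStartReordering()` earlier in this hunk replaces a bare subscript increment with a find-then-insert. The sketch below shows the difference using `std::map` as an assumed stand-in for the container used in the pass; `LaneCounter`, `bumpWithSubscript` and `bumpWithFind` are hypothetical names. Presumably the motivation is that a subscript on a missing key value-initializes the pair, so the recorded lane silently becomes 0.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hash -> (number of lanes with this hash, representative lane).
using LaneCounter = std::map<unsigned, std::pair<unsigned, unsigned>>;

// Old form: a missing key is created as (0, 0) and then bumped to (1, 0),
// so the lane that introduced the hash is lost.
void bumpWithSubscript(LaneCounter &M, unsigned Hash) { ++M[Hash].first; }

// New form: a missing key is created with the current lane; an existing
// key only has its counter incremented.
void bumpWithFind(LaneCounter &M, unsigned Hash, unsigned Lane) {
  auto It = M.find(Hash);
  if (It == M.end())
    M[Hash] = std::make_pair(1u, Lane);
  else
    ++It->second.first;
}
```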

public:		public:
/// Initialize with all the operands of the instruction vector \p RootVL.		/// Initialize with all the operands of the instruction vector \p RootVL.
VLOperands(ArrayRef<Value *> RootVL, const DataLayout &DL,		VLOperands(ArrayRef<Value *> RootVL, const DataLayout &DL,
ScalarEvolution &SE, const BoUpSLP &R)		ScalarEvolution &SE)
: DL(DL), SE(SE), R(R) {		: DL(DL), SE(SE) {
// Append all the operands of RootVL.		// Append all the operands of RootVL.
appendOperandsOfVL(RootVL);		appendOperandsOfVL(RootVL);
}		}

/// \Returns a value vector with the operands across all lanes for the		/// \Returns a value vector with the operands across all lanes for the
/// operand at \p OpIdx.		/// operand at \p OpIdx.
ValueList getVL(unsigned OpIdx) const {		ValueList getVL(unsigned OpIdx) const {
ValueList OpVL(OpsVec[OpIdx].size());		ValueList OpVL(OpsVec[OpIdx].size());
[89 lines elided]	void reorder() {
// Skip the second pass if the first pass did not fail.		// Skip the second pass if the first pass did not fail.
bool StrategyFailed = false;		bool StrategyFailed = false;
// Mark all operand data as free to use.		// Mark all operand data as free to use.
clearUsed();		clearUsed();
// We keep the original operand order for the FirstLane, so reorder the		// We keep the original operand order for the FirstLane, so reorder the
// rest of the lanes. We are visiting the nodes in a circular fashion,		// rest of the lanes. We are visiting the nodes in a circular fashion,
// using FirstLane as the center point and increasing the radius		// using FirstLane as the center point and increasing the radius
// distance.		// distance.
		SmallVector<SmallVector<Value *, 2>> MainAltOps(NumOperands);
		for (unsigned I = 0; I < NumOperands; ++I)
		MainAltOps[I].push_back(getData(I, FirstLane).V);

for (unsigned Distance = 1; Distance != NumLanes; ++Distance) {		for (unsigned Distance = 1; Distance != NumLanes; ++Distance) {
// Visit the lane on the right and then the lane on the left.		// Visit the lane on the right and then the lane on the left.
for (int Direction : {+1, -1}) {		for (int Direction : {+1, -1}) {
int Lane = FirstLane + Direction * Distance;		int Lane = FirstLane + Direction * Distance;
if (Lane < 0 || Lane >= (int)NumLanes)		if (Lane < 0 || Lane >= (int)NumLanes)
continue;		continue;
int LastLane = Lane - Direction;		int LastLane = Lane - Direction;
assert(LastLane >= 0 && LastLane < (int)NumLanes &&		assert(LastLane >= 0 && LastLane < (int)NumLanes &&
"Out of bounds");		"Out of bounds");
// Look for a good match for each operand.		// Look for a good match for each operand.
for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx) {		for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx) {
// Search for the operand that matches SortedOps[OpIdx][Lane-1].		// Search for the operand that matches SortedOps[OpIdx][Lane-1].
Optional<unsigned> BestIdx =		Optional<unsigned> BestIdx = getBestOperand(
getBestOperand(OpIdx, Lane, LastLane, ReorderingModes);		OpIdx, Lane, LastLane, ReorderingModes, MainAltOps[OpIdx]);
// By not selecting a value, we allow the operands that follow to		// By not selecting a value, we allow the operands that follow to
// select a better matching value. We will get a non-null value in		// select a better matching value. We will get a non-null value in
// the next run of getBestOperand().		// the next run of getBestOperand().
if (BestIdx) {		if (BestIdx) {
// Swap the current operand with the one returned by		// Swap the current operand with the one returned by
// getBestOperand().		// getBestOperand().
swap(OpIdx, BestIdx.getValue(), Lane);		swap(OpIdx, BestIdx.getValue(), Lane);
} else {		} else {
// We failed to find a best operand, set mode to 'Failed'.		// We failed to find a best operand, set mode to 'Failed'.
ReorderingModes[OpIdx] = ReorderingMode::Failed;		ReorderingModes[OpIdx] = ReorderingMode::Failed;
// Enable the second pass.		// Enable the second pass.
StrategyFailed = true;		StrategyFailed = true;
}		}
		// Try to get the alternate opcode and follow it during analysis.
		if (MainAltOps[OpIdx].size() != 2) {
		OperandData &AltOp = getData(OpIdx, Lane);
		InstructionsState OpS =
		getSameOpcode({MainAltOps[OpIdx].front(), AltOp.V});
		if (OpS.getOpcode() && OpS.isAltShuffle())
		MainAltOps[OpIdx].push_back(AltOp.V);
		}
}		}
}		}
}		}
// Skip second pass if the strategy did not fail.		// Skip second pass if the strategy did not fail.
if (!StrategyFailed)		if (!StrategyFailed)
break;		break;
}		}
}		}
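
The circular visiting order used by `reorder()` above (start at `FirstLane`, then widen the distance, right neighbour before left) can be sketched in isolation. `laneVisitOrder` is a hypothetical helper written for this illustration; it only reproduces the loop structure and returns `(Lane, LastLane)` pairs, without any operand matching.

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>
#include <vector>

// Returns the (Lane, LastLane) pairs in the order the reordering loop
// would visit them, where LastLane is the already-processed neighbour
// that Lane is matched against.
std::vector<std::pair<int, int>> laneVisitOrder(int FirstLane, int NumLanes) {
  std::vector<std::pair<int, int>> Order;
  for (int Distance = 1; Distance < NumLanes; ++Distance) {
    // Visit the lane on the right and then the lane on the left.
    for (int Direction : {+1, -1}) {
      int Lane = FirstLane + Direction * Distance;
      if (Lane < 0 || Lane >= NumLanes)
        continue; // Outside the bundle on this side.
      int LastLane = Lane - Direction; // One step closer to FirstLane.
      Order.emplace_back(Lane, LastLane);
    }
  }
  return Order;
}
```

For `FirstLane = 1` and four lanes this yields lane 2 (matched against 1), lane 0 (matched against 1), then lane 3 (matched against 2).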
[1,321 lines elided]	void BoUpSLP::reorderTopToBottom() {
DenseMap<const TreeEntry *, OrdersType> GathersToOrders;		DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
// Find all reorderable nodes with the given VF.		// Find all reorderable nodes with the given VF.
// Currently these are vectorized stores, loads, extracts + some gathering of		// Currently these are vectorized stores, loads, extracts + some gathering of
// extracts.		// extracts.
for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](		for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
const std::unique_ptr<TreeEntry> &TE) {		const std::unique_ptr<TreeEntry> &TE) {
if (Optional<OrdersType> CurrentOrder =		if (Optional<OrdersType> CurrentOrder =
getReorderingData(TE.get(), /*TopToBottom=*/true)) {		getReorderingData(TE.get(), /*TopToBottom=*/true)) {
		// Do not include ordering for nodes used in the alt opcode vectorization,
		// better to reorder them during bottom-to-top stage.
		vporpo: Why is it better to do this bottom-up? Could this be a separate patch?
		ABataev (author): Yes, see D116740.
		SmallVector<const TreeEntry *, 1> Worklist(1, TE.get());
		unsigned Cnt = 0;
		while (!Worklist.empty() && Cnt < RecursionMaxDepth) {
		const TreeEntry *UserTE = Worklist.pop_back_val();
		if (UserTE->UserTreeIndices.size() != 1)
		break;
		if (all_of(UserTE->UserTreeIndices, [](const EdgeInfo &EI) {
		return EI.UserTE->State == TreeEntry::Vectorize &&
		EI.UserTE->isAltShuffle() && EI.UserTE->Idx != 0;
		}))
		return;
		Worklist.push_back(UserTE->UserTreeIndices.back().UserTE);
		++Cnt;
		}
VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());		VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
if (TE->State != TreeEntry::Vectorize)		if (TE->State != TreeEntry::Vectorize)
GathersToOrders.try_emplace(TE.get(), *CurrentOrder);		GathersToOrders.try_emplace(TE.get(), *CurrentOrder);
}		}
});		});

// Reorder the graph nodes according to their vectorization factor.		// Reorder the graph nodes according to their vectorization factor.
for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;		for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
[2,932 lines elided]
void BoUpSLP::reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,		void BoUpSLP::reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right,		SmallVectorImpl<Value *> &Right,
const DataLayout &DL,		const DataLayout &DL,
ScalarEvolution &SE,		ScalarEvolution &SE,
const BoUpSLP &R) {		const BoUpSLP &R) {
if (VL.empty())		if (VL.empty())
return;		return;
VLOperands Ops(VL, DL, SE, R);		VLOperands Ops(VL, DL, SE);
// Reorder the operands in place.		// Reorder the operands in place.
Ops.reorder();		Ops.reorder();
Left = Ops.getVL(0);		Left = Ops.getVL(0);
Right = Ops.getVL(1);		Right = Ops.getVL(1);
}		}

void BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {		void BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
// Get the basic block this bundle is in. All instructions in the bundle		// Get the basic block this bundle is in. All instructions in the bundle
[2,393 lines elided]	for (unsigned I = NextInst; I < MaxInst; ++I) {

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");

R.buildTree(Ops);		R.buildTree(Ops);
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;
R.reorderTopToBottom();		R.reorderTopToBottom();
R.reorderBottomToTop();		// TODO: add support for more kinds of the instructions here.
		R.reorderBottomToTop(
		all_of(Ops, [](Value *V) { return isa<PHINode>(V); }));
R.buildExternalUses();		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
[1,814 lines elided]

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

[62 lines elided]	;
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 2, i32 7>
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: ret <4 x i32> [[TMP5]]		; CHECK-NEXT: ret <4 x i32> [[TMP5]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
[86 lines elided]

define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_3_binops(		; CHECK-LABEL: @build_vec_v4i32_3_binops(
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1>		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i32> [[SHUFFLE]], [[TMP6]]
; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]		; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: ret <4 x i32> [[TMP3_31]]		; CHECK-NEXT: ret <4 x i32> [[TMP3_31]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
[13 lines elided]	;
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 6, i32 3>
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: [[TMP6:%.*]] = lshr <4 x i32> [[TMP5]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP6:%.*]] = lshr <4 x i32> [[TMP5]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP7:%.*]] = and <4 x i32> [[TMP6]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP7:%.*]] = and <4 x i32> [[TMP6]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP8:%.*]] = mul nuw <4 x i32> [[TMP7]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP8:%.*]] = mul nuw <4 x i32> [[TMP7]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = xor <4 x i32> [[TMP9]], [[TMP8]]		; CHECK-NEXT: [[TMP10:%.*]] = xor <4 x i32> [[TMP9]], [[TMP8]]
; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP10]])		; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP10]])
; CHECK-NEXT: ret i32 [[TMP11]]		; CHECK-NEXT: ret i32 [[TMP11]]
[46 lines elided]

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

[62 lines elided]	;
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 2, i32 7>
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: ret <4 x i32> [[TMP5]]		; CHECK-NEXT: ret <4 x i32> [[TMP5]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
[86 lines elided]

define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_3_binops(		; CHECK-LABEL: @build_vec_v4i32_3_binops(
; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1>		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i32> [[SHUFFLE]], [[TMP6]]
; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]		; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: ret <4 x i32> [[TMP3_31]]		; CHECK-NEXT: ret <4 x i32> [[TMP3_31]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
[13 lines elided]	;
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 6, i32 3>
; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], [[TMP3]]
; CHECK-NEXT: [[TMP6:%.*]] = lshr <4 x i32> [[TMP5]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP6:%.*]] = lshr <4 x i32> [[TMP5]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP7:%.*]] = and <4 x i32> [[TMP6]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP7:%.*]] = and <4 x i32> [[TMP6]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP8:%.*]] = mul nuw <4 x i32> [[TMP7]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP8:%.*]] = mul nuw <4 x i32> [[TMP7]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = xor <4 x i32> [[TMP9]], [[TMP8]]		; CHECK-NEXT: [[TMP10:%.*]] = xor <4 x i32> [[TMP9]], [[TMP8]]
; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP10]])		; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP10]])
; CHECK-NEXT: ret i32 [[TMP11]]		; CHECK-NEXT: ret i32 [[TMP11]]

llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -slp-min-tree-size=2 -slp-threshold=-1000 -slp-max-look-ahead-depth=1 -slp-look-ahead-users-budget=1 -slp-schedule-budget=27 -S -mtriple=x86_64-unknown-linux-gnu | FileCheck %s			; RUN: opt < %s -slp-vectorizer -slp-min-tree-size=2 -slp-threshold=-1000 -slp-max-look-ahead-depth=1 -slp-schedule-budget=27 -S -mtriple=x86_64-unknown-linux-gnu | FileCheck %s

	define void @exceed(double %0, double %1) {			define void @exceed(double %0, double %1) {
	; CHECK-LABEL: @exceed(			; CHECK-LABEL: @exceed(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[TMP0:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[TMP0:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP0]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[TMP1:%.*]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[TMP1:%.*]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[TMP1]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[TMP1]], i32 1
	; CHECK-NEXT: [[IXX22:%.*]] = fsub double undef, undef			; CHECK-NEXT: [[IXX22:%.*]] = fsub double undef, undef
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP6]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP6]], i32 0
	; CHECK-NEXT: [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]			; CHECK-NEXT: [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP1]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]]			; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]]
	; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], [[TMP11]]			; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], [[TMP11]]
	; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef			; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> poison, double [[TMP1]], i32 1			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> [[TMP13]], double [[TMP7]], i32 0			; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[TMP7]], i32 0
	; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP14]], undef			; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP13]], [[TMP14]]
	; CHECK-NEXT: switch i32 undef, label [[BB1:%.*]] [			; CHECK-NEXT: switch i32 undef, label [[BB1:%.*]] [
	; CHECK-NEXT: i32 0, label [[BB2:%.*]]			; CHECK-NEXT: i32 0, label [[BB2:%.*]]
	; CHECK-NEXT: ]			; CHECK-NEXT: ]
	; CHECK: bb1:			; CHECK: bb1:
	; CHECK-NEXT: br label [[LABEL:%.*]]			; CHECK-NEXT: br label [[LABEL:%.*]]
	; CHECK: bb2:			; CHECK: bb2:
	; CHECK-NEXT: br label [[LABEL]]			; CHECK-NEXT: br label [[LABEL]]
	; CHECK: label:			; CHECK: label:

llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll

	; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[X]] to <2 x float>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[X]] to <2 x float>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 16			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 16
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 1>
	; CHECK-NEXT: [[TMP3:%.*]] = load float, float* undef, align 4			; CHECK-NEXT: [[TMP3:%.*]] = load float, float* undef, align 4
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> poison, float [[TMP0]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> poison, float [[TMP0]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x float> [[TMP4]], float [[TMP3]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x float> [[TMP4]], float [[TMP3]], i32 1
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float> [[TMP5]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 1, i32 undef>			; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float> [[TMP5]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 1, i32 undef>
	; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x float> [[SHUFFLE]], [[SHUFFLE1]]			; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x float> [[SHUFFLE]], [[SHUFFLE1]]
	; CHECK-NEXT: [[TMP7:%.*]] = fadd <4 x float> poison, [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = fadd <4 x float> [[TMP6]], poison
	; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], poison			; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], poison
	; CHECK-NEXT: [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], poison			; CHECK-NEXT: [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], poison
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[TMP9]], i32 0			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[TMP9]], i32 0
	; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0			; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP9]], i32 1			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP9]], i32 1
	; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float [[TMP11]], i32 1			; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float [[TMP11]], i32 1
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP9]], i32 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP9]], i32 2
	; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0			; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0

llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll

	; CHECK-NEXT: [[ADD78_2:%.*]] = add nsw i32 undef, undef			; CHECK-NEXT: [[ADD78_2:%.*]] = add nsw i32 undef, undef
	; CHECK-NEXT: [[SUB102_3:%.*]] = sub nsw i32 undef, undef			; CHECK-NEXT: [[SUB102_3:%.*]] = sub nsw i32 undef, undef
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <16 x i32> poison, i32 [[SUB102_3]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <16 x i32> poison, i32 [[SUB102_3]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <16 x i32> [[TMP0]], i32 [[SUB102_1]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <16 x i32> [[TMP0]], i32 [[SUB102_1]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <16 x i32> [[TMP1]], i32 [[ADD94_1]], i32 2			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <16 x i32> [[TMP1]], i32 [[ADD94_1]], i32 2
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <16 x i32> [[TMP2]], i32 [[ADD78_1]], i32 3			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <16 x i32> [[TMP2]], i32 [[ADD78_1]], i32 3
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <16 x i32> [[TMP3]], i32 [[SUB86_1]], i32 4			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <16 x i32> [[TMP3]], i32 [[SUB86_1]], i32 4
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <16 x i32> [[TMP4]], i32 [[ADD78_2]], i32 5			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <16 x i32> [[TMP4]], i32 [[ADD78_2]], i32 5
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i32> [[TMP5]], <16 x i32> poison, <16 x i32> <i32 0, i32 undef, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 2, i32 3, i32 4, i32 undef, i32 5, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i32> [[TMP5]], <16 x i32> poison, <16 x i32> <i32 0, i32 undef, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 2, i32 3, i32 4, i32 5, i32 5, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <16 x i32> poison, i32 [[SUB86_1]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <16 x i32> poison, i32 [[SUB86_1]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <16 x i32> [[TMP6]], i32 [[ADD78_1]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <16 x i32> [[TMP6]], i32 [[ADD78_1]], i32 1
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <16 x i32> [[TMP7]], i32 [[ADD94_1]], i32 2			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <16 x i32> [[TMP7]], i32 [[ADD94_1]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <16 x i32> [[TMP8]], i32 [[SUB102_1]], i32 3			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <16 x i32> [[TMP8]], i32 [[SUB102_1]], i32 3
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <16 x i32> [[TMP9]], i32 [[ADD78_2]], i32 4			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <16 x i32> [[TMP9]], i32 [[SUB102_3]], i32 4
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <16 x i32> [[TMP10]], i32 [[SUB102_3]], i32 5			; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i32> [[TMP10]], <16 x i32> poison, <16 x i32> <i32 undef, i32 undef, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 4>
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i32> [[TMP11]], <16 x i32> poison, <16 x i32> <i32 undef, i32 undef, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 2, i32 3, i32 4, i32 undef, i32 undef, i32 undef, i32 undef, i32 5>			; CHECK-NEXT: [[TMP11:%.*]] = add nsw <16 x i32> [[SHUFFLE]], [[SHUFFLE1]]
	; CHECK-NEXT: [[TMP12:%.*]] = add nsw <16 x i32> [[SHUFFLE]], [[SHUFFLE1]]			; CHECK-NEXT: [[TMP12:%.*]] = sub nsw <16 x i32> [[SHUFFLE]], [[SHUFFLE1]]
	; CHECK-NEXT: [[TMP13:%.*]] = sub nsw <16 x i32> [[SHUFFLE]], [[SHUFFLE1]]			; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <16 x i32> [[TMP11]], <16 x i32> [[TMP12]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 21, i32 22, i32 7, i32 24, i32 25, i32 10, i32 27, i32 28, i32 13, i32 30, i32 31>
	; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <16 x i32> [[TMP12]], <16 x i32> [[TMP13]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 21, i32 22, i32 7, i32 24, i32 25, i32 10, i32 27, i32 28, i32 13, i32 30, i32 31>			; CHECK-NEXT: [[TMP14:%.*]] = lshr <16 x i32> [[TMP13]], <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15>
	; CHECK-NEXT: [[TMP15:%.*]] = lshr <16 x i32> [[TMP14]], <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15>			; CHECK-NEXT: [[TMP15:%.*]] = and <16 x i32> [[TMP14]], <i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537>
	; CHECK-NEXT: [[TMP16:%.*]] = and <16 x i32> [[TMP15]], <i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537, i32 65537>			; CHECK-NEXT: [[TMP16:%.*]] = mul nuw <16 x i32> [[TMP15]], <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
	; CHECK-NEXT: [[TMP17:%.*]] = mul nuw <16 x i32> [[TMP16]], <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>			; CHECK-NEXT: [[TMP17:%.*]] = add <16 x i32> [[TMP16]], [[TMP13]]
	; CHECK-NEXT: [[TMP18:%.*]] = add <16 x i32> [[TMP17]], [[TMP14]]			; CHECK-NEXT: [[TMP18:%.*]] = xor <16 x i32> [[TMP17]], [[TMP16]]
	; CHECK-NEXT: [[TMP19:%.*]] = xor <16 x i32> [[TMP18]], [[TMP17]]			; CHECK-NEXT: [[TMP19:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP18]])
	; CHECK-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP19]])			; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[TMP19]], 16
	; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[TMP20]], 16
	; CHECK-NEXT: [[ADD119:%.*]] = add nuw nsw i32 undef, [[SHR]]			; CHECK-NEXT: [[ADD119:%.*]] = add nuw nsw i32 undef, [[SHR]]
	; CHECK-NEXT: [[SHR120:%.*]] = lshr i32 [[ADD119]], 1			; CHECK-NEXT: [[SHR120:%.*]] = lshr i32 [[ADD119]], 1
	; CHECK-NEXT: ret i32 [[SHR120]]			; CHECK-NEXT: ret i32 [[SHR120]]
	;			;
	entry:			entry:
	%add103 = add nsw i32 undef, undef			%add103 = add nsw i32 undef, undef
	%sub104 = sub nsw i32 undef, undef			%sub104 = sub nsw i32 undef, undef
	%add105 = add nsw i32 undef, undef			%add105 = add nsw i32 undef, undef

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll

	; CHECK-LABEL: @opcode_reorder(			; CHECK-LABEL: @opcode_reorder(
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[B:%.*]] to <4 x float>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[B:%.*]] to <4 x float>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast float* [[C:%.*]] to <4 x float>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast float* [[C:%.*]] to <4 x float>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <4 x float>, <4 x float>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x float>, <4 x float>* [[TMP3]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = fadd <4 x float> [[TMP2]], [[TMP4]]			; CHECK-NEXT: [[TMP5:%.*]] = fadd <4 x float> [[TMP2]], [[TMP4]]
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[D:%.*]] to <4 x float>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[D:%.*]] to <4 x float>*
	; CHECK-NEXT: [[TMP7:%.*]] = load <4 x float>, <4 x float>* [[TMP6]], align 4			; CHECK-NEXT: [[TMP7:%.*]] = load <4 x float>, <4 x float>* [[TMP6]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP5]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], [[TMP5]]
	; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[A:%.*]] to <4 x float>*			; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[A:%.*]] to <4 x float>*
	; CHECK-NEXT: store <4 x float> [[TMP8]], <4 x float>* [[TMP9]], align 4			; CHECK-NEXT: store <4 x float> [[TMP8]], <4 x float>* [[TMP9]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	%1 = load float, float* %b			%1 = load float, float* %b
	%2 = load float, float* %c			%2 = load float, float* %c
	%3 = fadd float %1, %2			%3 = fadd float %1, %2
	%4 = load float, float* %d			%4 = load float, float* %d

llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias-inseltpoison.ll

	; CHECK-NEXT: [[T40:%.*]] = mul nsw i32 [[T39]], 9633			; CHECK-NEXT: [[T40:%.*]] = mul nsw i32 [[T39]], 9633
	; CHECK-NEXT: [[T41:%.*]] = mul nsw i32 [[T25]], 2446			; CHECK-NEXT: [[T41:%.*]] = mul nsw i32 [[T25]], 2446
	; CHECK-NEXT: [[T42:%.*]] = mul nsw i32 [[T17]], 16819			; CHECK-NEXT: [[T42:%.*]] = mul nsw i32 [[T17]], 16819
	; CHECK-NEXT: [[T47:%.*]] = mul nsw i32 [[T37]], -16069			; CHECK-NEXT: [[T47:%.*]] = mul nsw i32 [[T37]], -16069
	; CHECK-NEXT: [[T48:%.*]] = mul nsw i32 [[T38]], -3196			; CHECK-NEXT: [[T48:%.*]] = mul nsw i32 [[T38]], -3196
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T15]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T15]], i32 0
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T40]], i32 1			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T40]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T27]], i32 2			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T27]], i32 2
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T40]], i32 3			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T47]], i32 3
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> <i32 poison, i32 poison, i32 6270, i32 poison>, i32 [[T9]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> <i32 poison, i32 poison, i32 6270, i32 poison>, i32 [[T9]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[T48]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[T48]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T47]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T40]], i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>			; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>
	; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0
	; CHECK-NEXT: [[T69:%.*]] = insertelement <8 x i32> [[TMP11]], i32 [[TMP12]], i32 4			; CHECK-NEXT: [[T69:%.*]] = insertelement <8 x i32> [[TMP11]], i32 [[TMP12]], i32 4
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1
	; CHECK-NEXT: [[T70:%.*]] = insertelement <8 x i32> [[T69]], i32 [[TMP13]], i32 5			; CHECK-NEXT: [[T70:%.*]] = insertelement <8 x i32> [[T69]], i32 [[TMP13]], i32 5

llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias.ll

	; CHECK-NEXT: [[T40:%.*]] = mul nsw i32 [[T39]], 9633			; CHECK-NEXT: [[T40:%.*]] = mul nsw i32 [[T39]], 9633
	; CHECK-NEXT: [[T41:%.*]] = mul nsw i32 [[T25]], 2446			; CHECK-NEXT: [[T41:%.*]] = mul nsw i32 [[T25]], 2446
	; CHECK-NEXT: [[T42:%.*]] = mul nsw i32 [[T17]], 16819			; CHECK-NEXT: [[T42:%.*]] = mul nsw i32 [[T17]], 16819
	; CHECK-NEXT: [[T47:%.*]] = mul nsw i32 [[T37]], -16069			; CHECK-NEXT: [[T47:%.*]] = mul nsw i32 [[T37]], -16069
	; CHECK-NEXT: [[T48:%.*]] = mul nsw i32 [[T38]], -3196			; CHECK-NEXT: [[T48:%.*]] = mul nsw i32 [[T38]], -3196
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T15]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T15]], i32 0
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T40]], i32 1			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T40]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T27]], i32 2			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T27]], i32 2
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T40]], i32 3			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T47]], i32 3
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> <i32 poison, i32 poison, i32 6270, i32 poison>, i32 [[T9]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> <i32 poison, i32 poison, i32 6270, i32 poison>, i32 [[T9]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[T48]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[T48]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T47]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T40]], i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>			; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>
	; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0
	; CHECK-NEXT: [[T69:%.*]] = insertelement <8 x i32> [[TMP11]], i32 [[TMP12]], i32 4			; CHECK-NEXT: [[T69:%.*]] = insertelement <8 x i32> [[TMP11]], i32 [[TMP12]], i32 4
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1
	; CHECK-NEXT: [[T70:%.*]] = insertelement <8 x i32> [[T69]], i32 [[TMP13]], i32 5			; CHECK-NEXT: [[T70:%.*]] = insertelement <8 x i32> [[T69]], i32 [[TMP13]], i32 5