This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Excluded external uses from the reordering estimation.
ClosedPublic

Authored by ABataev on Jan 5 2022, 12:16 PM.

Details

Summary

The compiler adds an estimation for external uses during the operand
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than a diamond/shuffled match in the graph. This changes the
sizes of the vector operands and may prevent some vectorization. We don't
need this kind of estimation for the analysis phase, because we just need
to choose the most compatible instruction; it does not matter whether it
has an external user or is used in a non-matching lane. Instead, we count
the number of unique instructions in the lane and check whether the
reassociation makes the number of unique scalars a power of 2. Having a
power-of-2 number of unique scalars in the lane is considered more
profitable than having a non-power-of-2 number.
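The power-of-2 profitability check described above can be sketched as follows. This is an illustrative model, not the actual SLPVectorizer code; the function names and the use of plain strings for scalars are assumptions made for the example (it also mirrors the `UniquesCountWithV` expression suggested later in the review):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// A power-of-2 count of unique scalars can be vectorized directly,
// so it is considered the more profitable configuration.
static bool isPowerOf2(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }

// Hypothetical helper: returns true if adding candidate V to the lane
// leaves the number of unique scalars a power of 2.
bool isProfitableUniqueCount(const std::vector<std::string> &Lane,
                             const std::string &V) {
  std::unordered_set<std::string> Uniques(Lane.begin(), Lane.end());
  unsigned UniquesCount = Uniques.size();
  // Only bump the count if V is not already one of the lane's scalars.
  unsigned UniquesCountWithV =
      Uniques.count(V) ? UniquesCount : UniquesCount + 1;
  return isPowerOf2(UniquesCountWithV);
}
```

For example, a lane with three unique scalars becomes profitable when a fourth, distinct scalar is chosen (4 is a power of 2), but stays unprofitable when a duplicate keeps the unique count at 3.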

Metric: SLP.NumVectorInstructions

Program results results0 diff

        test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test   70.00   86.00   22.9%
                    test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4527.00 4630.00    2.3%
           test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test  346.00  353.00    2.0%
          test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  346.00  353.00    2.0%
     test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9275.00    1.9%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  235.00  239.00    1.7%
       test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  235.00  239.00    1.7%
   test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8737.00 8859.00    1.4%
               test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00    1.2%
        test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00    1.1%
       test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00    1.1%
  test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00    0.3%
   test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00    0.3%
     test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4240.00 4250.00    0.2%
            test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00    0.1%
               test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00    0.1%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  783.00  782.00   -0.1%
                    test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   69.00   68.00   -1.4%
      test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  207.00  192.00   -7.2%
       test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  207.00  192.00   -7.2%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   89.00   80.00  -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  89.00   80.00  -10.1%
     test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  260.00  215.00  -17.3%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  256.00  211.00  -17.6%

MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - practically the same.
SingleSource/Benchmarks/Misc/oourafft.test - two <2 x> loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - a function gets vectorized and
is no longer inlined.
External/SPEC/CINT2017rate/541.leela_r - same.
External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in a
multi-block tree; the result is practically the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.

Diff Detail

Event Timeline

ABataev created this revision.Jan 5 2022, 12:16 PM
ABataev requested review of this revision.Jan 5 2022, 12:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 5 2022, 12:16 PM
vporpo added a comment.Jan 5 2022, 6:32 PM

I don't fully understand why you are completely removing the cost of external uses. It models the additional extract instructions that will be needed if we decide to proceed with that specific operand order. This is useful when we have to break ties, for example when choosing between instructions of the same opcode, one with external uses and the other without: in that case we would prefer to vectorize the instructions without external uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) instead of being used only as a tie breaker when the costs are the same.
If I understand correctly, the splat cost is orthogonal to the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1216–1217

Please explain in the comments why it is more profitable.

1217

Could you also add to the comment what OpIdx and Idx are.

1219

This needs a better name because it gets a bit confusing later on when you introduce OpV. Perhaps call this OpIdxV and the other one IdxV ?

1223

Could you rename I to Ln ?

1238–1240

Perhaps replace these lines with:
int UniquesCountWithV = Uniques.contains(V) ? UniquesCount : UniquesCount + 1;

1242–1244

same

1399

Could you add to the comment what each unsigned in the pair is? I think it is the operand index and the lane.

1466

Why skip if score is 0 ?

1467

Is it Score or Cost? Score usually suggests the higher the better (and Cost the lower the better).

3178

Why is it better to do this bottom-up ?
Could this be a separate patch?

I don't fully understand why you are completely removing the cost of external uses. It models the additional extract instructions that will be needed if we decide to proceed with that specific operand order. This is useful when we have to break ties, for example when choosing between instructions of the same opcode, one with external uses and the other without: in that case we would prefer to vectorize the instructions without external uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) instead of being used only as a tie breaker when the costs are the same.
If I understand correctly, the splat cost is orthogonal to the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

Sure, I will split the patch (already did it for the first part).

There are several reasons.

  1. External uses. An external use will affect the cost in any case (no matter in which lane the scalar is used), so we can just ignore it.
  2. In-tree uses (uses in the graph). Currently this is too pessimistic. We subtract 1 for each such use, so the total cost of the lane becomes NumLanes less, though in the worst case it should only be 1-2 less, because we end up with a single shuffle rather than a gather of extracts. It is better to ignore all these uses and just count the number of unique scalars to see if we can vectorize the lane after removing the reused scalars.
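The pessimism in point 2 can be illustrated with a toy cost model. The function names and exact numbers here are hypothetical, purely to contrast the two accountings described above, not the actual SLPVectorizer code:

```cpp
// Old model: every in-tree reuse subtracts 1 from the lane's score, so a
// lane with NumReuses reused scalars loses NumReuses in total.
int oldReusePenalty(int NumReuses) { return NumReuses; }

// Worst case described above: the reused scalars end up handled by a
// single shuffle, so the real overhead is roughly constant (about 1-2),
// independent of how many reuses there are.
int shuffleReusePenalty(int NumReuses) { return NumReuses > 0 ? 1 : 0; }
```

With, say, 4 reused scalars, the old model charges 4 while a single shuffle would realistically cost about 1, which is why counting unique scalars is the better signal.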
ABataev added inline comments.Jan 6 2022, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1216–1217

Do you suggest adding the explanation here?

1219

Ok

1223

Ok

1238–1240

Ok

1399

Sure, will do

1466

Score 0 means a failed match; it won't be vectorized for sure, so there is no need to check for the reused scalars.

1467

It is cost currently, will rework it to score.

3178

Yes, see D116740

I agree with your second point; it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:

%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2

While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1216–1217

Yes, please add the explanation in the comments.

I agree with your second point; it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:

%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2

While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

These 2 loads from A[1] will be combined by instcombine before SLP.

vporpo added a comment.Jan 6 2022, 1:02 PM

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

If the instructions are the same, they will be combined. If the instructions are different, it is still better to choose the instruction with the higher score, even if it is externally used (we may get a deeper graph, which is still better for vectorization). There is only one relevant case: the instructions are very similar and their scores are equal. Even if an instruction is externally used, it still might be vectorized as part of another tree. The only preference here, I believe, is between an instruction with a single use (or with all users vectorized) and an instruction with many uses. We can check for something like this and consider the instruction with a single use (or all users vectorized) a better choice than the instruction whose users are not all vectorized.
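The tie-break being agreed on here could be sketched as follows. The struct and function names are illustrative, not the patch's actual API; the point is only that use information is consulted strictly after the primary score:

```cpp
struct Candidate {
  int Score;               // primary lookahead score for the operand choice
  bool AllUsersVectorized; // every user is already in the vectorized tree
};

// Hypothetical tie-break: returns the index (0 or 1) of the preferred
// candidate. The score dominates; use information only breaks exact ties.
int pickCandidate(const Candidate &A, const Candidate &B) {
  if (A.Score != B.Score)
    return A.Score > B.Score ? 0 : 1;
  // Exact tie: prefer the candidate that will not need an extractelement.
  if (A.AllUsersVectorized != B.AllUsersVectorized)
    return A.AllUsersVectorized ? 0 : 1;
  return 0; // stable default when fully equal
}
```

This matches the discussion: a higher-scoring candidate wins even with external uses, and the all-users-vectorized check only matters when the scores are exactly equal.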

vporpo added a comment.Jan 6 2022, 4:04 PM

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?

I was going to add the analysis of the usage to the patch, have a quick prototype but it has some regressions, will work on the improvements tomorrow.

ABataev updated this revision to Diff 398233.Jan 7 2022, 2:02 PM

Added score for all vectorized users.

vporpo added inline comments.Jan 7 2022, 3:13 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1259

Could you add a comment explaining why we return ScoreAllUserVectorized for this type of instruction?

1465–1467

Since the score calculation is a bit more complicated now, I think it makes sense to move all the score calculation logic into a separate function like getScore() which will help hide all the calls to the separate score functions getLookAheadScore(), getSplatScore(), getExternalScore() and the score scaling. What do you think?

1479

nit: use a static constexpr for the scaling factor

RKSimon retitled this revision from [SLP]Excluded external uses from the reprdering estimation. to [SLP]Excluded external uses from the reordering estimation..Jan 12 2022, 2:58 AM
ABataev marked an inline comment as done.Feb 1 2022, 10:17 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1465–1467

I would keep all these smaller functions but will create getScore() for a final score.

1479

Yep, will fix it.

ABataev updated this revision to Diff 405365.Feb 2 2022, 10:55 AM

Rebase + address comments

vporpo accepted this revision.Feb 2 2022, 11:22 AM
This revision is now accepted and ready to land.Feb 2 2022, 11:22 AM
This revision was landed with ongoing or failed builds.Feb 3 2022, 6:55 AM
This revision was automatically updated to reflect the committed changes.