This is an archive of the discontinued LLVM Phabricator instance.


Differential D125287

[SLP] Improve root steering by building actual trees instead of calling the look-ahead heuristic
Needs Review · Public

Authored by vporpo on May 9 2022, 8:15 PM.


Details

Reviewers

vdmitrie
ABataev
RKSimon
dmgreen

Summary

Finding the best roots using the lookahead heuristic is not as accurate as
building short trees and comparing their cost.
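
As a rough illustration of the idea in the summary, the sketch below picks the candidate pair whose estimated cost is lowest. It is self-contained and does not use LLVM types: `ValueId`, `Candidate`, `pickCheapestRoot`, and the `EstimateCost` callback are hypothetical stand-ins, with the callback playing the role of a bounded tree build followed by a cost query.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an LLVM Value*; only identity matters here.
using ValueId = int;
using Candidate = std::pair<ValueId, ValueId>;

// Returns the index of the candidate pair with the lowest estimated cost,
// or std::nullopt if there are no candidates. EstimateCost is an assumed
// callback standing in for "build a small tree, then ask for its cost".
std::optional<std::size_t>
pickCheapestRoot(const std::vector<Candidate> &Candidates,
                 const std::function<long(const Candidate &)> &EstimateCost) {
  assert(EstimateCost && "cost callback must be callable");
  long BestCost = std::numeric_limits<long>::max();
  std::optional<std::size_t> Best;
  for (std::size_t I = 0; I < Candidates.size(); ++I) {
    long Cost = EstimateCost(Candidates[I]);
    if (Cost < BestCost) { // strict comparison: the first of equal costs wins
      BestCost = Cost;
      Best = I;
    }
  }
  return Best;
}
```

Lower cost is better here, which mirrors the convention in the diff that a negative tree cost means the tree is profitable.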

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vporpo created this revision. May 9 2022, 8:15 PM

Herald added a project: Restricted Project. May 9 2022, 8:15 PM

Herald added a subscriber: hiraditya.

vporpo requested review of this revision. May 9 2022, 8:15 PM

Herald added a project: Restricted Project. May 9 2022, 8:15 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B163623: Diff 428277. May 9 2022, 8:16 PM

This fixes a regression in SingleSource/Benchmarks/Misc/flops-5.c. Increasing the RootLookaheadMaxDepth doesn't fix the issue either. Building small trees instead of calling the lookahead heuristic seems to be more accurate in this case.

Updated checks in tests.

Harbormaster completed remote builds in B163625: Diff 428280. May 9 2022, 8:43 PM

ABataev added inline comments. May 10 2022, 4:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	I'm afraid of increasing compile time. All this stuff includes scheduling, which may take lots of time for large basic blocks.

vporpo added inline comments. May 10 2022, 8:42 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	What if we set a flag to disable scheduling for these types of fast tree estimations?

vdmitrie added inline comments. May 10 2022, 8:45 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Yep, I agree. This will be more expensive for compile time. What about combining both worlds? I mean, first try to use the lookahead heuristic to get the single best. And if we can't narrow down to just one pair, only then switch to probing via building trees. I believe it will not happen too frequently. We can also increase the lookahead depth to make it even less frequent that we need to build a vectorizable tree.
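
The two-stage proposal in the comment above can be sketched as follows. This is only an illustration under stated assumptions: `pickRootHybrid`, `Score`, and `Cost` are hypothetical names, `Score` stands in for the lookahead heuristic (higher is better), and `Cost` stands in for a tree build plus cost query (lower is better) that is consulted only when several candidates tie on the top score.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Stage 1: keep every candidate that shares the best score above FailScore.
// Stage 2: only if more than one candidate remains, break the tie by cost.
std::optional<std::size_t>
pickRootHybrid(std::size_t NumCandidates, int FailScore,
               const std::function<int(std::size_t)> &Score,
               const std::function<long(std::size_t)> &Cost) {
  assert(Score && Cost && "both callbacks must be callable");
  int BestScore = FailScore;
  std::vector<std::size_t> Top;
  for (std::size_t I = 0; I < NumCandidates; ++I) {
    int S = Score(I);
    if (S > BestScore) {
      BestScore = S;
      Top.clear();
      Top.push_back(I);
    } else if (S == BestScore && !Top.empty()) {
      Top.push_back(I);
    }
  }
  if (Top.empty())
    return std::nullopt; // nothing scored above FailScore
  if (Top.size() == 1)
    return Top.front(); // the cheap heuristic was decisive
  std::size_t Best = Top.front();
  long BestCost = Cost(Best);
  for (std::size_t K = 1; K < Top.size(); ++K) {
    long C = Cost(Top[K]);
    if (C < BestCost) {
      BestCost = C;
      Best = Top[K];
    }
  }
  return Best;
}
```

The expensive callback runs only for tied candidates, so the extra compile time is paid only when the heuristic is indecisive. It does not help when the heuristic confidently picks a single, wrong candidate.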

ABataev added inline comments. May 10 2022, 8:55 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	There is a problem with this fix: it tries to avoid/mask the problem, not fix it. The fact that LookAhead.getScoreAtLevelRec does not work here means that we're doing something wrong there or missing something. It would be good to try to improve LookAhead.getScoreAtLevelRec.

vdmitrie added inline comments. May 10 2022, 9:03 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	This problem does not actually have a perfect solution. This is a heuristic, and it will always miss something. You can improve it to fix one particular case, but eventually there will be another instance of the same problem.

ABataev added inline comments. May 10 2022, 9:08 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Yes, sure. But if the heuristic misses something, it is better to try to tweak the heuristic rather than use an actual cost/vectorization attempt and just ignore the heuristic, which exists exactly for this purpose.

vdmitrie added inline comments. May 10 2022, 9:22 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	No, "just ignore" is not what I said. We definitely should use the heuristics. But when we reach their limits, we could use more fine-grained tools.

ABataev added inline comments. May 10 2022, 9:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	I rather doubt that building a graph can be called a "fine-grained tool". This is a different tool, not intended for analysis. We can extract some functionality out of there (to a separate function/member function) and make the heuristic smarter, but not use the graph build directly. The same problem with the heuristic may exist in some other places; we need to handle them too.

vdmitrie added inline comments. May 10 2022, 10:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Cost modeling is the tool I referred to as a "fine-grained tool". We have to build a graph to run it, so it's sort of a necessary evil. In this sense, turning off the scheduler in order to use the CM as a finer-grained heuristic does not sound like a crazy idea.

ABataev added inline comments. May 10 2022, 10:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	That's what I don't agree here. Cost model is not the tool for the modelling. For modelling we have heuristic. If it is not good, need to tweak it.

vporpo added inline comments. May 10 2022, 10:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	One issue with the lookahead search is that it tries both sides of commutative operations, so it doesn't scale if we need to increase the depth. So we need a different tool for testing deeper trees. I agree that there may be something that the lookahead heuristic is missing here, but I would argue that it is the wrong tool for the job. The buildTree() logic is much more accurate for this. Reusing the existing buildTree logic with some restrictions (e.g., limiting size and disabling scheduling) seems like a good compromise to me.
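
To make the scaling concern above concrete, here is a small back-of-the-envelope model. It is an assumption for illustration, not a measurement of the pass: every node is treated as a binary commutative operation, so scoring a pair of nodes recurses into each of the 2 x 2 operand pairings. `countPairingsVisited` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Number of node pairs scored when exploring to the given depth under the
// simplified model: the pair itself plus four operand pairings per level.
std::uint64_t countPairingsVisited(unsigned Depth) {
  assert(Depth <= 30 && "result would overflow 64 bits in this model");
  if (Depth == 0)
    return 1;
  return 1 + 4 * countPairingsVisited(Depth - 1);
}
```

Under this model the work grows roughly fourfold with each extra level (1, 5, 21, 85, ...), which is why raising the depth limit is an expensive way to see further.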

ABataev added inline comments. May 10 2022, 10:37 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	I would oppose that. I would not use buildTree() for estimation. If there is a part that can be used for better estimation, it is better to extract it into a separate function/class and then reuse it in the heuristic and in the actual graph building separately.

vporpo added inline comments. May 10 2022, 10:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	What is the reasoning for opposing it?

vdmitrie added inline comments. May 10 2022, 10:45 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	> That's what I don't agree here. Cost model is not the tool for the modelling. For modelling we have heuristic. If it is not good, need to tweak it.
	It would be nice if you explained why you are against using CM for selecting a candidate. Cost model, as its name suggests, is supposed to be used for modeling.

ABataev added inline comments. May 10 2022, 10:46 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Bad design decision. We have 4 stages. Analysis. Tree building. Cost estimation. Codegen. You want to make a circular dependence between Analysis and Tree building/Cost estimation. But I'm not against reusing some of the code from buildTree()/cost estimation for the analysis phase. I'm just saying that this functionality must be extracted and then reused for the analysis and for the tree building/cost estimation (if possible, to reduce maintenance burden).

ABataev added inline comments. May 10 2022, 10:47 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Because it is not modelling, it is analysis
2044–2050	I mean, you want to use it not for modelling but for the analysis

vporpo added inline comments. May 10 2022, 10:56 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	I think the high level design that you showed is not very accurate. We are actually doing multiple "tree builds" and "cost estimations" before generating code even in the current design. I don't see the "circular dependency" issue being introduced by this.

vdmitrie added inline comments. May 10 2022, 10:57 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	Distinction between the two is moot.

ABataev added inline comments. May 10 2022, 11:01 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	That's not because we mix analysis/modelling/estimation, but just because cost estimation shows that the tree is not profitable. The question is not about the number of attempts, it is about the design.
2044–2050	Weak argument.

vdmitrie added inline comments. May 10 2022, 12:21 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	What about an alternative solution, which is kind of a step back, but where buildTree+CM is not used for analysis? If the lookahead heuristic cannot find a single best, findBestRootPair returns all indices that give the maximum score. The caller then uses the approach it used before: it tries to vectorize each until it finds the first which is profitable.
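
A minimal sketch of the alternative described in the comment above, under stated assumptions: `topScoringIndices` and `firstProfitable` are hypothetical helpers, the scores are taken as precomputed, and `TryVectorize` stands in for a full vectorization attempt that reports whether it was profitable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Returns every index whose score equals the maximum, provided that maximum
// is strictly above FailScore. Hypothetical helper, not an LLVM function.
std::vector<std::size_t> topScoringIndices(const std::vector<int> &Scores,
                                           int FailScore) {
  int Best = FailScore;
  std::vector<std::size_t> Top;
  for (std::size_t I = 0; I < Scores.size(); ++I) {
    if (Scores[I] > Best) {
      Best = Scores[I];
      Top.clear(); // a new maximum invalidates the earlier ties
    }
    if (Scores[I] == Best && Best > FailScore)
      Top.push_back(I);
  }
  return Top;
}

// Caller side: attempt each tied candidate in order and stop at the first
// one that TryVectorize reports as profitable.
std::optional<std::size_t>
firstProfitable(const std::vector<std::size_t> &Tied,
                const std::function<bool(std::size_t)> &TryVectorize) {
  assert(TryVectorize && "vectorization callback must be callable");
  for (std::size_t Idx : Tied)
    if (TryVectorize(Idx))
      return Idx;
  return std::nullopt;
}
```

This keeps tree building out of the selection step, at the price of possibly several full attempts when many candidates tie.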

vporpo added inline comments. May 10 2022, 1:02 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	I don't see any strong argument against using buildTree+CostModel as long as buildTree is fast enough. The argument that this is somehow changing the design makes little sense because the pass is already following this buildTree+CostModel design. The only exception is perhaps for the lookahead search which is actually an example of a design to avoid: it is using its own custom tree-building and cost modeling, and requires special maintenance. Also the argument that we should extract some of the functionality and place it in a separate component is not very strong. Replicating similar functionality in multiple places is something that a good design should avoid. It just increases the maintenance overhead and will inevitably lead to divergence.

ABataev added inline comments. May 10 2022, 1:14 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	> What about an alternative solution, which is kind of a step back, but where buildTree+CM is not used for analysis? If the lookahead heuristic cannot find a single best, findBestRootPair returns all indices that give the maximum score. The caller then uses the approach it used before: it tries to vectorize each until it finds the first which is profitable.
	Yes, it may work as a quick solution.

vporpo added inline comments. May 10 2022, 4:38 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2044–2050	> If the lookahead heuristic cannot find a single best, findBestRootPair returns all indices that give the maximum score. The caller then uses the approach it used before: it tries to vectorize each until it finds the first which is profitable.
	That won't work because lookahead finds a single best, but it turns out to be the wrong one.

Disabled the scheduler for the fast buildtree.

I checked the compile time overhead with perf on the lit test, and it is about the same as the version before @vdmitrie's patch 88b9e46fb54c.

Harbormaster completed remote builds in B163983: Diff 428774. May 11 2022, 2:04 PM

I think that providing a buildTreeFastAndGetCost() style of function is a decent solution for these types of problems, but I guess this needs more discussion. Adding @RKSimon and @dmgreen.
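
To show the general shape of a size-bounded cost estimate, here is a toy sketch that does not use any LLVM code. `ToyNode` and `estimateCostBounded` are hypothetical, the graph is a plain vector of nodes, and the per-node costs are assumed inputs. Unlike a real tree build, the toy simply leaves nodes beyond the budget unexplored and assigns them no cost.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Toy graph node: operand indices into the same vector and a per-node cost
// (negative means profitable). Hypothetical, not an LLVM structure.
struct ToyNode {
  std::vector<std::size_t> Operands;
  long Cost;
};

// Visits nodes breadth-first from Root, stops once MaxSize nodes have been
// added (no limit if MaxSize is empty), and returns the summed cost of the
// nodes that made it into the truncated tree.
long estimateCostBounded(const std::vector<ToyNode> &Graph, std::size_t Root,
                         std::optional<unsigned> MaxSize) {
  assert(Root < Graph.size() && "root index out of range");
  std::vector<bool> Seen(Graph.size(), false);
  std::deque<std::size_t> Worklist{Root};
  Seen[Root] = true;
  long Total = 0;
  unsigned Size = 0;
  while (!Worklist.empty()) {
    if (MaxSize && Size >= *MaxSize)
      break; // budget exhausted: leave the rest unexplored
    std::size_t N = Worklist.front();
    Worklist.pop_front();
    Total += Graph[N].Cost;
    ++Size;
    for (std::size_t Op : Graph[N].Operands) {
      assert(Op < Graph.size() && "operand index out of range");
      if (!Seen[Op]) {
        Seen[Op] = true;
        Worklist.push_back(Op);
      }
    }
  }
  return Total;
}
```

A small budget keeps each probe cheap, but the estimate only reflects the nodes closest to the root, so two candidates that differ deeper in the graph can still look identical.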

vporpo added reviewers: RKSimon, dmgreen. May 16 2022, 11:13 AM

vdmitrie added inline comments. May 16 2022, 11:48 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
849	(If we finally agree with taking this path) It should probably be possible to introduce a simulation mode to BlockScheduling rather than guard each BS interface call.
905	This description update seems to be a leftover from the previous diff (i.e. not intentional).
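
The "simulation mode" suggestion for line 849 can be pictured with the toy below. It is a sketch under stated assumptions, not the LLVM `BlockScheduling` class: `MiniScheduler`, its methods, and `buildToyTree` are hypothetical. The point is only that the mode check lives inside the scheduler, so the caller contains no flag tests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a block scheduler; not the LLVM class.
class MiniScheduler {
public:
  explicit MiniScheduler(bool Simulate = false) : Simulate(Simulate) {}

  // Tries to form a bundle. In simulation mode it reports success without
  // recording anything, so callers need no mode checks of their own.
  bool tryScheduleBundle(const std::vector<int> &Instrs) {
    if (Simulate)
      return true;
    Bundles.push_back(Instrs);
    return true;
  }

  // Undoes the most recent bundle; a no-op in simulation mode.
  void cancelLastBundle() {
    if (Simulate)
      return;
    assert(!Bundles.empty() && "no bundle to cancel");
    Bundles.pop_back();
  }

  std::size_t numBundles() const { return Bundles.size(); }

private:
  bool Simulate;
  std::vector<std::vector<int>> Bundles;
};

// Caller code is identical in both modes: it never inspects the flag.
std::size_t buildToyTree(MiniScheduler &BS,
                         const std::vector<std::vector<int>> &Levels) {
  std::size_t Nodes = 0;
  for (const std::vector<int> &Level : Levels)
    if (BS.tryScheduleBundle(Level))
      ++Nodes;
  return Nodes;
}
```

Compared with wrapping each call site in a flag test, this keeps the tree-building code unchanged. The trade-off, as the warning in the diff notes for the flag-based version, is that a tree built this way is not guaranteed to be schedulable.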

Fixed stale comment and added DisableScheduling flag to BlockScheduling.

Harbormaster completed remote builds in B164735: Diff 429827. May 16 2022, 1:15 PM

ping

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	136 lines
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector-inseltpoison.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/pr48879-sroa.ll	52 lines
llvm/test/Transforms/SLPVectorizer/X86/vectorize-pair-path.ll	17 lines

Diff 429827

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


(158 lines not shown)	static cl::opt<unsigned> MinTreeSize(
cl::desc("Only vectorize small trees if they are fully vectorizable"));		cl::desc("Only vectorize small trees if they are fully vectorizable"));

// The maximum depth that the look-ahead score heuristic will explore.		// The maximum depth that the look-ahead score heuristic will explore.
// The higher this value, the higher the compilation time overhead.		// The higher this value, the higher the compilation time overhead.
static cl::opt<int> LookAheadMaxDepth(		static cl::opt<int> LookAheadMaxDepth(
"slp-max-look-ahead-depth", cl::init(2), cl::Hidden,		"slp-max-look-ahead-depth", cl::init(2), cl::Hidden,
cl::desc("The maximum look-ahead depth for operand reordering scores"));		cl::desc("The maximum look-ahead depth for operand reordering scores"));

// The maximum depth that the look-ahead score heuristic will explore		// The maximum tree size that will use when probing among candidates for
// when it probing among candidates for vectorization tree roots.		// vectorization tree roots. The higher this value, the higher the compilation
// The higher this value, the higher the compilation time overhead but unlike		// time overhead but unlike similar limit for operands ordering this is less
// similar limit for operands ordering this is less frequently used, hence		// frequently used, hence impact of higher value is less noticeable.
// impact of higher value is less noticeable.		static cl::opt<unsigned> RootLookAheadMaxSize(
static cl::opt<int> RootLookAheadMaxDepth(		"slp-root-look-ahead-max-size", cl::init(5), cl::Hidden,
"slp-max-root-look-ahead-depth", cl::init(2), cl::Hidden,		cl::desc("The maximum tree size for searching best rooting option"));
cl::desc("The maximum look-ahead depth for searching best rooting option"));

static cl::opt<bool>		static cl::opt<bool>
ViewSLPTree("view-slp-tree", cl::Hidden,		ViewSLPTree("view-slp-tree", cl::Hidden,
cl::desc("Display the SLP trees with Graphviz"));		cl::desc("Display the SLP trees with Graphviz"));

// Limit the number of alias checks. The limit is chosen so that		// Limit the number of alias checks. The limit is chosen so that
// it has no negative effect on the llvm benchmarks.		// it has no negative effect on the llvm benchmarks.
static const unsigned AliasedCheckLimit = 10;		static const unsigned AliasedCheckLimit = 10;
(651 lines not shown)
}		}

namespace slpvectorizer {		namespace slpvectorizer {

/// Bottom Up SLP Vectorizer.		/// Bottom Up SLP Vectorizer.
class BoUpSLP {		class BoUpSLP {
struct TreeEntry;		struct TreeEntry;
struct ScheduleData;		struct ScheduleData;
		/// Limit the size of the SLP tree to this many nodes.
		Optional<unsigned> MaxTreeSize;
		/// Disables scheduling. This is to limit the compilation time used by
		/// buildTree(), usually in combination with \p MaxTreeSize for quickly
		/// building approximate trees that can be used for estimating which roots are
		/// to be prefered.
		/// WARNING: If this is enabled the tree is not guaranteed to contain valid
		/// instruction bundles that can actually get codegened.
		bool ForceDisableScheduling = false;

public:		public:
using ValueList = SmallVector<Value *, 8>;		using ValueList = SmallVector<Value *, 8>;
using InstrList = SmallVector<Instruction *, 16>;		using InstrList = SmallVector<Instruction *, 16>;
using ValueSet = SmallPtrSet<Value *, 16>;		using ValueSet = SmallPtrSet<Value *, 16>;
using StoreList = SmallVector<StoreInst *, 8>;		using StoreList = SmallVector<StoreInst *, 8>;
using ExtraValueToDebugLocsMap =		using ExtraValueToDebugLocsMap =
MapVector<Value *, SmallVector<Instruction *, 2>>;		MapVector<Value *, SmallVector<Instruction *, 2>>;
(39 lines not shown)	public:
InstructionCost getSpillCost() const;		InstructionCost getSpillCost() const;

/// \returns the vectorization cost of the subtree that starts at \p VL.		/// \returns the vectorization cost of the subtree that starts at \p VL.
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);		InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

		/// Builds a tree starting from \p roots with up to \p MaxSize nodes. For
		/// faster compilation time scheduling is disabled by default, or can be
		/// enabled with \p DisableScheduling = false.
		/// This function is to be used for a look-ahead style evaluation of root
		/// nodes and estimating which ones are worth building a full tree for.
		InstructionCost buildTreeFastAndGetCost(ArrayRef<Value *> Roots,
		Optional<unsigned> MaxSize = 5,
		bool DisableScheduling = true);

/// Builds external uses of the vectorized scalars, i.e. the list of		/// Builds external uses of the vectorized scalars, i.e. the list of
/// vectorized scalars to be extracted, their lanes and their scalar users. \p		/// vectorized scalars to be extracted, their lanes and their scalar users. \p
/// ExternallyUsedValues contains additional list of external uses to handle		/// ExternallyUsedValues contains additional list of external uses to handle
/// vectorization of reductions.		/// vectorization of reductions.
void		void
buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});		buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
		ForceDisableScheduling = false;
		MaxTreeSize = None;
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
(1,096 lines not shown)	#endif
};		};

/// Evaluate each pair in \p Candidates and return index into \p Candidates		/// Evaluate each pair in \p Candidates and return index into \p Candidates
/// for a pair which have highest score deemed to have best chance to form		/// for a pair which have highest score deemed to have best chance to form
/// root of profitable tree to vectorize. Return None if no candidate scored		/// root of profitable tree to vectorize. Return None if no candidate scored
/// above the LookAheadHeuristics::ScoreFail.		/// above the LookAheadHeuristics::ScoreFail.
Optional<int>		Optional<int>
findBestRootPair(ArrayRef<std::pair<Value *, Value *>> Candidates) {		findBestRootPair(ArrayRef<std::pair<Value *, Value *>> Candidates) {
LookAheadHeuristics LookAhead(*DL, *SE, *this, /*NumLanes=*/2,		InstructionCost BestCost = InstructionCost::getMax();
RootLookAheadMaxDepth);
int BestScore = LookAheadHeuristics::ScoreFail;
Optional<int> Index = None;		Optional<int> Index = None;

for (int I : seq<int>(0, Candidates.size())) {		for (int I : seq<int>(0, Candidates.size())) {
int Score = LookAhead.getScoreAtLevelRec(Candidates[I].first,		SmallVector<Value *, 2> Roots(
Candidates[I].second,		{Candidates[I].first, Candidates[I].second});
/*U1=*/nullptr, /*U2=*/nullptr,		InstructionCost Cost =
/*Level=*/1, None);		buildTreeFastAndGetCost(Roots, RootLookAheadMaxSize.getValue());
if (Score > BestScore) {		if (Cost < BestCost) {
BestScore = Score;		BestCost = Cost;
Index = I;		Index = I;
}		}
}		}
return Index;		return Index;
}		}

/// Checks if the instruction is marked for deletion.		/// Checks if the instruction is marked for deletion.
bool isDeleted(Instruction *I) const { return DeletedInstructions.count(I); }		bool isDeleted(Instruction *I) const { return DeletedInstructions.count(I); }

(481 lines not shown)	if (ReorderIndices.empty()) {
Last->setOperations(S);		Last->setOperations(S);
Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());		Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
}		}
if (Last->State != TreeEntry::NeedToGather) {		if (Last->State != TreeEntry::NeedToGather) {
for (Value *V : VL) {		for (Value *V : VL) {
assert(!getTreeEntry(V) && "Scalar already in tree!");		assert(!getTreeEntry(V) && "Scalar already in tree!");
ScalarToTreeEntry[V] = Last;		ScalarToTreeEntry[V] = Last;
}		}
		if (!ForceDisableScheduling) {
// Update the scheduler bundle to point to this TreeEntry.		// Update the scheduler bundle to point to this TreeEntry.
ScheduleData *BundleMember = Bundle.getValue();		ScheduleData *BundleMember = Bundle.getValue();
assert((BundleMember || isa<PHINode>(S.MainOp) ||		assert((BundleMember || isa<PHINode>(S.MainOp) ||
isVectorLikeInstWithConstOps(S.MainOp) ||		isVectorLikeInstWithConstOps(S.MainOp) ||
doesNotNeedToSchedule(VL)) &&		doesNotNeedToSchedule(VL)) &&
"Bundle and VL out of sync");		"Bundle and VL out of sync");
if (BundleMember) {		if (BundleMember) {
for (Value *V : VL) {		for (Value *V : VL) {
if (doesNotNeedToBeScheduled(V))		if (doesNotNeedToBeScheduled(V))
continue;		continue;
assert(BundleMember && "Unexpected end of bundle.");		assert(BundleMember && "Unexpected end of bundle.");
BundleMember->TE = Last;		BundleMember->TE = Last;
BundleMember = BundleMember->NextInBundle;		BundleMember = BundleMember->NextInBundle;
}		}
}		}
assert(!BundleMember && "Bundle and VL out of sync");		assert(!BundleMember && "Bundle and VL out of sync");
		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}

if (UserTreeIdx.UserTE)		if (UserTreeIdx.UserTE)
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);

return Last;		return Last;
▲ Show 20 Lines • Show All 361 Lines • ▼ Show 20 Lines	struct BlockScheduling {
bool isInSchedulingRegion(ScheduleData *SD) const {		bool isInSchedulingRegion(ScheduleData *SD) const {
return SD->SchedulingRegionID == SchedulingRegionID;		return SD->SchedulingRegionID == SchedulingRegionID;
}		}

/// Marks an instruction as scheduled and puts all dependent ready		/// Marks an instruction as scheduled and puts all dependent ready
/// instructions into the ready-list.		/// instructions into the ready-list.
template <typename ReadyListType>		template <typename ReadyListType>
void schedule(ScheduleData *SD, ReadyListType &ReadyList) {		void schedule(ScheduleData *SD, ReadyListType &ReadyList) {
		assert(!DisableScheduling && "Trying to schedule when disabled");
SD->IsScheduled = true;		SD->IsScheduled = true;
LLVM_DEBUG(dbgs() << "SLP: schedule " << *SD << "\n");		LLVM_DEBUG(dbgs() << "SLP: schedule " << *SD << "\n");

for (ScheduleData *BundleMember = SD; BundleMember;		for (ScheduleData *BundleMember = SD; BundleMember;
BundleMember = BundleMember->NextInBundle) {		BundleMember = BundleMember->NextInBundle) {
if (BundleMember->Inst != BundleMember->OpValue)		if (BundleMember->Inst != BundleMember->OpValue)
continue;		continue;

▲ Show 20 Lines • Show All 143 Lines • ▼ Show 20 Lines	struct BlockScheduling {
/// cyclic dependencies. This is only a dry-run, no instructions are		/// cyclic dependencies. This is only a dry-run, no instructions are
/// actually moved at this stage.		/// actually moved at this stage.
/// \returns the scheduling bundle. The returned Optional value is non-None		/// \returns the scheduling bundle. The returned Optional value is non-None
/// if \p VL is allowed to be scheduled.		/// if \p VL is allowed to be scheduled.
Optional<ScheduleData *>		Optional<ScheduleData *>
tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,		tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
const InstructionsState &S);		const InstructionsState &S);

		/// \Returns an uninitialized ScheduleBundle. This is needed when disabling
		/// the scheduler because a non-null ScheduleData bundle is used to
		/// determine whether a TreeEntry is marked as vectorizable.
		static ScheduleData *getDummyBundle() {
		static ScheduleData SD;
		return &SD;
		}

/// Un-bundles a group of instructions.		/// Un-bundles a group of instructions.
void cancelScheduling(ArrayRef<Value *> VL, Value *OpValue);		void cancelScheduling(ArrayRef<Value *> VL, Value *OpValue);

/// Allocates schedule data chunk.		/// Allocates schedule data chunk.
ScheduleData *allocateScheduleDataChunks();		ScheduleData *allocateScheduleDataChunks();

/// Extends the scheduling region so that V is inside the region.		/// Extends the scheduling region so that V is inside the region.
/// \returns true if the region size is within the limit.		/// \returns true if the region size is within the limit.
▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines	struct BlockScheduling {
/// The maximum size allowed for the scheduling region.		/// The maximum size allowed for the scheduling region.
int ScheduleRegionSizeLimit = ScheduleRegionSizeBudget;		int ScheduleRegionSizeLimit = ScheduleRegionSizeBudget;

/// The ID of the scheduling region. For a new vectorization iteration this		/// The ID of the scheduling region. For a new vectorization iteration this
/// is incremented which "removes" all ScheduleData from the region.		/// is incremented which "removes" all ScheduleData from the region.
/// Make sure that the initial SchedulingRegionID is greater than the		/// Make sure that the initial SchedulingRegionID is greater than the
/// initial SchedulingRegionID in ScheduleData (which is 0).		/// initial SchedulingRegionID in ScheduleData (which is 0).
int SchedulingRegionID = 1;		int SchedulingRegionID = 1;

		/// This is set to true if we need to skip scheduling for this block.
		bool DisableScheduling = false;
};		};

/// Attaches the BlockScheduling structures to basic blocks.		/// Attaches the BlockScheduling structures to basic blocks.
MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;		MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);
▲ Show 20 Lines • Show All 914 Lines • ▼ Show 20 Lines	void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst) {		ArrayRef<Value *> UserIgnoreLst) {
deleteTree();		deleteTree();
UserIgnoreList = UserIgnoreLst;		UserIgnoreList = UserIgnoreLst;
if (!allSameType(Roots))		if (!allSameType(Roots))
return;		return;
buildTree_rec(Roots, 0, EdgeInfo());		buildTree_rec(Roots, 0, EdgeInfo());
}		}

		InstructionCost BoUpSLP::buildTreeFastAndGetCost(ArrayRef<Value *> Roots,
		Optional<unsigned> MaxSize,
		bool DisableScheduling) {
		deleteTree();
		MaxTreeSize = MaxSize;
		ForceDisableScheduling = DisableScheduling;
		if (!allSameType(Roots))
		return InstructionCost::getMax();

		buildTree_rec(Roots, 0, EdgeInfo());

		if (isTreeTinyAndNotFullyVectorizable())
		return InstructionCost::getMax();
		reorderTopToBottom();
		reorderBottomToTop(!isa<InsertElementInst>(Roots.front()));
		buildExternalUses();
		computeMinimumValueSizes();
		return getTreeCost();
		}

namespace {		namespace {
/// Tracks the state we can represent the loads in the given sequence.		/// Tracks the state we can represent the loads in the given sequence.
enum class LoadsState { Gather, Vectorize, ScatterVectorize };		enum class LoadsState { Gather, Vectorize, ScatterVectorize };
} // anonymous namespace		} // anonymous namespace

/// Checks if the given array of loads can be represented as a vectorized,		/// Checks if the given array of loads can be represented as a vectorized,
/// scatter or just simple gather.		/// scatter or just simple gather.
static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,		static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
▲ Show 20 Lines • Show All 146 Lines • ▼ Show 20 Lines	if (auto *LI = dyn_cast<LoadInst>(V)) {
}		}
Key = hash_combine(hash_value(I->getParent()), Key);		Key = hash_combine(hash_value(I->getParent()), Key);
}		}
return std::make_pair(Key, SubKey);		return std::make_pair(Key, SubKey);
}		}

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
const EdgeInfo &UserTreeIdx) {		const EdgeInfo &UserTreeIdx) {
		// If we are building a fast approximate tree, then early return once we reach
		// the tree size limit.
		if (MaxTreeSize && VectorizableTree.size() >= *MaxTreeSize) {
		LLVM_DEBUG(dbgs() << "SLP: Reached max tree size " << *MaxTreeSize
		<< ".\n");
		return;
		}
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

SmallVector<int> ReuseShuffleIndicies;		SmallVector<int> ReuseShuffleIndicies;
SmallVector<Value *> UniqueValues;		SmallVector<Value *> UniqueValues;
auto &&TryToFindDuplicates = [&VL, &ReuseShuffleIndicies, &UniqueValues,		auto &&TryToFindDuplicates = [&VL, &ReuseShuffleIndicies, &UniqueValues,
&UserTreeIdx,		&UserTreeIdx,
this](const InstructionsState &S) {		this](const InstructionsState &S) {
// Check that every instruction appears once in this bundle.		// Check that every instruction appears once in this bundle.
▲ Show 20 Lines • Show All 153 Lines • ▼ Show 20 Lines	void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
if (!TryToFindDuplicates(S))		if (!TryToFindDuplicates(S))
return;		return;

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = BlocksSchedules[BB];
if (!BSRef)		if (!BSRef)
BSRef = std::make_unique<BlockScheduling>(BB);		BSRef = std::make_unique<BlockScheduling>(BB);

BlockScheduling &BS = *BSRef;		BlockScheduling &BS = *BSRef;
		BS.DisableScheduling = ForceDisableScheduling;

Optional<ScheduleData *> Bundle = BS.tryScheduleBundle(VL, this, S);		Optional<ScheduleData *> Bundle = !ForceDisableScheduling
		? BS.tryScheduleBundle(VL, this, S)
		: BlockScheduling::getDummyBundle();
#ifdef EXPENSIVE_CHECKS		#ifdef EXPENSIVE_CHECKS
// Make sure we didn't break any internal invariants		// Make sure we didn't break any internal invariants
BS.verify();		BS.verify();
#endif		#endif
if (!Bundle) {		if (!ForceDisableScheduling && !Bundle) {
LLVM_DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		LLVM_DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
assert((!BS.getScheduleData(VL0) ||		assert((!BS.getScheduleData(VL0) ||
!BS.getScheduleData(VL0)->isPartOfBundle()) &&		!BS.getScheduleData(VL0)->isPartOfBundle()) &&
"tryScheduleBundle should cancelScheduling on failure");		"tryScheduleBundle should cancelScheduling on failure");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
return;		return;
}		}
LLVM_DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");		LLVM_DEBUG(dbgs() << (ForceDisableScheduling
		? "SLP: Scheduling was disabled.\n"
		: "SLP: We are able to schedule this bundle.\n"));

unsigned ShuffleOrOp = S.isAltShuffle() ?		unsigned ShuffleOrOp = S.isAltShuffle() ?
(unsigned) Instruction::ShuffleVector : S.getOpcode();		(unsigned) Instruction::ShuffleVector : S.getOpcode();
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
auto *PH = cast<PHINode>(VL0);		auto *PH = cast<PHINode>(VL0);

// Check for terminator values (e.g. invoke).		// Check for terminator values (e.g. invoke).
▲ Show 20 Lines • Show All 3,625 Lines • ▼ Show 20 Lines	BoUpSLP::BlockScheduling::buildBundle(ArrayRef<Value *> VL) {
return Bundle;		return Bundle;
}		}

// Groups the instructions to a bundle (which is then a single scheduling entity)		// Groups the instructions to a bundle (which is then a single scheduling entity)
// and schedules instructions until the bundle gets ready.		// and schedules instructions until the bundle gets ready.
Optional<BoUpSLP::ScheduleData *>		Optional<BoUpSLP::ScheduleData *>
BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,		BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
const InstructionsState &S) {		const InstructionsState &S) {
		assert(!DisableScheduling && "Trying to schedule when disabled");
// No need to schedule PHIs, insertelement, extractelement and extractvalue		// No need to schedule PHIs, insertelement, extractelement and extractvalue
// instructions.		// instructions.
if (isa<PHINode>(S.OpValue) || isVectorLikeInstWithConstOps(S.OpValue) ||		if (isa<PHINode>(S.OpValue) || isVectorLikeInstWithConstOps(S.OpValue) ||
doesNotNeedToSchedule(VL))		doesNotNeedToSchedule(VL))
return nullptr;		return nullptr;

// Initialize the instruction bundle.		// Initialize the instruction bundle.
Instruction *OldScheduleEnd = ScheduleEnd;		Instruction *OldScheduleEnd = ScheduleEnd;
▲ Show 20 Lines • Show All 80 Lines • ▼ Show 20 Lines	if (!Bundle->isReady()) {
cancelScheduling(VL, S.OpValue);		cancelScheduling(VL, S.OpValue);
return None;		return None;
}		}
return Bundle;		return Bundle;
}		}

void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,		void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
Value *OpValue) {		Value *OpValue) {
		// Early return if scheduling is disabled.
		if (DisableScheduling)
		return;
if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue) ||		if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue) ||
doesNotNeedToSchedule(VL))		doesNotNeedToSchedule(VL))
return;		return;

if (doesNotNeedToBeScheduled(OpValue))		if (doesNotNeedToBeScheduled(OpValue))
OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);		OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);
ScheduleData *Bundle = getScheduleData(OpValue);		ScheduleData *Bundle = getScheduleData(OpValue);
LLVM_DEBUG(dbgs() << "SLP: cancel scheduling of " << *Bundle << "\n");		LLVM_DEBUG(dbgs() << "SLP: cancel scheduling of " << *Bundle << "\n");
▲ Show 20 Lines • Show All 3,234 Lines • Show Last 20 Lines
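
The idea behind `buildTreeFastAndGetCost` (build a size-capped tree per candidate root, then compare costs) can be illustrated with a toy model. The sketch below is not the patch's implementation: `Node`, `BoundedTree`, `boundedCost` and `pickBestRoot` are hypothetical names, each node carries a made-up integer cost, and reordering, external uses, minimum value sizes and scheduling are not modelled at all.

```cpp
#include <cstddef>
#include <limits>
#include <optional>
#include <vector>

// Toy expression node. Cost is an invented per-node cost delta
// (negative means vectorizing this node would be profitable).
struct Node {
  int Cost;
  std::vector<const Node *> Operands;
};

// Depth-first walk that stops adding entries once MaxSize nodes were visited,
// mirroring the early return on MaxTreeSize in the diff above.
struct BoundedTree {
  std::optional<unsigned> MaxSize;
  unsigned Size = 0;
  int Cost = 0;

  void build(const Node *N) {
    if (MaxSize && Size >= *MaxSize)
      return; // Size cap reached: leave the rest of the tree unexplored.
    ++Size;
    Cost += N->Cost;
    for (const Node *Op : N->Operands)
      build(Op);
  }
};

// Cost of the (possibly truncated) tree rooted at Root.
int boundedCost(const Node *Root, std::optional<unsigned> MaxSize) {
  BoundedTree T;
  T.MaxSize = MaxSize;
  T.build(Root);
  return T.Cost;
}

// Returns the index of the root whose bounded tree is cheapest, or
// std::nullopt if there are no candidates. Ties keep the earliest index.
std::optional<std::size_t> pickBestRoot(const std::vector<const Node *> &Roots,
                                        std::optional<unsigned> MaxSize) {
  std::optional<std::size_t> Best;
  int BestCost = std::numeric_limits<int>::max();
  for (std::size_t I = 0; I < Roots.size(); ++I) {
    int C = boundedCost(Roots[I], MaxSize);
    if (C < BestCost) {
      BestCost = C;
      Best = I;
    }
  }
  return Best;
}
```

In this toy model the cap is a trade-off: a small `MaxSize` bounds the work per candidate, which speaks to the compile-time concern raised in review, but a truncated tree can rank the roots differently from the full one.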

llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector-inseltpoison.ll

	Show First 20 Lines • Show All 151 Lines • ▼ Show 20 Lines
	; MINTREESIZE-NEXT: [[Q3:%.*]] = extractelement <4 x float> [[RD]], i32 3			; MINTREESIZE-NEXT: [[Q3:%.*]] = extractelement <4 x float> [[RD]], i32 3
	; MINTREESIZE-NEXT: [[TMP7:%.*]] = insertelement <2 x float> poison, float [[Q2]], i32 0			; MINTREESIZE-NEXT: [[TMP7:%.*]] = insertelement <2 x float> poison, float [[Q2]], i32 0
	; MINTREESIZE-NEXT: [[TMP8:%.*]] = insertelement <2 x float> [[TMP7]], float [[Q3]], i32 1			; MINTREESIZE-NEXT: [[TMP8:%.*]] = insertelement <2 x float> [[TMP7]], float [[Q3]], i32 1
	; MINTREESIZE-NEXT: [[Q4:%.*]] = fadd float [[Q0]], [[Q1]]			; MINTREESIZE-NEXT: [[Q4:%.*]] = fadd float [[Q0]], [[Q1]]
	; MINTREESIZE-NEXT: [[Q5:%.*]] = fadd float [[Q2]], [[Q3]]			; MINTREESIZE-NEXT: [[Q5:%.*]] = fadd float [[Q2]], [[Q3]]
	; MINTREESIZE-NEXT: [[TMP9:%.*]] = insertelement <2 x float> poison, float [[Q4]], i32 0			; MINTREESIZE-NEXT: [[TMP9:%.*]] = insertelement <2 x float> poison, float [[Q4]], i32 0
	; MINTREESIZE-NEXT: [[TMP10:%.*]] = insertelement <2 x float> [[TMP9]], float [[Q5]], i32 1			; MINTREESIZE-NEXT: [[TMP10:%.*]] = insertelement <2 x float> [[TMP9]], float [[Q5]], i32 1
	; MINTREESIZE-NEXT: [[Q6:%.*]] = fadd float [[Q4]], [[Q5]]			; MINTREESIZE-NEXT: [[Q6:%.*]] = fadd float [[Q4]], [[Q5]]
				; MINTREESIZE-NEXT: [[TMP11:%.*]] = insertelement <2 x float> poison, float [[Q6]], i32 0
				; MINTREESIZE-NEXT: [[TMP12:%.*]] = insertelement <2 x float> [[TMP11]], float [[Q5]], i32 1
	; MINTREESIZE-NEXT: [[QI:%.*]] = fcmp olt float [[Q6]], [[Q5]]			; MINTREESIZE-NEXT: [[QI:%.*]] = fcmp olt float [[Q6]], [[Q5]]
	; MINTREESIZE-NEXT: call void @llvm.assume(i1 [[QI]])			; MINTREESIZE-NEXT: call void @llvm.assume(i1 [[QI]])
	; MINTREESIZE-NEXT: ret <4 x float> undef			; MINTREESIZE-NEXT: ret <4 x float> undef
	;			;
	%c0 = extractelement <4 x i32> %c, i32 0			%c0 = extractelement <4 x i32> %c, i32 0
	%c1 = extractelement <4 x i32> %c, i32 1			%c1 = extractelement <4 x i32> %c, i32 1
	%c2 = extractelement <4 x i32> %c, i32 2			%c2 = extractelement <4 x i32> %c, i32 2
	%c3 = extractelement <4 x i32> %c, i32 3			%c3 = extractelement <4 x i32> %c, i32 3
	▲ Show 20 Lines • Show All 417 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll

	Show First 20 Lines • Show All 186 Lines • ▼ Show 20 Lines
	; MINTREESIZE-NEXT: [[Q3:%.*]] = extractelement <4 x float> [[RD]], i32 3			; MINTREESIZE-NEXT: [[Q3:%.*]] = extractelement <4 x float> [[RD]], i32 3
	; MINTREESIZE-NEXT: [[TMP7:%.*]] = insertelement <2 x float> poison, float [[Q2]], i32 0			; MINTREESIZE-NEXT: [[TMP7:%.*]] = insertelement <2 x float> poison, float [[Q2]], i32 0
	; MINTREESIZE-NEXT: [[TMP8:%.*]] = insertelement <2 x float> [[TMP7]], float [[Q3]], i32 1			; MINTREESIZE-NEXT: [[TMP8:%.*]] = insertelement <2 x float> [[TMP7]], float [[Q3]], i32 1
	; MINTREESIZE-NEXT: [[Q4:%.*]] = fadd float [[Q0]], [[Q1]]			; MINTREESIZE-NEXT: [[Q4:%.*]] = fadd float [[Q0]], [[Q1]]
	; MINTREESIZE-NEXT: [[Q5:%.*]] = fadd float [[Q2]], [[Q3]]			; MINTREESIZE-NEXT: [[Q5:%.*]] = fadd float [[Q2]], [[Q3]]
	; MINTREESIZE-NEXT: [[TMP9:%.*]] = insertelement <2 x float> poison, float [[Q4]], i32 0			; MINTREESIZE-NEXT: [[TMP9:%.*]] = insertelement <2 x float> poison, float [[Q4]], i32 0
	; MINTREESIZE-NEXT: [[TMP10:%.*]] = insertelement <2 x float> [[TMP9]], float [[Q5]], i32 1			; MINTREESIZE-NEXT: [[TMP10:%.*]] = insertelement <2 x float> [[TMP9]], float [[Q5]], i32 1
	; MINTREESIZE-NEXT: [[Q6:%.*]] = fadd float [[Q4]], [[Q5]]			; MINTREESIZE-NEXT: [[Q6:%.*]] = fadd float [[Q4]], [[Q5]]
				; MINTREESIZE-NEXT: [[TMP11:%.*]] = insertelement <2 x float> poison, float [[Q6]], i32 0
				; MINTREESIZE-NEXT: [[TMP12:%.*]] = insertelement <2 x float> [[TMP11]], float [[Q5]], i32 1
	; MINTREESIZE-NEXT: [[QI:%.*]] = fcmp olt float [[Q6]], [[Q5]]			; MINTREESIZE-NEXT: [[QI:%.*]] = fcmp olt float [[Q6]], [[Q5]]
	; MINTREESIZE-NEXT: call void @llvm.assume(i1 [[QI]])			; MINTREESIZE-NEXT: call void @llvm.assume(i1 [[QI]])
	; MINTREESIZE-NEXT: ret <4 x float> undef			; MINTREESIZE-NEXT: ret <4 x float> undef
	;			;
	%c0 = extractelement <4 x i32> %c, i32 0			%c0 = extractelement <4 x i32> %c, i32 0
	%c1 = extractelement <4 x i32> %c, i32 1			%c1 = extractelement <4 x i32> %c, i32 1
	%c2 = extractelement <4 x i32> %c, i32 2			%c2 = extractelement <4 x i32> %c, i32 2
	%c3 = extractelement <4 x i32> %c, i32 3			%c3 = extractelement <4 x i32> %c, i32 3
	▲ Show 20 Lines • Show All 417 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/pr48879-sroa.ll

	Show First 20 Lines • Show All 70 Lines • ▼ Show 20 Lines
	;			;
	; AVX-LABEL: @compute_min(			; AVX-LABEL: @compute_min(
	; AVX-NEXT: entry:			; AVX-NEXT: entry:
	; AVX-NEXT: [[TMP0:%.*]] = load i16, ptr [[Y:%.*]], align 2			; AVX-NEXT: [[TMP0:%.*]] = load i16, ptr [[Y:%.*]], align 2
	; AVX-NEXT: [[TMP1:%.*]] = load i16, ptr [[X:%.*]], align 2			; AVX-NEXT: [[TMP1:%.*]] = load i16, ptr [[X:%.*]], align 2
	; AVX-NEXT: [[TMP2:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP0]], i16 [[TMP1]])			; AVX-NEXT: [[TMP2:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP0]], i16 [[TMP1]])
	; AVX-NEXT: [[ARRAYIDX_I_I_1:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 1			; AVX-NEXT: [[ARRAYIDX_I_I_1:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 1
	; AVX-NEXT: [[ARRAYIDX_I_I10_1:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 1			; AVX-NEXT: [[ARRAYIDX_I_I10_1:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 1
	; AVX-NEXT: [[TMP3:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_1]], align 2			; AVX-NEXT: [[ARRAYIDX_I_I_3:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 3
	; AVX-NEXT: [[TMP4:%.*]] = load i16, ptr [[ARRAYIDX_I_I_1]], align 2			; AVX-NEXT: [[ARRAYIDX_I_I10_3:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 3
				; AVX-NEXT: [[TMP3:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_3]], align 2
				; AVX-NEXT: [[TMP4:%.*]] = load i16, ptr [[ARRAYIDX_I_I_3]], align 2
	; AVX-NEXT: [[TMP5:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP3]], i16 [[TMP4]])			; AVX-NEXT: [[TMP5:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP3]], i16 [[TMP4]])
	; AVX-NEXT: [[ARRAYIDX_I_I_2:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 2
	; AVX-NEXT: [[ARRAYIDX_I_I10_2:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 2
	; AVX-NEXT: [[ARRAYIDX_I_I_4:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 4			; AVX-NEXT: [[ARRAYIDX_I_I_4:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 4
	; AVX-NEXT: [[ARRAYIDX_I_I10_4:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 4			; AVX-NEXT: [[ARRAYIDX_I_I10_4:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 4
	; AVX-NEXT: [[TMP6:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_4]], align 2			; AVX-NEXT: [[TMP6:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_4]], align 2
	; AVX-NEXT: [[TMP7:%.*]] = load i16, ptr [[ARRAYIDX_I_I_4]], align 2			; AVX-NEXT: [[TMP7:%.*]] = load i16, ptr [[ARRAYIDX_I_I_4]], align 2
	; AVX-NEXT: [[TMP8:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP6]], i16 [[TMP7]])			; AVX-NEXT: [[TMP8:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP6]], i16 [[TMP7]])
	; AVX-NEXT: [[ARRAYIDX_I_I_5:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 5			; AVX-NEXT: [[ARRAYIDX_I_I_5:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 5
	; AVX-NEXT: [[ARRAYIDX_I_I10_5:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 5			; AVX-NEXT: [[ARRAYIDX_I_I10_5:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 5
	; AVX-NEXT: [[TMP9:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_5]], align 2			; AVX-NEXT: [[ARRAYIDX_I_I_7:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 7
	; AVX-NEXT: [[TMP10:%.*]] = load i16, ptr [[ARRAYIDX_I_I_5]], align 2			; AVX-NEXT: [[ARRAYIDX_I_I10_7:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 7
				; AVX-NEXT: [[TMP9:%.*]] = load i16, ptr [[ARRAYIDX_I_I10_7]], align 2
				; AVX-NEXT: [[TMP10:%.*]] = load i16, ptr [[ARRAYIDX_I_I_7]], align 2
	; AVX-NEXT: [[TMP11:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP9]], i16 [[TMP10]])			; AVX-NEXT: [[TMP11:%.*]] = tail call i16 @llvm.smin.i16(i16 [[TMP9]], i16 [[TMP10]])
	; AVX-NEXT: [[ARRAYIDX_I_I_6:%.*]] = getelementptr inbounds [8 x i16], ptr [[X]], i64 0, i64 6			; AVX-NEXT: [[RETVAL_SROA_4_0_INSERT_EXT:%.*]] = zext i16 [[TMP5]] to i64
	; AVX-NEXT: [[ARRAYIDX_I_I10_6:%.*]] = getelementptr inbounds [8 x i16], ptr [[Y]], i64 0, i64 6			; AVX-NEXT: [[RETVAL_SROA_4_0_INSERT_SHIFT:%.*]] = shl nuw i64 [[RETVAL_SROA_4_0_INSERT_EXT]], 48
	; AVX-NEXT: [[TMP12:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I10_2]], align 2			; AVX-NEXT: [[TMP12:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I10_1]], align 2
	; AVX-NEXT: [[TMP13:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I_2]], align 2			; AVX-NEXT: [[TMP13:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I_1]], align 2
	; AVX-NEXT: [[TMP14:%.*]] = call <2 x i16> @llvm.smin.v2i16(<2 x i16> [[TMP12]], <2 x i16> [[TMP13]])			; AVX-NEXT: [[TMP14:%.*]] = call <2 x i16> @llvm.smin.v2i16(<2 x i16> [[TMP12]], <2 x i16> [[TMP13]])
	; AVX-NEXT: [[TMP15:%.*]] = zext <2 x i16> [[TMP14]] to <2 x i64>			; AVX-NEXT: [[TMP15:%.*]] = zext <2 x i16> [[TMP14]] to <2 x i64>
	; AVX-NEXT: [[TMP16:%.*]] = shl nuw <2 x i64> [[TMP15]], <i64 32, i64 48>			; AVX-NEXT: [[TMP16:%.*]] = shl nuw nsw <2 x i64> [[TMP15]], <i64 16, i64 32>
	; AVX-NEXT: [[TMP17:%.*]] = extractelement <2 x i64> [[TMP16]], i32 0			; AVX-NEXT: [[TMP17:%.*]] = extractelement <2 x i64> [[TMP16]], i32 1
	; AVX-NEXT: [[TMP18:%.*]] = extractelement <2 x i64> [[TMP16]], i32 1			; AVX-NEXT: [[RETVAL_SROA_3_0_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_4_0_INSERT_SHIFT]], [[TMP17]]
	; AVX-NEXT: [[RETVAL_SROA_3_0_INSERT_INSERT:%.*]] = or i64 [[TMP18]], [[TMP17]]			; AVX-NEXT: [[TMP18:%.*]] = extractelement <2 x i64> [[TMP16]], i32 0
	; AVX-NEXT: [[RETVAL_SROA_2_0_INSERT_EXT:%.*]] = zext i16 [[TMP5]] to i64			; AVX-NEXT: [[RETVAL_SROA_2_0_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_3_0_INSERT_INSERT]], [[TMP18]]
	; AVX-NEXT: [[RETVAL_SROA_2_0_INSERT_SHIFT:%.*]] = shl nuw nsw i64 [[RETVAL_SROA_2_0_INSERT_EXT]], 16
	; AVX-NEXT: [[RETVAL_SROA_2_0_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_3_0_INSERT_INSERT]], [[RETVAL_SROA_2_0_INSERT_SHIFT]]
	; AVX-NEXT: [[RETVAL_SROA_0_0_INSERT_EXT:%.*]] = zext i16 [[TMP2]] to i64			; AVX-NEXT: [[RETVAL_SROA_0_0_INSERT_EXT:%.*]] = zext i16 [[TMP2]] to i64
	; AVX-NEXT: [[RETVAL_SROA_0_0_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_2_0_INSERT_INSERT]], [[RETVAL_SROA_0_0_INSERT_EXT]]			; AVX-NEXT: [[RETVAL_SROA_0_0_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_2_0_INSERT_INSERT]], [[RETVAL_SROA_0_0_INSERT_EXT]]
	; AVX-NEXT: [[DOTFCA_0_INSERT:%.*]] = insertvalue { i64, i64 } poison, i64 [[RETVAL_SROA_0_0_INSERT_INSERT]], 0			; AVX-NEXT: [[DOTFCA_0_INSERT:%.*]] = insertvalue { i64, i64 } poison, i64 [[RETVAL_SROA_0_0_INSERT_INSERT]], 0
	; AVX-NEXT: [[TMP19:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I10_6]], align 2			; AVX-NEXT: [[RETVAL_SROA_9_8_INSERT_EXT:%.*]] = zext i16 [[TMP11]] to i64
	; AVX-NEXT: [[TMP20:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I_6]], align 2			; AVX-NEXT: [[RETVAL_SROA_9_8_INSERT_SHIFT:%.*]] = shl nuw i64 [[RETVAL_SROA_9_8_INSERT_EXT]], 48
				; AVX-NEXT: [[TMP19:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I10_5]], align 2
				; AVX-NEXT: [[TMP20:%.*]] = load <2 x i16>, ptr [[ARRAYIDX_I_I_5]], align 2
	; AVX-NEXT: [[TMP21:%.*]] = call <2 x i16> @llvm.smin.v2i16(<2 x i16> [[TMP19]], <2 x i16> [[TMP20]])			; AVX-NEXT: [[TMP21:%.*]] = call <2 x i16> @llvm.smin.v2i16(<2 x i16> [[TMP19]], <2 x i16> [[TMP20]])
	; AVX-NEXT: [[TMP22:%.*]] = zext <2 x i16> [[TMP21]] to <2 x i64>			; AVX-NEXT: [[TMP22:%.*]] = zext <2 x i16> [[TMP21]] to <2 x i64>
	; AVX-NEXT: [[TMP23:%.*]] = shl nuw <2 x i64> [[TMP22]], <i64 32, i64 48>			; AVX-NEXT: [[TMP23:%.*]] = shl nuw nsw <2 x i64> [[TMP22]], <i64 16, i64 32>
	; AVX-NEXT: [[TMP24:%.*]] = extractelement <2 x i64> [[TMP23]], i32 0			; AVX-NEXT: [[TMP24:%.*]] = extractelement <2 x i64> [[TMP23]], i32 1
	; AVX-NEXT: [[TMP25:%.*]] = extractelement <2 x i64> [[TMP23]], i32 1			; AVX-NEXT: [[RETVAL_SROA_8_8_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_9_8_INSERT_SHIFT]], [[TMP24]]
	; AVX-NEXT: [[RETVAL_SROA_8_8_INSERT_INSERT:%.*]] = or i64 [[TMP25]], [[TMP24]]			; AVX-NEXT: [[TMP25:%.*]] = extractelement <2 x i64> [[TMP23]], i32 0
	; AVX-NEXT: [[RETVAL_SROA_7_8_INSERT_EXT:%.*]] = zext i16 [[TMP11]] to i64			; AVX-NEXT: [[RETVAL_SROA_7_8_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_8_8_INSERT_INSERT]], [[TMP25]]
	; AVX-NEXT: [[RETVAL_SROA_7_8_INSERT_SHIFT:%.*]] = shl nuw nsw i64 [[RETVAL_SROA_7_8_INSERT_EXT]], 16
	; AVX-NEXT: [[RETVAL_SROA_7_8_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_8_8_INSERT_INSERT]], [[RETVAL_SROA_7_8_INSERT_SHIFT]]
	; AVX-NEXT: [[RETVAL_SROA_5_8_INSERT_EXT:%.*]] = zext i16 [[TMP8]] to i64			; AVX-NEXT: [[RETVAL_SROA_5_8_INSERT_EXT:%.*]] = zext i16 [[TMP8]] to i64
	; AVX-NEXT: [[RETVAL_SROA_5_8_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_7_8_INSERT_INSERT]], [[RETVAL_SROA_5_8_INSERT_EXT]]			; AVX-NEXT: [[RETVAL_SROA_5_8_INSERT_INSERT:%.*]] = or i64 [[RETVAL_SROA_7_8_INSERT_INSERT]], [[RETVAL_SROA_5_8_INSERT_EXT]]
	; AVX-NEXT: [[DOTFCA_1_INSERT:%.*]] = insertvalue { i64, i64 } [[DOTFCA_0_INSERT]], i64 [[RETVAL_SROA_5_8_INSERT_INSERT]], 1			; AVX-NEXT: [[DOTFCA_1_INSERT:%.*]] = insertvalue { i64, i64 } [[DOTFCA_0_INSERT]], i64 [[RETVAL_SROA_5_8_INSERT_INSERT]], 1
	; AVX-NEXT: ret { i64, i64 } [[DOTFCA_1_INSERT]]			; AVX-NEXT: ret { i64, i64 } [[DOTFCA_1_INSERT]]
	;			;
	entry:			entry:
	%0 = load i16, ptr %y, align 2			%0 = load i16, ptr %y, align 2
	%1 = load i16, ptr %x, align 2			%1 = load i16, ptr %x, align 2
	▲ Show 20 Lines • Show All 61 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/vectorize-pair-path.ll

	Show First 20 Lines • Show All 55 Lines • ▼ Show 20 Lines

	attributes #0 = { "unsafe-fp-math"="true" }			attributes #0 = { "unsafe-fp-math"="true" }

	; This test checks that root steering works and that the code gets vectorized.			; This test checks that root steering works and that the code gets vectorized.

	define void @root_steering() {			define void @root_steering() {
	; CHECK-LABEL: @root_steering(			; CHECK-LABEL: @root_steering(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[CHAIN2_2:%.*]] = fadd double 4.000000e-01, 5.000000e-01
	; CHECK-NEXT: [[CHAIN2_1:%.*]] = fmul double 3.000000e-01, [[CHAIN2_2]]
	; CHECK-NEXT: [[ROOT5:%.*]] = fadd double 2.000000e-01, [[CHAIN2_1]]
	; CHECK-NEXT: [[ROOT3:%.*]] = fmul double 3.000000e-01, 2.000000e-01			; CHECK-NEXT: [[ROOT3:%.*]] = fmul double 3.000000e-01, 2.000000e-01
	; CHECK-NEXT: [[MUL:%.*]] = fmul double [[ROOT3]], 1.000000e-01			; CHECK-NEXT: [[MUL:%.*]] = fmul double [[ROOT3]], 1.000000e-01
	; CHECK-NEXT: [[CHAINB_3:%.*]] = fadd double 3.000000e-01, 4.000000e-01			; CHECK-NEXT: [[CHAINB_3:%.*]] = fadd double 3.000000e-01, 4.000000e-01
	; CHECK-NEXT: [[CHAINB_2:%.*]] = fmul double 2.000000e-01, [[CHAINB_3]]			; CHECK-NEXT: [[CHAINB_2:%.*]] = fmul double 2.000000e-01, [[CHAINB_3]]
	; CHECK-NEXT: [[CHAINB_1:%.*]] = fadd double 1.000000e-01, [[CHAINB_2]]			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> <double 5.000000e-01, double poison>, double [[CHAINB_2]], i32 1
	; CHECK-NEXT: [[ROOT4:%.*]] = fmul double [[MUL]], [[CHAINB_1]]			; CHECK-NEXT: [[TMP1:%.*]] = fadd <2 x double> <double 4.000000e-01, double 1.000000e-01>, [[TMP0]]
	; CHECK-NEXT: [[ROOT2:%.*]] = fadd double 1.000000e-01, [[ROOT4]]			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> <double 3.000000e-01, double poison>, double [[MUL]], i32 1
	; CHECK-NEXT: [[ROOT1:%.*]] = fmul double [[ROOT3]], [[ROOT5]]			; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x double> [[TMP2]], [[TMP1]]
	; CHECK-NEXT: [[DIV:%.*]] = fdiv double [[ROOT1]], [[ROOT2]]			; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> <double 2.000000e-01, double 1.000000e-01>, [[TMP3]]
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP4]], i32 0
				; CHECK-NEXT: [[ROOT1:%.*]] = fmul double [[ROOT3]], [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP4]], i32 1
				; CHECK-NEXT: [[DIV:%.*]] = fdiv double [[ROOT1]], [[TMP6]]
	; CHECK-NEXT: [[SEED:%.*]] = fcmp ogt double [[DIV]], 3.000000e-01			; CHECK-NEXT: [[SEED:%.*]] = fcmp ogt double [[DIV]], 3.000000e-01
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%chain2_2 = fadd double 0.4, 0.5			%chain2_2 = fadd double 0.4, 0.5
	%chain2_1 = fmul double 0.3, %chain2_2			%chain2_1 = fmul double 0.3, %chain2_2
	%root5 = fadd double 0.2, %chain2_1			%root5 = fadd double 0.2, %chain2_1

	Show All 15 Lines