This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Optionally preserve MemorySSA
ClosedPublic

Authored by reames on Jan 21 2022, 1:29 PM.

Details

Summary

This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that.

In particular, this patch is correct, but really slow. I want to land this so that the slightly more delicate compile time optimization patches are individually reviewable.

Edit: Forgot to say, this is my first time using MemorySSA for anything, so skeptical review is very much warranted.

Note that I'm intentionally not preserving MemorySSA even if available. The current update code is *so slow* that it's faster to simply rebuild after SLP is done.

Suggestions on how to make this reasonably fast are welcome, but I'd strongly prefer to work incrementally and address compile-time concerns in follow-up patches.
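
For reference, the overall wiring looks roughly like the sketch below. This is illustrative only: the flag name and the runImpl plumbing are hypothetical, not this patch's actual diff; the analysis and preservation calls are the real new-PM APIs.

    static cl::opt<bool> SLPPreserveMemorySSA(
        "slp-preserve-memoryssa", cl::init(false), cl::Hidden,
        cl::desc("Build and keep MemorySSA up to date during SLP"));

    PreservedAnalyses SLPVectorizerPass::run(Function &F,
                                             FunctionAnalysisManager &AM) {
      // Only build MemorySSA when explicitly requested; construction is
      // cheap relative to SLP's worst case, but not free.
      MemorySSA *MSSA = SLPPreserveMemorySSA
                            ? &AM.getResult<MemorySSAAnalysis>(F).getMSSA()
                            : nullptr;
      bool Changed = runImpl(F, AM, MSSA); // hypothetical plumbing
      if (!Changed)
        return PreservedAnalyses::all();
      PreservedAnalyses PA;
      PA.preserveSet<CFGAnalyses>();
      if (MSSA)
        PA.preserve<MemorySSAAnalysis>(); // valid only if updates kept it in sync
      return PA;
    }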

Diff Detail

Event Timeline

reames created this revision.Jan 21 2022, 1:29 PM
reames requested review of this revision.Jan 21 2022, 1:29 PM
Herald added a project: Restricted Project.Jan 21 2022, 1:29 PM
reames edited the summary of this revision.Jan 21 2022, 1:31 PM
nikic added a comment.Jan 21 2022, 1:37 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

reames added a comment.EditedJan 21 2022, 1:50 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

Sure, though I'm a bit limited in what I can say. The original example is not public.

Essentially, I have a case where we are spending a large fraction of total O2 time inside SLP - specifically, inside the code which is figuring out which memory dependencies exist while trying to schedule. (To prevent confusion, note that SLP scheduling subsumes several legality tests.)

Specifically, the case which is hurting this example - which is machine generated code - is a very long basic block with a vectorizable pair of loads at the beginning, and a vectorizable pair of stores (consuming the loaded values) at the end. There are multiple pairs, but the core detail is that the required scheduling window is basically the entire size of the huge basic block.

The time is spent figuring out dependencies for *scalar* instructions - not even the ones we're trying to vectorize. Since this is such a huge block, the current MSSA-like memory chain ends up being very expensive.

I'd explored options for limiting the scheduling window, but MSSA felt like a more general answer, so I started there.

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

I'm really surprised to hear you say that. My understanding was that MemorySSA was rather cheap to construct if you don't need an optimized form, and that optimization was done lazily.

However, I see my memory of prior discussion on this topic is clearly wrong. The constructor for MemorySSA does appear to eagerly optimize.

Despite this, I don't see MemorySSAAnalysis showing up as expensive in the -time-passes-per-run output even with this change. I see SLP itself slow down a lot, but I had put that down to the generic renaming instead of using specialized knowledge from the call site.

Edit: I confirmed the pass profiling result by nulling out MSSA immediately after the getResult call. The runtime drops to basically nothing over the non-MSSA version (i.e. measurement noise). So despite the optimization at construction, it really is the updates which are expensive in this case. It's possible my example is highly unrepresentative, but that seems questionable. Any theories?

Ok, I was able to spot the additional construction time. It took about 15 ms.

For context, the original example spends about 3.23 seconds in SLP w/o MSSA, and the (horribly unoptimized) preservation currently takes an additional 5.25 seconds on top of that.

Quite literally, different orders of magnitude. If we can get MSSA preservation down to something reasonable - again, incrementalism, please - I'd argue using it here is entirely reasonable.

nikic added a comment.Jan 21 2022, 2:57 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

Sure, though I'm a bit limited in what I can say. The original example is not public.

Essentially, I have a case where we are spending a large fraction of total O2 time inside SLP - specifically, inside the code which is figuring out which memory dependencies exist while trying to schedule. (To prevent confusion, note that SLP scheduling subsumes several legality tests.)

Specifically, the case which is hurting this example - which is machine generated code - is a very long basic block with a vectorizable pair of loads at the beginning, and a vectorizable pair of stores (consuming the loaded values) at the end. There are multiple pairs, but the core detail is that the required scheduling window is basically the entire size of the huge basic block.

Is it possible to construct an artificial test case that can be shared? Just the loads/stores at the beginning and end and dummy instructions in between?
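
Something with this shape, perhaps? This is a guess at the structure, not the private reproducer:

    // A guess at the degenerate shape described above, not the actual
    // (private) reproducer: adjacent loads up top, a huge run of
    // machine-generated scalar arithmetic, adjacent stores of the loaded
    // values at the bottom -- so the scheduling window spans the block.
    void huge_block(double *__restrict Src, double *__restrict Dst, double X) {
      double A = Src[0];
      double B = Src[1];
      // Imagine tens of thousands of generated statements like these:
      X = X * X + 1.0;
      X = X * X + 2.0;
      // ...
      Dst[0] = A + X;
      Dst[1] = B + X;
    }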

The time is spent figuring out dependencies for *scalar* instructions - not even the ones we're trying to vectorize. Since this is such a huge block, the current MSSA-like memory chain ends up being very expensive.

I'd explored options for limiting the scheduling window, but MSSA felt like a more general answer, so I started there.

It sounds to me like a cutoff is what this mainly needs. We always run into degenerate cases when there is an unbounded instruction walk. (MSSA itself also limits instruction walks.)

Something you might want to try is using BatchAAResults. Assuming that all the alias checks happen without IR modifications in between, it would be safe to cache them.
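
Roughly like this (BatchAAResults is real; the loop and helper names are illustrative, not SLP's actual code):

    BatchAAResults BatchAA(*AA); // caches alias/modref answers across queries
    MemoryLocation StoreLoc = MemoryLocation::get(SI); // SI: candidate store
    for (Instruction &I : make_range(WindowBegin, WindowEnd)) {
      if (!I.mayReadOrWriteMemory())
        continue;
      // Repeated queries against the same pair are now cache hits; this is
      // safe only while no IR is modified between queries.
      if (isModOrRefSet(BatchAA.getModRefInfo(&I, StoreLoc)))
        recordDependency(&I, SI); // illustrative helper
    }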

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

I'm really surprised to hear you say that. My understanding was that MemorySSA was rather cheap to construct if you don't need an optimized form, and that optimization was done lazily.

However, I see my memory of prior discussion on this topic is clearly wrong. The constructor for MemorySSA does appear to eagerly optimize.

Yes. We discussed adding a mode that does not eagerly optimize in the past, but didn't do so (yet) for lack of a use-case.

Despite this, I don't see MemorySSAAnalysis showing up as expensive in the -time-passes-per-run output even with this change. I see SLP itself slow down a lot, but I had put that down to the generic renaming instead of using specialized knowledge from the call site.

Edit: I confirmed the pass profiling result by nulling out MSSA immediately after the getResult call. The runtime drops to basically nothing over the non-MSSA version (i.e. measurement noise). So despite the optimization at construction, it really is the updates which are expensive in this case. It's possible my example is highly unrepresentative, but that seems questionable. Any theories?

MemorySSA updates can be quite expensive -- some update operations are unexpectedly O(n). I believe the insertUse() and removeMemoryAccess() should be cheap, but insertDef() with RenameUses=true can be expensive, due to renaming. In some cases you can avoid the renaming, see for example D107702.
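
For concreteness, the updater calls in question look like this (the surrounding variables are illustrative):

    MemorySSAUpdater MSSAU(MSSA);
    // Creating and inserting a def for a newly vectorized store:
    MemoryAccess *NewMA = MSSAU.createMemoryAccessInBlock(
        NewStore, /*Definition=*/nullptr, BB, MemorySSA::BeforeTerminator);
    MSSAU.insertDef(cast<MemoryDef>(NewMA), /*RenameUses=*/true); // can be O(n)
    // Removing accesses for deleted scalar instructions, by contrast, is cheap:
    MSSAU.removeMemoryAccess(DeadScalarInst);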

Ok, I was able to spot the additional construction time. It took about 15 ms.

For context, the original example spends about 3.23 seconds in SLP w/o MSSA, and the (horribly unoptimized) preservation currently takes an additional 5.25 seconds on top of that.

Quite literally, different orders of magnitude. If we can get MSSA preservation down to something reasonable - again, incrementalism, please - I'd argue using it here is entirely reasonable.

Sure, if you're looking at a degenerate case, MSSA construction will not be a dominating factor. What I have in mind here is the average case, where SLP will be practically free (and usually just not do anything), while MSSA construction still needs to happen.

Any chance you could profile where most time is spent?

As Nikita mentioned, the insertDef and moveTo (which calls insertDef) methods are very expensive if they need to do renaming. Can you check if most time is spent in MSSA->renamePass?
I'm asking this just to get a confirmation on where the issue lies (even the walk from 7714 can be an issue if the block is that huge and there are MemoryAccesses only at the very beginning and end, though I sincerely doubt it).

There is no good way to cap these updates, as renaming is often required for a correct update. But if all the accesses are in a single block which already contains MemoryDefs, the changes shouldn't be so invasive (no new MemoryPhis should be inserted), so there's a chance to make this manageable.

Also echoing @nikic, an artificial test to replicate the issue would be great.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
8099

Without knowing SLPVectorizer details, I think this should be:
s/MemorySSA::InsertionPlace::End/MemorySSA::InsertionPlace::BeforeTerminator/.
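
For reference, the choices here are the real InsertionPlace enum from MemorySSA.h; the gloss below is my reading of the distinction, not the patch's code:

    enum InsertionPlace { Beginning, End, BeforeTerminator };
    // End appends after every access in the block, including one owned by
    // a memory-writing terminator (e.g. an invoke); BeforeTerminator
    // inserts just ahead of the terminator's access, which matches where
    // SLP actually places the new instructions.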

reames planned changes to this revision.Jan 27 2022, 1:52 PM

I am setting this aside for the moment. The consensus of review seems to be that landing a working but slow preservation and then speeding it up incrementally is not desired. I think that's a poor choice, but am willing to comply.

I will refresh this patch once I have worked through the full complexity of a working and fast MSSA preservation.

This will likely be a while: during the delay caused by the review, I found a non-MSSA-based fix for the original compile time issue and will be proceeding with that one instead. I hope to get back to the MSSA update because I think it is generally worthwhile, but it's no longer a priority for me, and I'm currently very short on time.

reames updated this revision to Diff 413553.Mar 7 2022, 10:37 AM

Refresh the patch.

Ok, I'm back to this, and want to strongly argue for this patch being accepted largely as is. I went ahead and filed an issue with a long form writeup explaining why I think this is the right approach (https://github.com/llvm/llvm-project/issues/54256).

I want to request that we defer *efficiency* of the update to separate reviews. I have investigated the cause of the slowdowns, and have a local patch which is fast on my motivating example. My local patch is definitely not fully correct as it hacks around several issues, but I'm now reasonably convinced a correct and fast preservation of MSSA is entirely possible. However, the number of interlocking and subtle changes required is well beyond anything reasonable for a single review.

If we can't unblock incremental review on this, I don't think implementing this is practical at all. There are too many undocumented assumptions about MemorySSA form, and attempting to post the entire series without being able to do cleanups in between would be simply unmanageable and unreviewable.

Herald added a project: Restricted Project.Mar 7 2022, 10:37 AM

ping

Knowing whether I'm going to be able to land incremental progress here is blocking a lot of work. I'd really appreciate either an LGTM or being told this project is being rejected. I'll accept either answer, but the lack of response is the worst possibility.

Reading the details in your post and looking at the SLPVectorizer code, I think there's potential for the pass to benefit from using MemorySSA. However, I do not understand enough of the SLPVectorizer mechanics to guess whether this will work out or not.
In this context, I think it makes sense to have incremental changes while keeping the building and updating of MSSA off by default; as you mentioned in the post, all changes can (and should) be reverted if a later decision is to pursue something else.

@nikic Thoughts?

Regarding this patch, seeing MSSA used in a dependent patch would help.
Is there a benefit from the existence of an insertAccesses()/insertDefs()/insertUses() API in MSSA (i.e. bulk add/update, perhaps limited to within the same BB)? Does doing lazy updates during the vectorizeTree() call make sense, or will MSSA need to be queried?
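
Purely hypothetically, the shape of the bulk API being floated here might be something like the following - no such method exists in MemorySSAUpdater today:

    // Hypothetical bulk update, not a real MemorySSAUpdater method:
    void insertDefsInBlock(ArrayRef<MemoryDef *> NewDefs, BasicBlock *BB);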

Reading the details in your post and looking at the SLPVectorizer code, I think there's potential for the pass to benefit from using MemorySSA. However, I do not understand enough of the SLPVectorizer mechanics to guess whether this will work out or not.
In this context, I think it makes sense to have incremental changes while keeping the building and updating of MSSA off by default; as you mentioned in the post, all changes can (and should) be reverted if a later decision is to pursue something else.

@nikic Thoughts?

Was this meant to be an LGTM? It sort of sounds like it, but you never say that explicitly.

Regarding this patch, seeing MSSA used in a dependent patch would help.
Is there a benefit from the existence of an insertAccesses()/insertDefs()/insertUses() API in MSSA (i.e. bulk add/update, perhaps limited to within the same BB)? Does doing lazy updates during the vectorizeTree() call make sense, or will MSSA need to be queried?

I explicitly don't want to try to answer this until future patches. That's the whole point of being incremental; we can discuss each one in the appropriate context, and I can avoid repeating myself and speculating about hypothetical situations which don't match the code structure as it evolves.

nikic accepted this revision.Mar 14 2022, 2:05 PM

This looks technically correct to me, so I'll accept this in the interest of exploration.

However, I am still not convinced that MemorySSA is the solution to this problem for reasons already outlined, so my baseline expectation is that we will opt to not enable this code in the end -- let's hope I'm wrong on that ;) The lack of a publishable reproducer is a really big problem here, because this leaves no room to explore alternatives at all.

This revision is now accepted and ready to land.Mar 14 2022, 2:05 PM
This revision was landed with ongoing or failed builds.Mar 15 2022, 4:38 PM
This revision was automatically updated to reflect the committed changes.