The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses whose stride is a runtime value that may be 1, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example:
for (int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++) {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[i*l + k] * B[k*m + j];
    C[i*m + j] = sum;
  }
}
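Here the inner loop's B[k*m + j] access has a runtime stride of m, so the vectorizer versions the loop on m == 1 and falls back to the original loop otherwise. A minimal hand-written sketch of that versioning (illustrative only, not the actual generated IR; dotColumn is a made-up name):

```cpp
#include <vector>

// Illustrative sketch of stride speculation: a runtime check that the
// symbolic stride equals 1 selects a contiguous (vectorizable) loop body;
// otherwise the original strided, gather-like loop runs.
int dotColumn(const std::vector<int> &B, int m, int j, int l) {
  int sum = 0;
  if (m == 1) {
    // Speculated version: stride known to be 1, so the accesses are
    // contiguous and this loop is safe to vectorize.
    for (int k = 0; k < l; ++k)
      sum += B[k + j];
  } else {
    // Fallback: original strided access pattern.
    for (int k = 0; k < l; ++k)
      sum += B[k * m + j];
  }
  return sum;
}
```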
However, if we have access to efficient vector gather loads, they are a much better option than vectorizing with runtime checks for a stride of 1.
This adds a check into the place that appears to dictate this, LAA, to skip the stride speculation if a MaskedGather or MaskedScatter would be legal for the access.
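A minimal sketch of the intent, with a stubbed-out TTI: the names mirror the real TargetTransformInfo hooks (isLegalMaskedGather / isLegalMaskedScatter), but the signatures are simplified for illustration and shouldSpeculateUnitStride is an invented helper name:

```cpp
// Stub standing in for llvm::TargetTransformInfo; only the two legality
// hooks relevant here are modeled, with simplified signatures.
struct TargetTransformInfoStub {
  bool GatherLegal = false;
  bool ScatterLegal = false;
  bool isLegalMaskedGather() const { return GatherLegal; }
  bool isLegalMaskedScatter() const { return ScatterLegal; }
};

// Sketch of the proposed filter in LAA's stride collection: if the target
// can handle the access natively as a gather (load) or scatter (store),
// skip speculating that the stride is 1, avoiding the runtime check.
bool shouldSpeculateUnitStride(const TargetTransformInfoStub *TTI,
                               bool IsLoad) {
  if (TTI && (IsLoad ? TTI->isLegalMaskedGather()
                     : TTI->isLegalMaskedScatter()))
    return false; // Prefer the native gather/scatter; no runtime check.
  return true;    // No efficient gather/scatter: version on stride == 1.
}
```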
Should the existing Value *Ptr = getLoadStorePointerOperand(MemAccess); if (!Ptr) return; part be separated from the new gather/scatter consideration?
It would have been nice to reuse LV's isLegalGatherOrScatter(Value *V), or perhaps refactor it so this becomes if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) return;?
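One sketch of what that refactor could look like, assuming isLegalGatherOrScatter were hoisted onto TTI (today that helper lives in the LoopVectorizer, not TTI, so this is hypothetical; TTIStub and skipStrideSpeculation are invented names):

```cpp
// Hypothetical TTI with isLegalGatherOrScatter hoisted onto it;
// illustration only, not the current LLVM API.
struct TTIStub {
  bool GatherLegal = false, ScatterLegal = false;
  // Mirrors LV's isLegalGatherOrScatter(Value *V): loads map to
  // gathers, stores to scatters.
  bool isLegalGatherOrScatter(bool IsLoad) const {
    return IsLoad ? GatherLegal : ScatterLegal;
  }
};

// The early exit the review suggests, collapsed to a single call.
bool skipStrideSpeculation(const TTIStub *TTI, bool IsLoad) {
  return TTI && TTI->isLegalGatherOrScatter(IsLoad);
}
```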
It would be worth emitting LLVM_DEBUG messages when strides are filtered out.
(Can check if Ptr is already in SymbolicStrides and exit early; unrelated change.)