This is an archive of the discontinued LLVM Phabricator instance.

Differential D71919

[LoopVectorize] Disable single stride access predicates when gather loads are available.
AcceptedPublic

Authored by dmgreen on Dec 27 2019, 12:54 AM.

Details

Reviewers

Ayal
hsaito
fhahn

Summary

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example:

for(int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++)
  {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[i*l + k] * B[k*m + j];
    C[i*m + j] = sum;
  }
}

However, if we have access to efficient vector gather loads, they should be a much better option than vectorizing with runtime checks for a stride of 1.

This adds a check into the place that appears to be dictating this, LAA, to check if the MaskedGather or MaskedScatter would be legal.
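
To make the trade-off concrete, here is a rough, hand-written sketch of what stride versioning amounts to for the inner k loop of the kernel above. It is not what the vectorizer emits: `dotStrided` and `dotVersioned` are made-up names, and the unit-stride copy is an ordinary scalar loop standing in for the code that would be vectorized with plain vector loads.

```cpp
#include <cassert>
#include <vector>

// Generic inner loop: B is read with a symbolic stride m, as in B[k*m + j].
// With legal gather loads, a vectorizer could handle this form directly.
static int dotStrided(const int *A, const int *B, int l, int m) {
  int sum = 0;
  for (int k = 0; k < l; k++)
    sum += A[k] * B[k * m];
  return sum;
}

// Versioned form: a runtime check "m == 1" guards a unit-stride copy of the
// loop; any other stride falls back to the original loop.
static int dotVersioned(const int *A, const int *B, int l, int m) {
  if (m == 1) {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[k] * B[k]; // contiguous accesses only
    return sum;
  }
  return dotStrided(A, B, l, m); // stride != 1: the fast copy never runs
}
```

Both functions return the same value for every stride; the difference is only in which copy of the loop runs, which is why the check pays off only when m really is 1.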

Diff Detail

Event Timeline

dmgreen created this revision. Dec 27 2019, 12:54 AM

Herald added a project: Restricted Project. Dec 27 2019, 12:54 AM

Herald added subscribers: rogfer01, hiraditya.

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example: [snip]

However, if we have access to efficient vector gather loads, they should be a much better option than vectorizing with runtime checks for a stride of 1.

This adds a check into the place that appears to be dictating this, LAA, to check if the MaskedGather or MaskedScatter would be legal.

OK.

Longer version:

Agreed, this is the place that gathers all symbolic strides for which runtime checks are later added by replaceSymbolicStrideSCEV().

Agreed, a gather or scatter would probably be a better option than a runtime check, certainly if the stride turns out to be other than 1, in which case the runtime check option will execute the original scalar loop.
Note that if the stride does turn out to be 1, a runtime check may be faster: the cost of a vector load/store is typically less than that of a gather/scatter, disregarding the overhead of the runtime check itself. So having a way to "manually" restore original performance for such cases may be useful (in addition to EnableMemAccessVersioning). Always preferring a gather or scatter as suggested should be a good step forward, given the expected complexity of devising a cost-based preference.

Instead of teaching LAI to make such target/cost-based decisions, it would have been better to let this analysis continue to collect *all* symbolic strides potentially subject to runtime checks, and teach the planning/transform to prune/decide which strides to actually specialize; e.g., have LVP::plan() start by calling "CM.setVersioningForStrideDecisions()", analogous to InterleavedAccessInfo::analyzeInterleaving() which collects all potential interleave groups, and CM::setCostBasedWideningDecision() which decides which of the groups to materialize (per VF). However, this requires a fair amount of refactoring; worth a TODO?
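
As a loose illustration of the split proposed above (the analysis collects every candidate, the planner prunes), the following toy sketch may help. All names here are hypothetical and none of them are LLVM APIs: `StrideCandidate` stands in for an entry in SymbolicStrides, and `GatherScatterLegal` stands in for whatever legality query the planner would make.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record for one access with a loop-invariant symbolic stride.
struct StrideCandidate {
  std::string Ptr;         // pointer name, for illustration only
  bool GatherScatterLegal; // stand-in for a target legality query
};

// Planning side: given *all* candidates collected by the analysis, keep only
// those that should still get a "stride == 1" runtime check.
static std::vector<StrideCandidate>
selectStridesToVersion(const std::vector<StrideCandidate> &All,
                       bool PreferGather) {
  std::vector<StrideCandidate> Kept;
  for (const StrideCandidate &C : All)
    if (!(PreferGather && C.GatherScatterLegal))
      Kept.push_back(C); // no usable gather/scatter: keep versioning
  return Kept;
}
```

With this shape the analysis stays target-independent, and the preference for gathers becomes a planner decision that could later be made per VF or replaced by a cost comparison.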

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2287	Separate the existing `Value *Ptr = getLoadStorePointerOperand(MemAccess); if (!Ptr) return;` part from the new gather/scatter consideration? Would have been nice to reuse LV's `isLegalGatherOrScatter(Value *V)`, or perhaps refactor it into `if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) return;`? Worth informing of filtered strides with LLVM_DEBUG messages. (Can check if Ptr is already in SymbolicStrides and exit early; unrelated change.)
llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182	Would indeed be good to have an i16 version retaining current checks (unvectorized behavior), as skx supports gathers of i32 but not for i16; and also the original i32 version with checks for vectorized code.

Thanks for taking a look. I agree, this all feels like a layering violation to me! (I found an existing TODO in LAA saying the same thing). The refactoring does seem like a lot of work though. What can I say, I'm looking forward to a point where VPlan can make some of these decisions more structurally.

We (ARM MVE) don't have gathers/scatters enabled yet, this was just hitting the second thing I tried (matrix multiply). It may be a little while before we have them turned on by default.

llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182	The gather version of the original 32bit code here seems to be quite large for -Os. There are large constant pool entries that the gather needs to access?

dmgreen marked an inline comment as done. Jan 2 2020, 3:17 AM

dmgreen added inline comments.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2287	Yeah, my original version of this patch had isLegalGatherOrScatter passed as a functor from the LoopVectoriser through to LAA. It felt quite ugly though; it seemed a lot simpler to just pass TTI (plus I was looking at the SVE patch set and they had also added TTI to LAA for different reasons). But adding isLegalGatherOrScatter to TTI sounds like a better idea. That will make this cleaner.

dmgreen updated this revision to Diff 235843. Jan 2 2020, 3:19 AM

This looks good to me, adding a couple of minor last nits, thanks!

llvm/include/llvm/Analysis/TargetTransformInfo.h
609	"represent \p V" >> "represent a vectorized version of \p V" ? (This is the original comment, but the original context was LV.)
llvm/lib/Analysis/LoopAccessAnalysis.cpp
2314	May want to add a cl::opt flag to turn this filtering on/off.
2315	Suggest to start the line with "LAA:"
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4612	This is another potential customer of [TTI.]isLegalGatherOrScatter(I). After which CM's isLegalMaskedGather() and isLegalMaskedScatter() become useless and should be removed. This also suggests a TTI.isLegalMaskedLoadOrStore(Value *), btw. But these are independent cleanups.
llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182	The actual size of vectorized loops under -Os does deserve further investigation, in general.

This revision is now accepted and ready to land. Jan 2 2020, 4:23 AM

Update as per the comments. I may leave this to be committed until the MVE codegen is further along (it depends on the first patch at least).

Ayal added inline comments. Jan 7 2020, 2:34 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4615	Thanks! nit: may look better to have instead
	bool isLegalMaskedLoadOrStore = isa<LoadInst>(I) ? isLegalMaskedLoad(Ty, Ptr, Alignment) : isLegalMaskedStore(Ty, Ptr, Alignment);
	return !(isLegalMaskedLoadOrStore || TTI.isLegalGatherOrScatter(I));
	(in anticipation of TTI.isLegalMaskedLoadOrStore())

dmgreen mentioned this in D72387: [LoopVectorize][TTI] Add an isLegalMaskedLoadStore method. NFC. Jan 8 2020, 1:05 AM

dmgreen marked an inline comment as done.

dmgreen added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4615	I've put up the isLegalMaskedLoadStore patch in D72387.

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example:

for(int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++)
  {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[i*l + k] * B[k*m + j];
    C[i*m + j] = sum;
  }
}

Note that a (more boring?) matrix multiply kernel where B is a square matrix, i.e., where stride m is equal to trip count l, will not be specialized for m=1. But this general case may multiply matrix A by a single column matrix B, whose stride m is 1.

Another possible way to prevent such undesired specialization may be with a __builtin_expect/llvm.expect(m>1, 1).

Yeah, thanks. The square case comes up quite a lot.

My intuition is that users are unlikely to call matrix multiply when they are multiplying a matrix and a vector (they would call mat-vect multiply). I may be wrong about how much that happens though. In other cases that are not matrix multiply, the stride of 1 may be more common, but it's hard to tell that without some profiling data.

We here can also just disable the -enable-mem-access-versioning option, which will suffice for our testing in the short term.

We shouldn't make this either/or. Ability to runtime check unit-stride is good, and ability to use gather/scatter is also good. Depending on the target, I see the following situations:

1) don't vectorize the loop: unit-strided code alone isn't profitable, or the loop itself is profitable but not good enough to cover the runtime check cost.
2) vectorize with runtime check, with scalar code to fall back: when gather/scatter code is deemed not profitable.
3) vectorize with runtime check, with gather/scatter code to fall back
4) vectorize with gather/scatter

If we aren't adding proper cost modeling, I think the default action should stay unchanged, i.e., 2), until further study allows us to collectively say another default is better. Looks like ARM wants to go with 4). If so, we need to make this TTI check, and allow LoopVectorize internal flag to override TTI. I don't think we immediately need to have the ability to do 3), but writing it down as TODO would be nice.
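
One possible reading of this proposal, written as a small decision function, is sketched below. Everything in it is hypothetical: the enum names, the `Override` flag, and the boolean inputs are stand-ins for a TTI query and an internal LoopVectorize flag, not existing LLVM interfaces. Situation 3) is deliberately left out, matching the suggestion to record it as a TODO.

```cpp
// The situations from the comment above, minus 3), which stays a TODO.
enum class StrideStrategy {
  NoVectorize,           // 1) not profitable at all
  VersionScalarFallback, // 2) runtime check, scalar loop as fallback
  GatherScatter          // 4) no check, use gather/scatter
};

// Hypothetical internal flag that can override the target's preference.
enum class Override { UseTarget, ForceVersioning, ForceGather };

// Default stays at 2) unless the target asks for 4), or the flag forces it,
// and gather/scatter is actually legal for the access.
constexpr StrideStrategy chooseStrategy(bool Profitable,
                                        bool TargetPrefersGather,
                                        bool GatherLegal, Override Flag) {
  return !Profitable ? StrideStrategy::NoVectorize
         : (GatherLegal &&
            (Flag == Override::ForceGather ||
             (Flag == Override::UseTarget && TargetPrefersGather)))
             ? StrideStrategy::GatherScatter
             : StrideStrategy::VersionScalarFallback;
}
```

A target such as the one discussed here would report `TargetPrefersGather` as true, while other targets keep today's behavior unless the flag is set.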

Matt added a subscriber: Matt. Sep 2 2021, 12:06 PM

Herald added a subscriber: vkmr. Sep 2 2021, 12:06 PM

dmgreen mentioned this in D147336: [IVDescriptors] Add pointer InductionDescriptors with non-constant strides (try 2). Apr 3 2023, 1:05 AM

Revision Contents

Path                                                     Size
llvm/include/llvm/Analysis/LoopAccessAnalysis.h          10 lines
llvm/include/llvm/Analysis/TargetTransformInfo.h         4 lines
llvm/lib/Analysis/LoopAccessAnalysis.cpp                 39 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                11 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp          17 lines
llvm/test/Transforms/LoopVectorize/ARM/mve-mat-mul.ll    48 lines
llvm/test/Transforms/LoopVectorize/X86/optsize.ll        37 lines
llvm/unittests/Transforms/Vectorize/VPlanSlpTest.cpp     2 lines

Diff 235843

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

Show First 20 Lines • Show All 510 Lines • ▼ Show 20 Lines
 /// ScalarEvolution, we will generate run-time checks by emitting a
 /// SCEVUnionPredicate.
 ///
 /// Checks for both memory dependences and the SCEV predicates contained in the
 /// PSE must be emitted in order for the results of this analysis to be valid.
 class LoopAccessInfo {
 public:
   LoopAccessInfo(Loop *L, ScalarEvolution *SE, const TargetLibraryInfo *TLI,
-                 AliasAnalysis *AA, DominatorTree *DT, LoopInfo *LI);
+                 const TargetTransformInfo *TTI, AliasAnalysis *AA,
+                 DominatorTree *DT, LoopInfo *LI);

   /// Return true we can analyze the memory accesses in the loop and there are
   /// no memory dependence cycles.
   bool canVectorizeMemory() const { return CanVecMem; }

   /// Return true if there is a convergent operation in the loop. There may
   /// still be reported runtime pointer checks that would be required, but it is
   /// not legal to insert them.
▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	public:
   /// should be re-written (and therefore simplified) according to PSE.
   /// A user of LoopAccessAnalysis will need to emit the runtime checks
   /// associated with this predicate.
   const PredicatedScalarEvolution &getPSE() const { return *PSE; }

 private:
   /// Analyze the loop.
   void analyzeLoop(AliasAnalysis *AA, LoopInfo *LI,
-                   const TargetLibraryInfo *TLI, DominatorTree *DT);
+                   const TargetLibraryInfo *TLI, const TargetTransformInfo *TTI,
+                   DominatorTree *DT);

   /// Check if the structure of the loop allows it to be analyzed by this
   /// pass.
   bool canAnalyzeLoop();

   /// Save the analysis remark.
   ///
   /// LAA does not directly emits the remarks. Instead it stores it which the
   /// client can retrieve and presents as its own analysis
   /// (e.g. -Rpass-analysis=loop-vectorize).
   OptimizationRemarkAnalysis &recordAnalysis(StringRef RemarkName,
                                              Instruction *Instr = nullptr);

   /// Collect memory access with loop invariant strides.
   ///
   /// Looks for accesses like "a[i * StrideA]" where "StrideA" is loop
   /// invariant.
-  void collectStridedAccess(Value *LoadOrStoreInst);
+  void collectStridedAccess(Value *LoadOrStoreInst,
+                            const TargetTransformInfo *TTI);

   std::unique_ptr<PredicatedScalarEvolution> PSE;

   /// We need to check that all of the pointers in this list are disjoint
   /// at runtime. Using std::unique_ptr to make using move ctor simpler.
   std::unique_ptr<RuntimePointerChecking> PtrRtChecking;

   /// the Memory Dependence Checker which can determine the
▲ Show 20 Lines • Show All 107 Lines • ▼ Show 20 Lines

 private:
   /// The cache.
   DenseMap<Loop *, std::unique_ptr<LoopAccessInfo>> LoopAccessInfoMap;

   // The used analysis passes.
   ScalarEvolution *SE = nullptr;
   const TargetLibraryInfo *TLI = nullptr;
+  const TargetTransformInfo *TTI = nullptr;
   AliasAnalysis *AA = nullptr;
   DominatorTree *DT = nullptr;
   LoopInfo *LI = nullptr;
 };

 /// This analysis provides dependence information for the memory
 /// accesses of a loop.
 ///
Show All 28 Lines

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 600 Lines • ▼ Show 20 Lines	public:
   /// Return true if the target supports masked gather.
   bool isLegalMaskedGather(Type *DataType, MaybeAlign Alignment) const;

   /// Return true if the target supports masked compress store.
   bool isLegalMaskedCompressStore(Type *DataType) const;
   /// Return true if the target supports masked expand load.
   bool isLegalMaskedExpandLoad(Type *DataType) const;

+  /// Returns true if the target machine can represent \p V as a masked gather
+  /// or scatter operation.
+  bool isLegalGatherOrScatter(Value *V) const;
+
   /// Return true if the target has a unified operation to calculate division
   /// and remainder. If so, the additional implicit multiplication and
   /// subtraction required to calculate a remainder from division are free. This
   /// can enable more aggressive transformations for division and remainder than
   /// would typically be allowed using throughput or size cost models.
   bool hasDivRemOp(Type *DataType, bool IsSigned) const;

   /// Return true if the given instruction (assumed to be a memory access
▲ Show 20 Lines • Show All 1,369 Lines • Show Last 20 Lines

llvm/lib/Analysis/LoopAccessAnalysis.cpp

Show First 20 Lines • Show All 1,783 Lines • ▼ Show 20 Lines	if (ExitCount == PSE->getSE()->getCouldNotCompute()) {
     return false;
   }

   return true;
 }

 void LoopAccessInfo::analyzeLoop(AliasAnalysis *AA, LoopInfo *LI,
                                  const TargetLibraryInfo *TLI,
+                                 const TargetTransformInfo *TTI,
                                  DominatorTree *DT) {
   typedef SmallPtrSet<Value*, 16> ValueSet;

   // Holds the Load and Store instructions.
   SmallVector<LoadInst *, 16> Loads;
   SmallVector<StoreInst *, 16> Stores;

   // Holds all the different accesses in the loop.
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	for (Instruction &I : *BB) {
           LLVM_DEBUG(dbgs() << "LAA: Found a non-simple load.\n");
           HasComplexMemInst = true;
           continue;
         }
         NumLoads++;
         Loads.push_back(Ld);
         DepChecker->addAccess(Ld);
         if (EnableMemAccessVersioning)
-          collectStridedAccess(Ld);
+          collectStridedAccess(Ld, TTI);
         continue;
       }

       // Save 'store' instructions. Abort if other instructions write to memory.
       if (I.mayWriteToMemory()) {
         auto *St = dyn_cast<StoreInst>(&I);
         if (!St) {
           recordAnalysis("CantVectorizeInstruction", St)
               << "instruction cannot be vectorized";
           HasComplexMemInst = true;
           continue;
         }
         if (!St->isSimple() && !IsAnnotatedParallel) {
           recordAnalysis("NonSimpleStore", St)
               << "write with atomic ordering or volatile write";
           LLVM_DEBUG(dbgs() << "LAA: Found a non-simple store.\n");
           HasComplexMemInst = true;
           continue;
         }
         NumStores++;
         Stores.push_back(St);
         DepChecker->addAccess(St);
         if (EnableMemAccessVersioning)
-          collectStridedAccess(St);
+          collectStridedAccess(St, TTI);
       }
     } // Next instr.
   } // Next block.

   if (HasComplexMemInst) {
     CanVecMem = false;
     return;
   }
▲ Show 20 Lines • Show All 372 Lines • ▼ Show 20 Lines
 std::pair<Instruction *, Instruction *>
 LoopAccessInfo::addRuntimeChecks(Instruction *Loc) const {
   if (!PtrRtChecking->Need)
     return std::make_pair(nullptr, nullptr);

   return addRuntimeChecks(Loc, PtrRtChecking->getChecks());
 }

-void LoopAccessInfo::collectStridedAccess(Value *MemAccess) {
+void LoopAccessInfo::collectStridedAccess(Value *MemAccess,
+                                          const TargetTransformInfo *TTI) {
   Value *Ptr = nullptr;
   if (LoadInst *LI = dyn_cast<LoadInst>(MemAccess))
     Ptr = LI->getPointerOperand();
   else if (StoreInst *SI = dyn_cast<StoreInst>(MemAccess))
     Ptr = SI->getPointerOperand();
   else
     return;

   Value *Stride = getStrideFromPointer(Ptr, PSE->getSE(), TheLoop);
   if (!Stride)
     return;

   LLVM_DEBUG(dbgs() << "LAA: Found a strided access that is a candidate for "
                        "versioning:");
   LLVM_DEBUG(dbgs() << " Ptr: " << *Ptr << " Stride: " << *Stride << "\n");

-  // Avoid adding the "Stride == 1" predicate when we know that
-  // Stride >= Trip-Count. Such a predicate will effectively optimize a single
-  // or zero iteration loop, as Trip-Count <= Stride == 1.
+  // If this is load/store could equally be represented as a gather/scatter, as
+  // opposed to adding a unit stride runtime check, the gather/scatter is likely
+  // to be useful in more cases (even if it might be slower than a sequential
+  // load).
   //
   // TODO: We are currently not making a very informed decision on when it is
   // beneficial to apply stride versioning. It might make more sense that the
   // users of this analysis (such as the vectorizer) will trigger it, based on
   // their specific cost considerations; For example, in cases where stride
   // versioning does not help resolving memory accesses/dependences, the
   // vectorizer should evaluate the cost of the runtime test, and the benefit
   // of various possible stride specializations, considering the alternatives
   // of using gather/scatters (if available).
+  if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) {
+    LLVM_DEBUG(dbgs() << " But leaving as a gather/scatter instead.\n");
+    return;
+  }
+
+  // Avoid adding the "Stride == 1" predicate when we know that
+  // Stride >= Trip-Count. Such a predicate will effectively optimize a single
+  // or zero iteration loop, as Trip-Count <= Stride == 1.

   const SCEV *StrideExpr = PSE->getSCEV(Stride);
   const SCEV *BETakenCount = PSE->getBackedgeTakenCount();

   // Match the types so we can compare the stride and the BETakenCount.
   // The Stride can be positive/negative, so we sign extend Stride;
   // The backedgeTakenCount is non-negative, so we zero extend BETakenCount.
   const DataLayout &DL = TheLoop->getHeader()->getModule()->getDataLayout();
Show All 19 Lines	void LoopAccessInfo::collectStridedAccess(Value *MemAccess,
   }
   LLVM_DEBUG(dbgs() << "LAA: Found a strided access that we can version.");

   SymbolicStrides[Ptr] = Stride;
   StrideSet.insert(Stride);
 }

 LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,
-                               const TargetLibraryInfo *TLI, AliasAnalysis *AA,
-                               DominatorTree *DT, LoopInfo *LI)
+                               const TargetLibraryInfo *TLI,
+                               const TargetTransformInfo *TTI,
+                               AliasAnalysis *AA, DominatorTree *DT,
+                               LoopInfo *LI)
     : PSE(std::make_unique<PredicatedScalarEvolution>(*SE, *L)),
       PtrRtChecking(std::make_unique<RuntimePointerChecking>(SE)),
       DepChecker(std::make_unique<MemoryDepChecker>(*PSE, L)), TheLoop(L),
       NumLoads(0), NumStores(0), MaxSafeDepDistBytes(-1), CanVecMem(false),
       HasConvergentOp(false),
       HasDependenceInvolvingLoopInvariantAddress(false) {
   if (canAnalyzeLoop())
-    analyzeLoop(AA, LI, TLI, DT);
+    analyzeLoop(AA, LI, TLI, TTI, DT);
 }

 void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {
   if (CanVecMem) {
     OS.indent(Depth) << "Memory dependences are safe";
     if (MaxSafeDepDistBytes != -1ULL)
       OS << " with a maximum dependence distance of " << MaxSafeDepDistBytes
          << " bytes";
Show All 37 Lines
 LoopAccessLegacyAnalysis::LoopAccessLegacyAnalysis() : FunctionPass(ID) {
   initializeLoopAccessLegacyAnalysisPass(*PassRegistry::getPassRegistry());
 }

 const LoopAccessInfo &LoopAccessLegacyAnalysis::getInfo(Loop *L) {
   auto &LAI = LoopAccessInfoMap[L];

   if (!LAI)
-    LAI = std::make_unique<LoopAccessInfo>(L, SE, TLI, AA, DT, LI);
+    LAI = std::make_unique<LoopAccessInfo>(L, SE, TLI, TTI, AA, DT, LI);

   return *LAI.get();
 }

 void LoopAccessLegacyAnalysis::print(raw_ostream &OS, const Module *M) const {
   LoopAccessLegacyAnalysis &LAA = *const_cast<LoopAccessLegacyAnalysis *>(this);

   for (Loop *TopLevelLoop : *LI)
     for (Loop *L : depth_first(TopLevelLoop)) {
       OS.indent(2) << L->getHeader()->getName() << ":\n";
       auto &LAI = LAA.getInfo(L);
       LAI.print(OS, 4);
     }
 }

 bool LoopAccessLegacyAnalysis::runOnFunction(Function &F) {
   SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
   auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
   TLI = TLIP ? &TLIP->getTLI(F) : nullptr;
+  auto *TTIP = getAnalysisIfAvailable<TargetTransformInfoWrapperPass>();
+  TTI = TTIP ? &TTIP->getTTI(F) : nullptr;
   AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
   DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
   LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();

   return false;
 }

 void LoopAccessLegacyAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
Show All 15 Lines
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_END(LoopAccessLegacyAnalysis, LAA_NAME, laa_name, false, true)		INITIALIZE_PASS_END(LoopAccessLegacyAnalysis, LAA_NAME, laa_name, false, true)

AnalysisKey LoopAccessAnalysis::Key;		AnalysisKey LoopAccessAnalysis::Key;

LoopAccessInfo LoopAccessAnalysis::run(Loop &L, LoopAnalysisManager &AM,		LoopAccessInfo LoopAccessAnalysis::run(Loop &L, LoopAnalysisManager &AM,
LoopStandardAnalysisResults &AR) {		LoopStandardAnalysisResults &AR) {
return LoopAccessInfo(&L, &AR.SE, &AR.TLI, &AR.AA, &AR.DT, &AR.LI);		return LoopAccessInfo(&L, &AR.SE, &AR.TLI, &AR.TTI, &AR.AA, &AR.DT, &AR.LI);
}		}

namespace llvm {		namespace llvm {

Pass *createLAAPass() {		Pass *createLAAPass() {
return new LoopAccessLegacyAnalysis();		return new LoopAccessLegacyAnalysis();
}		}

} // end namespace llvm		} // end namespace llvm

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 327 Lines • ▼ Show 20 Lines
	bool TargetTransformInfo::isLegalMaskedCompressStore(Type *DataType) const {			bool TargetTransformInfo::isLegalMaskedCompressStore(Type *DataType) const {
	return TTIImpl->isLegalMaskedCompressStore(DataType);			return TTIImpl->isLegalMaskedCompressStore(DataType);
	}			}

	bool TargetTransformInfo::isLegalMaskedExpandLoad(Type *DataType) const {			bool TargetTransformInfo::isLegalMaskedExpandLoad(Type *DataType) const {
	return TTIImpl->isLegalMaskedExpandLoad(DataType);			return TTIImpl->isLegalMaskedExpandLoad(DataType);
	}			}

				bool TargetTransformInfo::isLegalGatherOrScatter(Value *V) const {
				LoadInst *LI = dyn_cast<LoadInst>(V);
				StoreInst *SI = dyn_cast<StoreInst>(V);
				if (!LI && !SI)
				return false;
				Type *Ty = LI ? LI->getType() : SI->getValueOperand()->getType();
				MaybeAlign Align = getLoadStoreAlignment(V);
				return (LI && isLegalMaskedGather(Ty, Align)) ||
				(SI && isLegalMaskedScatter(Ty, Align));
				}

	bool TargetTransformInfo::hasDivRemOp(Type *DataType, bool IsSigned) const {			bool TargetTransformInfo::hasDivRemOp(Type *DataType, bool IsSigned) const {
	return TTIImpl->hasDivRemOp(DataType, IsSigned);			return TTIImpl->hasDivRemOp(DataType, IsSigned);
	}			}

	bool TargetTransformInfo::hasVolatileVariant(Instruction *I,			bool TargetTransformInfo::hasVolatileVariant(Instruction *I,
	unsigned AddrSpace) const {			unsigned AddrSpace) const {
	return TTIImpl->hasVolatileVariant(I, AddrSpace);			return TTIImpl->hasVolatileVariant(I, AddrSpace);
	}			}
	▲ Show 20 Lines • Show All 1,053 Lines • Show Last 20 Lines
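The decision made by the new TargetTransformInfo::isLegalGatherOrScatter helper above can be sketched outside LLVM as follows. This is only an illustration, not the LLVM code: AccessKind, ElemType, MemAccess and MockTarget are hypothetical stand-ins, the alignment is simplified to a plain unsigned (0 meaning unknown) rather than MaybeAlign, and the "32-bit elements only" rule is an assumed target policy chosen to echo the i16/i32 discussion in the review.

```cpp
#include <cassert>

// Hypothetical stand-ins for LLVM's instruction kinds and element types.
enum class AccessKind { Load, Store, Other };
enum class ElemType { I16, I32 };

// Hypothetical description of one memory access.
struct MemAccess {
  AccessKind Kind;
  ElemType Ty;
  unsigned Align; // 0 means "unknown"; a simplification of MaybeAlign
};

// Hypothetical target that only has gathers/scatters for 32-bit elements.
struct MockTarget {
  bool isLegalMaskedGather(ElemType Ty, unsigned /*Align*/) const {
    return Ty == ElemType::I32;
  }
  bool isLegalMaskedScatter(ElemType Ty, unsigned /*Align*/) const {
    return Ty == ElemType::I32;
  }

  // Same shape as the helper in the diff: anything that is neither a load
  // nor a store is rejected; loads ask about gathers, stores about scatters.
  bool isLegalGatherOrScatter(const MemAccess &A) const {
    bool IsLoad = A.Kind == AccessKind::Load;
    bool IsStore = A.Kind == AccessKind::Store;
    if (!IsLoad && !IsStore)
      return false;
    return (IsLoad && isLegalMaskedGather(A.Ty, A.Align)) ||
           (IsStore && isLegalMaskedScatter(A.Ty, A.Align));
  }
};
```

With this mock, an i32 load or store is reported as gather/scatter-legal, while an i16 access or a non-memory instruction is not.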

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 1,217 Lines • ▼ Show 20 Lines	public:
}		}

/// Returns true if the target machine supports masked gather operation		/// Returns true if the target machine supports masked gather operation
/// for the given \p DataType.		/// for the given \p DataType.
bool isLegalMaskedGather(Type *DataType, MaybeAlign Alignment) {		bool isLegalMaskedGather(Type *DataType, MaybeAlign Alignment) {
return TTI.isLegalMaskedGather(DataType, Alignment);		return TTI.isLegalMaskedGather(DataType, Alignment);
}		}

/// Returns true if the target machine can represent \p V as a masked gather
/// or scatter operation.
bool isLegalGatherOrScatter(Value *V) {
bool LI = isa<LoadInst>(V);
bool SI = isa<StoreInst>(V);
if (!LI && !SI)
return false;
auto *Ty = getMemInstValueType(V);
MaybeAlign Align = getLoadStoreAlignment(V);
return (LI && isLegalMaskedGather(Ty, Align)) ||
(SI && isLegalMaskedScatter(Ty, Align));
}

/// Returns true if \p I is an instruction that will be scalarized with		/// Returns true if \p I is an instruction that will be scalarized with
/// predication. Such instructions include conditional stores and		/// predication. Such instructions include conditional stores and
/// instructions that may divide by zero.		/// instructions that may divide by zero.
/// If a non-zero VF has been calculated, we check if I will be scalarized		/// If a non-zero VF has been calculated, we check if I will be scalarized
/// predication for that VF.		/// predication for that VF.
bool isScalarWithPredication(Instruction *I, unsigned VF = 1);		bool isScalarWithPredication(Instruction *I, unsigned VF = 1);

// Returns true if \p I is an instruction that will be predicated either		// Returns true if \p I is an instruction that will be predicated either
▲ Show 20 Lines • Show All 3,370 Lines • ▼ Show 20 Lines	if (VF > 1) {
assert(WideningDecision != CM_Unknown &&		assert(WideningDecision != CM_Unknown &&
"Widening decision should be ready at this moment");		"Widening decision should be ready at this moment");
return WideningDecision == CM_Scalarize;		return WideningDecision == CM_Scalarize;
}		}
const MaybeAlign Alignment = getLoadStoreAlignment(I);		const MaybeAlign Alignment = getLoadStoreAlignment(I);
return isa<LoadInst>(I) ? !(isLegalMaskedLoad(Ty, Ptr, Alignment) ||		return isa<LoadInst>(I) ? !(isLegalMaskedLoad(Ty, Ptr, Alignment) ||
isLegalMaskedGather(Ty, Alignment))		isLegalMaskedGather(Ty, Alignment))
: !(isLegalMaskedStore(Ty, Ptr, Alignment) ||		: !(isLegalMaskedStore(Ty, Ptr, Alignment) ||
isLegalMaskedScatter(Ty, Alignment));		isLegalMaskedScatter(Ty, Alignment));
		Ayal (inline comment, Done): This is another potential customer of [TTI.]isLegalGatherOrScatter(I), after which CM's isLegalMaskedGather() and isLegalMaskedScatter() become useless and should be removed. This also suggests a TTI.isLegalMaskedLoadOrStore(Value *), btw. But these are independent cleanups.
}		}
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
		Ayal (inline comment, Not Done): Thanks! nit: may look better to have instead
		    bool isLegalMaskedLoadOrStore = isa<LoadInst>(I) ? isLegalMaskedLoad(Ty, Ptr, Alignment) : isLegalMaskedStore(Ty, Ptr, Alignment);
		    return !(isLegalMaskedLoadOrStore || TTI.isLegalGatherOrScatter(I));
		(in anticipation of TTI.isLegalMaskedLoadOrStore())
		dmgreen (author, inline comment, Done): I've put up the isLegalMaskedLoadStore patch in D72387.
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
return mayDivideByZero(*I);		return mayDivideByZero(*I);
}		}
return false;		return false;
}		}
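The restructuring suggested in the inline comment above is a pure refactor. A minimal standalone sketch, assuming the four legality queries can be modelled as plain booleans, shows that the two forms agree. The Legality struct and both function names are hypothetical and do not exist in LLVM.

```cpp
#include <cassert>

// Hypothetical flags standing in for the legality queries on one access.
struct Legality {
  bool MaskedLoad;
  bool MaskedStore;
  bool Gather;
  bool Scatter;
};

// Form shown in the diff: one negated disjunction per access kind.
bool scalarWithPredicationOriginal(bool IsLoad, const Legality &L) {
  return IsLoad ? !(L.MaskedLoad || L.Gather)
                : !(L.MaskedStore || L.Scatter);
}

// Form suggested in review: pick the masked load/store answer first, then
// combine it with a single gather-or-scatter query.
bool scalarWithPredicationSuggested(bool IsLoad, const Legality &L) {
  bool IsLegalMaskedLoadOrStore = IsLoad ? L.MaskedLoad : L.MaskedStore;
  bool IsLegalGatherOrScatter = IsLoad ? L.Gather : L.Scatter;
  return !(IsLegalMaskedLoadOrStore || IsLegalGatherOrScatter);
}
```

Both functions return the same value for every combination of inputs, so the suggested form only changes readability and prepares for a combined TTI query.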

bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(Instruction *I,		bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(Instruction *I,
▲ Show 20 Lines • Show All 528 Lines • ▼ Show 20 Lines	for (Instruction &I : BB->instructionsWithoutDebug()) {
//		//
// FIXME: The check here attempts to predict whether a load or store will		// FIXME: The check here attempts to predict whether a load or store will
// be vectorized. We only know this for certain after a VF has		// be vectorized. We only know this for certain after a VF has
// been selected. Here, we assume that if an access can be		// been selected. Here, we assume that if an access can be
// vectorized, it will be. We should also look at extending this		// vectorized, it will be. We should also look at extending this
// optimization to non-pointer types.		// optimization to non-pointer types.
//		//
if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&		if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&
!isAccessInterleaved(&I) && !isLegalGatherOrScatter(&I))		!isAccessInterleaved(&I) && !TTI.isLegalGatherOrScatter(&I))
continue;		continue;

MinWidth = std::min(MinWidth,		MinWidth = std::min(MinWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
MaxWidth = std::max(MaxWidth,		MaxWidth = std::max(MaxWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
}		}
}		}
▲ Show 20 Lines • Show All 872 Lines • ▼ Show 20 Lines	for (Instruction &I : *BB) {
continue;		continue;

NumAccesses = Group->getNumMembers();		NumAccesses = Group->getNumMembers();
if (interleavedAccessCanBeWidened(&I, VF))		if (interleavedAccessCanBeWidened(&I, VF))
InterleaveCost = getInterleaveGroupCost(&I, VF);		InterleaveCost = getInterleaveGroupCost(&I, VF);
}		}

unsigned GatherScatterCost =		unsigned GatherScatterCost =
isLegalGatherOrScatter(&I)		TTI.isLegalGatherOrScatter(&I)
? getGatherScatterCost(&I, VF) * NumAccesses		? getGatherScatterCost(&I, VF) * NumAccesses
: std::numeric_limits<unsigned>::max();		: std::numeric_limits<unsigned>::max();

unsigned ScalarizationCost =		unsigned ScalarizationCost =
getMemInstScalarizationCost(&I, VF) * NumAccesses;		getMemInstScalarizationCost(&I, VF) * NumAccesses;

// Choose better solution for the current VF,		// Choose better solution for the current VF,
// write down this decision and use it during vectorization.		// write down this decision and use it during vectorization.
▲ Show 20 Lines • Show All 1,922 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/ARM/mve-mat-mul.ll

	Show All 21 Lines
	; CHECK-NEXT: br label [[FOR_COND8_PREHEADER_US_US:%.*]]			; CHECK-NEXT: br label [[FOR_COND8_PREHEADER_US_US:%.*]]
	; CHECK: for.cond4.for.cond.cleanup6_crit_edge.us:			; CHECK: for.cond4.for.cond.cleanup6_crit_edge.us:
	; CHECK-NEXT: [[INC24_US]] = add nuw nsw i32 [[I_054_US]], 1			; CHECK-NEXT: [[INC24_US]] = add nuw nsw i32 [[I_054_US]], 1
	; CHECK-NEXT: [[EXITCOND86:%.*]] = icmp eq i32 [[INC24_US]], [[N]]			; CHECK-NEXT: [[EXITCOND86:%.*]] = icmp eq i32 [[INC24_US]], [[N]]
	; CHECK-NEXT: br i1 [[EXITCOND86]], label [[FOR_END25_LOOPEXIT:%.*]], label [[FOR_COND8_PREHEADER_US_US_PREHEADER]]			; CHECK-NEXT: br i1 [[EXITCOND86]], label [[FOR_END25_LOOPEXIT:%.*]], label [[FOR_COND8_PREHEADER_US_US_PREHEADER]]
	; CHECK: for.cond8.preheader.us.us:			; CHECK: for.cond8.preheader.us.us:
	; CHECK-NEXT: [[J_051_US_US:%.*]] = phi i32 [ [[INC21_US_US:%.*]], [[FOR_COND8_FOR_COND_CLEANUP10_CRIT_EDGE_US_US:%.*]] ], [ 0, [[FOR_COND8_PREHEADER_US_US_PREHEADER]] ]			; CHECK-NEXT: [[J_051_US_US:%.*]] = phi i32 [ [[INC21_US_US:%.*]], [[FOR_COND8_FOR_COND_CLEANUP10_CRIT_EDGE_US_US:%.*]] ], [ 0, [[FOR_COND8_PREHEADER_US_US_PREHEADER]] ]
	; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[L]], 4			; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[L]], 4
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.scevcheck:
	; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[M]], 1
	; CHECK-NEXT: [[TMP0:%.*]] = or i1 false, [[IDENT_CHECK]]
	; CHECK-NEXT: br i1 [[TMP0]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[L]], 4			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[L]], 4
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[L]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[L]], [[N_MOD_VF]]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[M]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i32> undef, i32 [[J_051_US_US]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT1]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 1
	; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>			; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[INDEX]], 2
	; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 0			; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[INDEX]], 3
	; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[TMP1]], [[MUL_US]]			; CHECK-NEXT: [[TMP4:%.*]] = add i32 [[TMP0]], [[MUL_US]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[TMP2]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[TMP4]]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[TMP3]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP5]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP4]] to <4 x i32>*			; CHECK-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to <4 x i32>*
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP5]], align 4			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP7]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = mul i32 [[TMP1]], [[M]]			; CHECK-NEXT: [[TMP8:%.*]] = mul <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
	; CHECK-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], [[J_051_US_US]]			; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[BROADCAST_SPLAT2]]
	; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i32 [[TMP7]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], <4 x i32> [[TMP9]]
	; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP8]], i32 0			; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP10]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)
	; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32* [[TMP9]] to <4 x i32>*			; CHECK-NEXT: [[TMP11:%.*]] = mul nsw <4 x i32> [[WIDE_MASKED_GATHER]], [[WIDE_LOAD]]
	; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP10]], align 4			; CHECK-NEXT: [[TMP12]] = add <4 x i32> [[TMP11]], [[VEC_PHI]]
	; CHECK-NEXT: [[TMP11:%.*]] = mul nsw <4 x i32> [[WIDE_LOAD1]], [[WIDE_LOAD]]
	; CHECK-NEXT: [[TMP12]] = add nsw <4 x i32> [[TMP11]], [[VEC_PHI]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>
	; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0			; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])			; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[L]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[L]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND8_FOR_COND_CLEANUP10_CRIT_EDGE_US_US]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND8_FOR_COND_CLEANUP10_CRIT_EDGE_US_US]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_COND8_PREHEADER_US_US]] ], [ 0, [[VECTOR_SCEVCHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_COND8_PREHEADER_US_US]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_COND8_PREHEADER_US_US]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_COND8_PREHEADER_US_US]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: br label [[FOR_BODY11_US_US:%.*]]			; CHECK-NEXT: br label [[FOR_BODY11_US_US:%.*]]
	; CHECK: for.cond8.for.cond.cleanup10_crit_edge.us.us:			; CHECK: for.cond8.for.cond.cleanup10_crit_edge.us.us:
	; CHECK-NEXT: [[ADD16_US_US_LCSSA:%.*]] = phi i32 [ [[ADD16_US_US:%.*]], [[FOR_BODY11_US_US]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[ADD16_US_US_LCSSA:%.*]] = phi i32 [ [[ADD16_US_US:%.*]], [[FOR_BODY11_US_US]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: [[ADD18_US_US:%.*]] = add i32 [[J_051_US_US]], [[MUL17_US]]			; CHECK-NEXT: [[ADD18_US_US:%.*]] = add i32 [[J_051_US_US]], [[MUL17_US]]
	; CHECK-NEXT: [[ARRAYIDX19_US_US:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i32 [[ADD18_US_US]]			; CHECK-NEXT: [[ARRAYIDX19_US_US:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i32 [[ADD18_US_US]]
	; CHECK-NEXT: store i32 [[ADD16_US_US_LCSSA]], i32* [[ARRAYIDX19_US_US]], align 4			; CHECK-NEXT: store i32 [[ADD16_US_US_LCSSA]], i32* [[ARRAYIDX19_US_US]], align 4
	; CHECK-NEXT: [[INC21_US_US]] = add nuw nsw i32 [[J_051_US_US]], 1			; CHECK-NEXT: [[INC21_US_US]] = add nuw nsw i32 [[J_051_US_US]], 1
	; CHECK-NEXT: [[EXITCOND85:%.*]] = icmp eq i32 [[INC21_US_US]], [[M]]			; CHECK-NEXT: [[EXITCOND85:%.*]] = icmp eq i32 [[INC21_US_US]], [[M]]
	▲ Show 20 Lines • Show All 73 Lines • Show Last 20 Lines
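As a reading aid for the updated CHECK lines above, the following is a rough scalar emulation of the new vector body for the matrix-multiply inner loop at VF=4: a consecutive wide load from A, a gather from B at indices k*m + j, a per-lane multiply-accumulate, a reduction in the middle block, and a scalar remainder. It is a sketch only. The function names dotScalar and dotGatherVF4 are hypothetical, and the per-lane inner loop stands in for what would be single vector instructions.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Scalar reference: row i of A (l columns) times column j of B (m columns).
int dotScalar(const std::vector<int> &A, const std::vector<int> &B,
              int i, int j, int l, int m) {
  int sum = 0;
  for (int k = 0; k < l; ++k)
    sum += A[i * l + k] * B[k * m + j];
  return sum;
}

// Emulation of the vectorized loop with VF = 4 and no stride check.
int dotGatherVF4(const std::vector<int> &A, const std::vector<int> &B,
                 int i, int j, int l, int m) {
  const int VF = 4;
  const int nVec = l - (l % VF);         // n.vec: iterations handled by vector code
  std::array<int, 4> acc = {0, 0, 0, 0}; // vec.phi, starts as zeroinitializer
  for (int index = 0; index < nVec; index += VF) {
    for (int lane = 0; lane < VF; ++lane) {
      int k = index + lane;              // one lane of vec.ind
      int wideLoad = A[i * l + k];       // consecutive access: part of one wide load
      int gathered = B[k * m + j];       // stride-m access: one lane of the gather
      acc[lane] += gathered * wideLoad;
    }
  }
  int sum = acc[0] + acc[1] + acc[2] + acc[3]; // middle.block: reduce.add
  for (int k = nVec; k < l; ++k)               // scalar remainder loop
    sum += A[i * l + k] * B[k * m + j];
  return sum;
}
```

Because no assumption is made about m, the emulated vector path is valid for any stride, which is why the vector.scevcheck block and its m != 1 comparison disappear from the expected output.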

llvm/test/Transforms/LoopVectorize/X86/optsize.ll

Show First 20 Lines • Show All 134 Lines • ▼ Show 20 Lines	for.end: ; preds = %for.body
ret i32 0		ret i32 0
}		}

attributes #1 = { minsize }		attributes #1 = { minsize }


; We can't vectorize this one because we version for stride==1; even having TC		; We can't vectorize this one because we version for stride==1; even having TC
; a multiple of VF.		; a multiple of VF.
; CHECK-LABEL: @scev4stride1		; CHECK-LABEL: @scev4stride1_16
; CHECK-NOT: vector.scevcheck		; CHECK-NOT: vector.scevcheck
; CHECK-NOT: vector.body:		; CHECK-NOT: vector.body:
; CHECK-LABEL: for.body:		; CHECK-LABEL: for.body:
; AUTOVF-LABEL: @scev4stride1		; AUTOVF-LABEL: @scev4stride1_16
; AUTOVF-NOT: vector.scevcheck		; AUTOVF-NOT: vector.scevcheck
; AUTOVF-NOT: vector.body:		; AUTOVF-NOT: vector.body:
; AUTOVF-LABEL: for.body:		; AUTOVF-LABEL: for.body:
define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {		define void @scev4stride1_16(i16* noalias nocapture %a, i16* noalias nocapture readonly %b, i32 %k) #2 {
		for.body.preheader:
		br label %for.body

		for.body: ; preds = %for.body.preheader, %for.body
		%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
		%mul = mul nsw i32 %i.07, %k
		%arrayidx = getelementptr inbounds i16, i16* %b, i32 %mul
		%0 = load i16, i16* %arrayidx, align 4
		%arrayidx1 = getelementptr inbounds i16, i16* %a, i32 %i.07
		store i16 %0, i16* %arrayidx1, align 4
		%inc = add nuw nsw i32 %i.07, 1
		%exitcond = icmp eq i32 %inc, 256
		br i1 %exitcond, label %for.end.loopexit, label %for.body

		for.end.loopexit: ; preds = %for.body
		ret void
		}

		; We can vectorize this one because we can instead use gather loads without needing runtime checks.
		; These checks make sure that the scalar remainder loop will not be called.
		; CHECK-LABEL: @scev4stride1_32
		; CHECK-NOT: vector.scevcheck
		; CHECK: br i1 false, label %scalar.ph, label %vector.ph
		; CHECK: %cmp.n = icmp eq i32 256, 256
		; CHECK: br i1 %cmp.n, label %for.end.loopexit, label %scalar.ph
		; AUTOVF-LABEL: @scev4stride1_32
		; AUTOVF-NOT: vector.scevcheck
		; AUTOVF: br i1 false, label %scalar.ph, label %vector.ph
		; AUTOVF: %cmp.n = icmp eq i32 256, 256
		; AUTOVF: br i1 %cmp.n, label %for.end.loopexit, label %scalar.ph
		define void @scev4stride1_32(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {
		Ayal (inline comment, Not Done): Would indeed be good to have an i16 version retaining current checks (unvectorized behavior), as skx supports gathers of i32 but not of i16; and also the original i32 version with checks for vectorized code.
		dmgreen (author, inline comment, Done): The gather version of the original 32bit code here seems to be quite large for -Os. There are large constant pool entries that the gather needs to access?
		Ayal (inline comment, Not Done): The actual size of vectorized loops under -Os does deserve further investigation, in general.
for.body.preheader:		for.body.preheader:
br label %for.body		br label %for.body

for.body: ; preds = %for.body.preheader, %for.body		for.body: ; preds = %for.body.preheader, %for.body
%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]		%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
%mul = mul nsw i32 %i.07, %k		%mul = mul nsw i32 %i.07, %k
%arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul		%arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
Show All 39 Lines
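The difference between the two test functions above can be illustrated with a small emulation of the loop a[i] = b[i * k]. This is a sketch under assumptions: copyScalar, copyVersioned and copyGather are hypothetical names, the lane loops stand in for vector instructions, and n is assumed to be a multiple of 4 as in the test (256 iterations). Versioning needs a runtime stride check plus a scalar fallback loop, which, as the test comment notes, prevents vectorization under optsize; the gather form needs neither.

```cpp
#include <cassert>
#include <vector>

// Reference loop from the test: a[i] = b[i * k].
void copyScalar(std::vector<int> &a, const std::vector<int> &b, int k, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i * k];
}

// Stride versioning: the vector path is only valid when k == 1, so a runtime
// check (the role of vector.scevcheck) guards it and a scalar copy is kept.
// Returns true if the vector path was taken.
bool copyVersioned(std::vector<int> &a, const std::vector<int> &b, int k, int n) {
  if (k != 1) {
    copyScalar(a, b, k, n); // fallback: extra code kept alongside the vector loop
    return false;
  }
  for (int i = 0; i < n; i += 4)          // assumes n is a multiple of 4
    for (int lane = 0; lane < 4; ++lane)
      a[i + lane] = b[i + lane];          // consecutive wide load and store
  return true;
}

// Gather form: valid for any k, so no runtime check is required.
void copyGather(std::vector<int> &a, const std::vector<int> &b, int k, int n) {
  for (int i = 0; i < n; i += 4)          // assumes n is a multiple of 4
    for (int lane = 0; lane < 4; ++lane)
      a[i + lane] = b[(i + lane) * k];    // one lane of a gather at index (i+lane)*k
}
```

On a target where the element type has no legal gather (the i16 case discussed in the review), only the versioned form is available, so the loop stays scalar when runtime checks are not allowed.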

llvm/unittests/Transforms/Vectorize/VPlanSlpTest.cpp

Show All 37 Lines	protected:
VPInterleavedAccessInfo getInterleavedAccessInfo(Function &F, Loop *L,		VPInterleavedAccessInfo getInterleavedAccessInfo(Function &F, Loop *L,
VPlan &Plan) {		VPlan &Plan) {
AC.reset(new AssumptionCache(F));		AC.reset(new AssumptionCache(F));
SE.reset(new ScalarEvolution(F, TLI, *AC, *DT, *LI));		SE.reset(new ScalarEvolution(F, TLI, *AC, *DT, *LI));
BasicAA.reset(new BasicAAResult(DL, F, TLI, *AC, &*DT, &*LI));		BasicAA.reset(new BasicAAResult(DL, F, TLI, *AC, &*DT, &*LI));
AARes.reset(new AAResults(TLI));		AARes.reset(new AAResults(TLI));
AARes->addAAResult(*BasicAA);		AARes->addAAResult(*BasicAA);
PSE.reset(new PredicatedScalarEvolution(*SE, *L));		PSE.reset(new PredicatedScalarEvolution(*SE, *L));
LAI.reset(new LoopAccessInfo(L, &*SE, &TLI, &*AARes, &*DT, &*LI));		LAI.reset(new LoopAccessInfo(L, &*SE, &TLI, nullptr, &*AARes, &*DT, &*LI));
IAI.reset(new InterleavedAccessInfo(*PSE, L, &*DT, &*LI, &*LAI));		IAI.reset(new InterleavedAccessInfo(*PSE, L, &*DT, &*LI, &*LAI));
IAI->analyzeInterleaving(false);		IAI->analyzeInterleaving(false);
return {Plan, *IAI};		return {Plan, *IAI};
}		}
};		};

TEST_F(VPlanSlpTest, testSlpSimple_2) {		TEST_F(VPlanSlpTest, testSlpSimple_2) {
const char *ModuleString =		const char *ModuleString =
▲ Show 20 Lines • Show All 843 Lines • Show Last 20 Lines


Revision Contents

Diff 235843

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

llvm/include/llvm/Analysis/TargetTransformInfo.h

llvm/lib/Analysis/LoopAccessAnalysis.cpp

llvm/lib/Analysis/TargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/ARM/mve-mat-mul.ll

llvm/test/Transforms/LoopVectorize/X86/optsize.ll

llvm/unittests/Transforms/Vectorize/VPlanSlpTest.cpp
