This is an archive of the discontinued LLVM Phabricator instance.

Differential D128342

[AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
ClosedPublic

Authored by david-arm on Jun 22 2022, 7:36 AM.

Details

Reviewers

sdesmalen
fhahn
dmgreen
peterwaller-arm
kmclaughlin
dorit

Commits

rG4ef9cb6c170a: [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved…

Summary

If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
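
To make the two vectorisation styles in the summary concrete, here is a minimal sketch in plain C++, not taken from the patch. It models a vector loop with a hypothetical factor `VF` of 4: one version uses an unpredicated body followed by a scalar epilogue, the other folds the tail into the body by guarding each lane with a mask. The function names and the `+ 1` loop body are assumptions chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical vectorisation factor used only for this illustration.
constexpr std::size_t VF = 4;

// Style 1: unpredicated "vector" body plus a scalar epilogue for the remainder.
std::vector<int> addOneWithEpilogue(const std::vector<int> &In) {
  std::vector<int> Out(In.size());
  std::size_t I = 0;
  // Each outer iteration stands in for one VF-wide vector operation.
  for (; I + VF <= In.size(); I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Out[I + Lane] = In[I + Lane] + 1;
  // Scalar epilogue: handles the remaining (size % VF) elements.
  for (; I < In.size(); ++I)
    Out[I] = In[I] + 1;
  return Out;
}

// Style 2: tail-folded body; every memory access is guarded by a lane mask,
// so no epilogue loop is needed.
std::vector<int> addOneTailFolded(const std::vector<int> &In) {
  std::vector<int> Out(In.size());
  for (std::size_t I = 0; I < In.size(); I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      bool Active = I + Lane < In.size(); // models the active lane mask
      if (Active)
        Out[I + Lane] = In[I + Lane] + 1;
    }
  return Out;
}
```

Both functions compute the same result. The difference that matters for this patch is that the tail-folded form needs every load and store in the body to be maskable, which is the problem for interleaved accesses.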

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Jun 22 2022, 7:36 AM

Herald added a project: Restricted Project. Jun 22 2022, 7:36 AM

Herald added subscribers: ctetreau, rogfer01, bollu and 2 others.

david-arm requested review of this revision. Jun 22 2022, 7:36 AM

Herald added a project: Restricted Project. Jun 22 2022, 7:36 AM

Herald added subscribers: llvm-commits, vkmr.

Harbormaster completed remote builds in B171318: Diff 439017. Jun 22 2022, 7:37 AM

sdesmalen added inline comments. Jun 28 2022, 1:52 AM

llvm/include/llvm/Analysis/VectorUtils.h
814	nit: s/haveGroups/hasGroups/ ?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5053	I'm not sure I understand where the `UserVF` comes into play with this choice, given that this is a choice between "tail folding" vs "scalar epilogue". Can you elaborate?
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
1 ↗	(On Diff #439017)	Why is this loop still vectorized with a VF of 8?

fhahn added inline comments. Jun 28 2022, 2:44 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
1 ↗	(On Diff #439017)	This needs `REQUIRES: asserts` I think, as it uses `-debug`, same for the other test. Could you precommit the tests, so only the impact of the patch is shown in this diff?
9 ↗	(On Diff #439017)	nit: everything except the `noalias` attribute appears to be unnecessary

david-arm added inline comments. Jun 28 2022, 5:02 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5053	So there is an MVE test that tries to force the use of tail-folding for a loop with masked interleaved accesses, and the test forces a particular VF. So I had two choices here: either fix the test to reflect the new behaviour, or assume that when the user has explicitly requested tail-folding and forced a VF, they probably really want tail-folding enabled for that VF. Alternatively I could add a brand new flag controlling this behaviour, something like -use-tail-folding-when-interleaving, etc.

david-arm added inline comments. Jun 28 2022, 5:14 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
1 ↗	(On Diff #439017)	Hi @sdesmalen, this is actually what we want to happen because this is a loop with interleaved memory accesses where NEON can do this very efficiently. It's actually a VF of 4, but we have the <8 x float> and <12 x float> types in the loop due to deinterleaving loads and interleaving stores, respectively. What this patch is doing is instructing the vectoriser to disable tail folding because performance is likely to be terrible, and instead fall back on normal unpredicated vector loops that don't require masked interleaved accesses.
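
As an illustration of why a VF of 4 produces `<8 x float>` and `<12 x float>` types, here is a minimal sketch of one vector iteration of a loop that reads pairs of floats (interleave factor 2) and writes triples (interleave factor 3). The loop body, the function name and `NumLanes` are hypothetical; the actual test loop is not reproduced here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical VF for this sketch; matches the VF of 4 discussed above.
constexpr std::size_t NumLanes = 4;

// Models one vector iteration: a wide load of 2 * VF floats is deinterleaved
// into two VF-wide vectors, and three VF-wide results are interleaved into a
// single wide store of 3 * VF floats.
std::array<float, 3 * NumLanes>
processOneVectorIteration(const std::array<float, 2 * NumLanes> &Wide) {
  std::array<float, NumLanes> Even{}, Odd{};
  // Deinterleaving load (what ld2 does in one instruction on NEON).
  for (std::size_t Lane = 0; Lane < NumLanes; ++Lane) {
    Even[Lane] = Wide[2 * Lane];
    Odd[Lane] = Wide[2 * Lane + 1];
  }
  // Interleaving store (what st3 does in one instruction on NEON).
  std::array<float, 3 * NumLanes> Out{};
  for (std::size_t Lane = 0; Lane < NumLanes; ++Lane) {
    Out[3 * Lane] = Even[Lane] + Odd[Lane];
    Out[3 * Lane + 1] = Even[Lane] - Odd[Lane];
    Out[3 * Lane + 2] = Even[Lane] * Odd[Lane];
  }
  return Out;
}
```

With four lanes the wide load covers 8 floats and the wide store covers 12, even though only 4 loop iterations are processed per vector iteration.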

Cleaned up tests.
Renamed haveGroups -> hasGroups.

david-arm marked 3 inline comments as done. Jun 28 2022, 6:47 AM

Harbormaster completed remote builds in B172462: Diff 440598. Jun 28 2022, 7:33 AM

In terms of MVE - A VLD2/VLD4 cannot be predicated, so in that regard we do not support "MaskedInterleavedAccesses". There is code in canTailPredicateLoop that attempts to get that right. Any other interleaving group width will be emulated with a gather/scatter though, which can happily be masked.
https://godbolt.org/z/KzvEqz439

For SVE my understanding is that LD2/LD3/LD4 can be predicated, and other widths (and the current codegen, as interleaving is not yet supported) will use gather/scatter, which can be masked. In the long run they may have MaskedInterleavedAccesses returning true.

Is the problem that we are trying to use one variable for both NEON and SVE vectorization, where SVE prefers folding the tail and NEON needs to avoid it?

In D128342#3615721, @dmgreen wrote:

In terms of MVE - A VLD2/VLD4 cannot be predicated, so in that regard we do not support "MaskedInterleavedAccesses". There is code in canTailPredicateLoop that attempts to get that right. Any other interleaving group width will be emulated with a gather/scatter though, which can happily be masked.
https://godbolt.org/z/KzvEqz439

For SVE my understanding is that LD2/LD3/LD4 can be predicated, and other widths (and the current codegen, as interleaving is not yet supported) will use gather/scatter, which can be masked. In the long run they may have MaskedInterleavedAccesses returning true.

Is the problem that we are trying to use one variable for both NEON and SVE vectorization, where SVE prefers folding the tail and NEON needs to avoid it?

Hi @dmgreen, when I first discovered this issue it gave me a headache! The fundamental problem here is that we don't create a matrix of vplan costs with vectorisation style (tail-folding, vector loop + scalar epilogue, etc.) as one axis and VF as the other. This means we cannot look at all possible permutations and choose the lowest cost combination, so our only option is to decide up-front whether or not to use tail-folding before we've calculated any costs.

We currently don't have support for interleaving with SVE using ld2/st2/etc because we cannot rely upon shufflevector in the same way that fixed-width vectorisation does. This means that if we choose to tail-fold a loop with masked interleaved accesses the costs for both NEON and SVE will be so high that we will not vectorise at all. Whereas if we didn't choose to tail-fold we'd actually end up vectorising the interleaved memory accesses using NEON's ld2/st3, which is still faster than a scalar loop. So what I'm trying to do here is avoid causing regressions when enabling tail-folding for any target that does not support masked interleaved memory accesses. If at some point in the future we can support them for SVE then useMaskedInterleavedAccesses will return true.
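
The "matrix of costs" point can be sketched as follows. This is a toy model, not vectoriser code: the `Style` enum, the candidate VFs and every cost value are hypothetical. It contrasts fixing the style before any costs are known with searching all (style, VF) combinations.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical vectorisation styles and candidate VFs for this sketch.
enum Style : std::size_t { ScalarEpilogue = 0, TailFolded = 1 };
constexpr std::array<unsigned, 3> VFs = {1, 4, 8};

using CostRow = std::array<double, 3>;     // cost per scalar iteration, per VF
using CostMatrix = std::array<CostRow, 2>; // one row per style

// Index of the cheapest VF within a single style (first minimum wins).
std::size_t cheapestVF(const CostRow &Row) {
  std::size_t Best = 0;
  for (std::size_t I = 1; I < Row.size(); ++I)
    if (Row[I] < Row[Best])
      Best = I;
  return Best;
}

// Models the current situation: the style is fixed up-front, and only the VF
// is chosen by cost.
std::pair<Style, unsigned> pickWithStyleFixed(const CostMatrix &M,
                                              Style Chosen) {
  return {Chosen, VFs[cheapestVF(M[Chosen])]};
}

// Models the desired situation: search every (style, VF) combination.
std::pair<Style, unsigned> pickFromMatrix(const CostMatrix &M) {
  Style BestStyle = ScalarEpilogue;
  std::size_t BestVF = cheapestVF(M[ScalarEpilogue]);
  std::size_t TailFoldedVF = cheapestVF(M[TailFolded]);
  if (M[TailFolded][TailFoldedVF] < M[BestStyle][BestVF]) {
    BestStyle = TailFolded;
    BestVF = TailFoldedVF;
  }
  return {BestStyle, VFs[BestVF]};
}
```

If the tail-folded row is very expensive at every vector VF, as described for loops with interleaved accesses, `pickWithStyleFixed(M, TailFolded)` ends up at VF=1, while `pickFromMatrix` would still find the cheaper unpredicated vector loop.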

Hi @dmgreen, when I first discovered this issue it gave me a headache! The fundamental problem here is that we don't create a matrix of vplan costs with vectorisation style (tail-folding, vector loop + scalar epilogue, etc.) as one axis and VF as the other. This means we cannot look at all possible permutations and choose the lowest cost combination, so our only option is to decide up-front whether or not to use tail-folding before we've calculated any costs.

We currently don't have support for interleaving with SVE using ld2/st2/etc because we cannot rely upon shufflevector in the same way that fixed-width vectorisation does. This means that if we choose to tail-fold a loop with masked interleaved accesses the costs for both NEON and SVE will be so high that we will not vectorise at all. Whereas if we didn't choose to tail-fold we'd actually end up vectorising the interleaved memory accesses using NEON's ld2/st3, which is still faster than a scalar loop. So what I'm trying to do here is avoid causing regressions when enabling tail-folding for any target that does not support masked interleaved memory accesses. If at some point in the future we can support them for SVE then useMaskedInterleavedAccesses will return true.

Yeah I remember that causing problems for Sjoerd on MVE in the past, with it deciding too much up-front.

Would it be possible to treat this the same way as MVE does - return false for preferPredicateOverEpilogue if the stride of a memory access is 2 or 3 or 4. That would prevent the MVE side getting worse. (It's not a lot worse, but predicating the loop is better than not doing it in the second example up above). And in the long run we can try to keep pushing the vectorizer towards being able to cost different vplans against one another.

Matt added a subscriber: Matt. Jun 28 2022, 2:23 PM

Hi @dmgreen, I can certainly look into this! I'd just assumed that trying to have predicated loops for targets that don't have masked interleaved accesses would be bad in general, that's all. It means falling back on gather/scatters or scalarising, either of which is usually very expensive.

It looks like in LoopVectorizePass::processLoop we set up the InterleavedAccessInfo object after we've called getScalarEpilogueLowering, which is a shame but perhaps I can shuffle things around.

david-arm planned changes to this revision. Jun 30 2022, 6:39 AM

Changed to using the TTI hook 'preferPredicateOverEpilogue' in order to make the decision about whether to use tail-folding or not. This requires refactoring the LoopVectorize.cpp code a little in order to pass in the InterleavedAccessInfo object.
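
The shape of the resulting hook decision can be sketched with stand-in types. None of the types below are the LLVM classes, and the bit values in `TailFoldingBits` are assumptions made for this sketch; only the order of the checks follows the AArch64 change in this revision.

```cpp
#include <cstdint>

// Stand-ins for this sketch only; these are not the LLVM classes.
struct MockInterleaveInfo {
  unsigned NumGroups = 0;
  bool hasGroups() const { return NumGroups != 0; }
};

struct MockLegality {
  unsigned NumReductions = 0;
  unsigned NumRecurrences = 0;
};

// Hypothetical bit assignments for the kinds of loop that may be tail-folded.
enum TailFoldingBits : std::uint8_t {
  TFDisabled = 0,
  TFReductions = 1,
  TFRecurrences = 2,
  TFSimple = 4
};

bool preferPredicateOverEpilogueSketch(bool HasSVE, std::uint8_t Enabled,
                                       const MockLegality &LVL,
                                       const MockInterleaveInfo &IAI) {
  if (!HasSVE || Enabled == TFDisabled)
    return false;

  // Loops with interleave groups keep the scalar epilogue so that fixed-width
  // interleaved loads and stores remain an option.
  if (IAI.hasGroups())
    return false;

  // Work out which tail-folding kinds this loop would need.
  std::uint8_t Required = 0;
  if (LVL.NumReductions)
    Required |= TFReductions;
  if (LVL.NumRecurrences)
    Required |= TFRecurrences;
  if (!Required)
    Required |= TFSimple;

  // Only fold the tail if every required kind has been enabled.
  return (Enabled & Required) == Required;
}
```

The sketch takes the interleave information by reference. The hook in the patch receives a pointer and dereferences it without a null check, so callers are expected to always pass a valid object.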

Herald added subscribers: shiva0217, tschuett. Jul 19 2022, 3:08 AM

david-arm added a parent revision: D129560: [AArch64] Add target hook for preferPredicateOverEpilogue. Jul 19 2022, 3:08 AM

Harbormaster completed remote builds in B176211: Diff 445758. Jul 19 2022, 4:03 AM

david-arm added a child revision: D130618: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1. Jul 27 2022, 4:29 AM

Hi @david-arm, this change looks reasonable to me. Can you please wait a day or so before landing, just to make sure the other reviewers have a chance to look at it again?

This revision is now accepted and ready to land. Jul 29 2022, 7:09 AM

This revision was landed with ongoing or failed builds. Aug 2 2022, 1:52 AM

Closed by commit rG4ef9cb6c170a: [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved… (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rG4ef9cb6c170a: [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved….

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	20 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	3 lines
llvm/include/llvm/Analysis/VectorUtils.h	3 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h	5 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	6 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	3 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	9 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h	3 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	3 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	39 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll	53 lines

Diff 449227

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ 41 unchanged lines hidden @@
 class BlockFrequencyInfo;
 class DominatorTree;
 class BranchInst;
 class CallBase;
 class Function;
 class GlobalValue;
 class InstCombiner;
 class OptimizationRemarkEmitter;
+class InterleavedAccessInfo;
 class IntrinsicInst;
 class LoadInst;
 class Loop;
 class LoopInfo;
 class LoopVectorizationLegality;
 class ProfileSummaryInfo;
 class RecurrenceDescriptor;
 class SCEV;
@@ 468 unchanged lines hidden @@ bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                 AssumptionCache &AC, TargetLibraryInfo *LibInfo,
                                 HardwareLoopInfo &HWLoopInfo) const;

   /// Query the target whether it would be prefered to create a predicated
   /// vector loop, which can avoid the need to emit a scalar epilogue loop.
   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL) const;
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI) const;

   /// Query the target whether lowering of the llvm.get.active.lane.mask
   /// intrinsic is supported and how the mask should be used. A return value
   /// of PredicationStyle::Data indicates the mask is used as data only,
   /// whereas PredicationStyle::DataAndControlFlow indicates we should also use
   /// the mask for control flow in the loop. If unsupported the return value is
   /// PredicationStyle::None.
   PredicationStyle emitGetActiveLaneMask() const;
@@ 1,019 unchanged lines hidden @@ virtual void getUnrollingPreferences(Loop *L, ScalarEvolution &,
                                        UnrollingPreferences &UP,
                                        OptimizationRemarkEmitter *ORE) = 0;
   virtual void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                                      PeelingPreferences &PP) = 0;
   virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                         AssumptionCache &AC,
                                         TargetLibraryInfo *LibInfo,
                                         HardwareLoopInfo &HWLoopInfo) = 0;
-  virtual bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
-                                           ScalarEvolution &SE,
-                                           AssumptionCache &AC,
-                                           TargetLibraryInfo *TLI,
-                                           DominatorTree *DT,
-                                           LoopVectorizationLegality *LVL) = 0;
+  virtual bool
+  preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
+                              AssumptionCache &AC, TargetLibraryInfo *TLI,
+                              DominatorTree *DT, LoopVectorizationLegality *LVL,
+                              InterleavedAccessInfo *IAI) = 0;
   virtual PredicationStyle emitGetActiveLaneMask() = 0;
   virtual Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
                                                        IntrinsicInst &II) = 0;
   virtual Optional<Value *>
   simplifyDemandedUseBitsIntrinsic(InstCombiner &IC, IntrinsicInst &II,
                                    APInt DemandedMask, KnownBits &Known,
                                    bool &KnownBitsComputed) = 0;
   virtual Optional<Value *> simplifyDemandedVectorEltsIntrinsic(
@@ 367 unchanged lines hidden @@ public:
   bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                 AssumptionCache &AC, TargetLibraryInfo *LibInfo,
                                 HardwareLoopInfo &HWLoopInfo) override {
     return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
   }
   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL) override {
-    return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL);
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI) override {
+    return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL, IAI);
   }
   PredicationStyle emitGetActiveLaneMask() override {
     return Impl.emitGetActiveLaneMask();
   }
   Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
                                                IntrinsicInst &II) override {
     return Impl.instCombineIntrinsic(IC, II);
   }
@@ 639 unchanged lines hidden @@

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ 157 unchanged lines hidden @@ bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                 AssumptionCache &AC, TargetLibraryInfo *LibInfo,
                                 HardwareLoopInfo &HWLoopInfo) const {
     return false;
   }

   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL) const {
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI) const {
     return false;
   }

   PredicationStyle emitGetActiveLaneMask() const {
     return PredicationStyle::None;
   }

   Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
@@ 1,106 unchanged lines hidden @@

llvm/include/llvm/Analysis/VectorUtils.h

@@ 805 unchanged lines hidden @@ public:
   /// out-of-bounds requires a scalar epilogue iteration for correctness.
   bool requiresScalarEpilogue() const { return RequiresScalarEpilogue; }

   /// Invalidate groups that require a scalar epilogue (due to gaps). This can
   /// happen when optimizing for size forbids a scalar epilogue, and the gap
   /// cannot be filtered by masking the load/store.
   void invalidateGroupsRequiringScalarEpilogue();

+  /// Returns true if we have any interleave groups.
+  bool hasGroups() const { return !InterleaveGroups.empty(); }
+
 private:
   /// A wrapper around ScalarEvolution, used to add runtime SCEV checks.
   /// Simplifies SCEV expressions in the context of existing SCEV assumptions.
   /// The interleaved access analysis can also add new predicates (for example
   /// by versioning strides of pointers).
   PredicatedScalarEvolution &PSE;

   Loop *TheLoop;
@@ 148 unchanged lines hidden @@

llvm/include/llvm/CodeGen/BasicTTIImpl.h

@@ 597 unchanged lines hidden @@ bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                 TargetLibraryInfo *LibInfo,
                                 HardwareLoopInfo &HWLoopInfo) {
     return BaseT::isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
   }

   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL) {
-    return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL);
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI) {
+    return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL, IAI);
   }

   PredicationStyle emitGetActiveLaneMask() {
     return BaseT::emitGetActiveLaneMask();
   }

   Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
                                                IntrinsicInst &II) {
@@ 1,764 unchanged lines hidden @@

llvm/lib/Analysis/TargetTransformInfo.cpp

@@ 289 unchanged lines hidden @@
 bool TargetTransformInfo::isHardwareLoopProfitable(
     Loop *L, ScalarEvolution &SE, AssumptionCache &AC,
     TargetLibraryInfo *LibInfo, HardwareLoopInfo &HWLoopInfo) const {
   return TTIImpl->isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
 }

 bool TargetTransformInfo::preferPredicateOverEpilogue(
     Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
-    TargetLibraryInfo *TLI, DominatorTree *DT,
-    LoopVectorizationLegality *LVL) const {
-  return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL);
+    TargetLibraryInfo *TLI, DominatorTree *DT, LoopVectorizationLegality *LVL,
+    InterleavedAccessInfo *IAI) const {
+  return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LVL, IAI);
 }

 PredicationStyle TargetTransformInfo::emitGetActiveLaneMask() const {
   return TTIImpl->emitGetActiveLaneMask();
 }

 Optional<Instruction *>
 TargetTransformInfo::instCombineIntrinsic(InstCombiner &IC,
@@ 938 unchanged lines hidden @@

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

@@ 337 unchanged lines hidden @@ PredicationStyle emitGetActiveLaneMask() const {
     if (ST->hasSVE())
       return PredicationStyle::DataAndControlFlow;
     return PredicationStyle::None;
   }

   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL);
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI);

   bool supportsScalableVectors() const { return ST->hasSVE(); }

   bool enableScalableVectorization() const { return ST->hasSVE(); }

   bool isLegalToVectorizeReduction(const RecurrenceDescriptor &RdxDesc,
                                    ElementCount VF) const;

@@ 19 unchanged lines hidden @@

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ 3,025 unchanged lines hidden @@ if (Kind == TTI::SK_InsertSubvector && LT.second.isFixedLengthVector() &&
     }
   }

   return BaseT::getShuffleCost(Kind, Tp, Mask, Index, SubTp);
 }

 bool AArch64TTIImpl::preferPredicateOverEpilogue(
     Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
-    TargetLibraryInfo *TLI, DominatorTree *DT, LoopVectorizationLegality *LVL) {
+    TargetLibraryInfo *TLI, DominatorTree *DT, LoopVectorizationLegality *LVL,
+    InterleavedAccessInfo *IAI) {
   if (!ST->hasSVE() || TailFoldingKindLoc == TailFoldingKind::TFDisabled)
     return false;

+  // We don't currently support vectorisation with interleaving for SVE - with
+  // such loops we're better off not using tail-folding. This gives us a chance
+  // to fall back on fixed-width vectorisation using NEON's ld2/st2/etc.
+  if (IAI->hasGroups())
+    return false;
+
   TailFoldingKind Required; // Defaults to 0.
   if (LVL->getReductionVars().size())
     Required.add(TailFoldingKind::TFReductions);
   if (LVL->getFirstOrderRecurrences().size())
     Required.add(TailFoldingKind::TFRecurrences);
   if (!Required)
     Required.add(TailFoldingKind::TFSimple);

   return (TailFoldingKindLoc & Required) == Required;
 }

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

@@ 289 unchanged lines hidden @@ public:
   bool isLoweredToCall(const Function *F);
   bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                 AssumptionCache &AC,
                                 TargetLibraryInfo *LibInfo,
                                 HardwareLoopInfo &HWLoopInfo);
   bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
                                    AssumptionCache &AC, TargetLibraryInfo *TLI,
                                    DominatorTree *DT,
-                                   LoopVectorizationLegality *LVL);
+                                   LoopVectorizationLegality *LVL,
+                                   InterleavedAccessInfo *IAI);
   void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                TTI::UnrollingPreferences &UP,
                                OptimizationRemarkEmitter *ORE);

   PredicationStyle emitGetActiveLaneMask() const;

   void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                              TTI::PeelingPreferences &PP);
@@ 44 unchanged lines hidden @@

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

@@ 2,226 unchanged lines hidden @@ static bool canTailPredicateLoop(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
   }

   LLVM_DEBUG(dbgs() << "tail-predication: all instructions allowed!\n");
   return true;
 }

 bool ARMTTIImpl::preferPredicateOverEpilogue(
     Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
-    TargetLibraryInfo *TLI, DominatorTree *DT, LoopVectorizationLegality *LVL) {
+    TargetLibraryInfo *TLI, DominatorTree *DT, LoopVectorizationLegality *LVL,
+    InterleavedAccessInfo *IAI) {
   if (!EnableTailPredication) {
     LLVM_DEBUG(dbgs() << "Tail-predication not enabled.\n");
     return false;
   }

   // Creating a predicated vector loop is the first step for generating a
   // tail-predicated hardware loop, for which we need the MVE masked
   // load/stores instructions:
@@ 174 unchanged lines hidden @@

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 5,044 Lines • ▼ Show 20 Lines	LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
if (!useMaskedInterleavedAccesses(TTI)) {		if (!useMaskedInterleavedAccesses(TTI)) {
assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&		assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
"No decisions should have been taken at this point");		"No decisions should have been taken at this point");
// Note: There is no need to invalidate any cost modeling decisions here, as		// Note: There is no need to invalidate any cost modeling decisions here, as
// non where taken so far.		// non where taken so far.
InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();		InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();
}		}

FixedScalableVFPair MaxFactors = computeFeasibleMaxVF(TC, UserVF, true);		FixedScalableVFPair MaxFactors = computeFeasibleMaxVF(TC, UserVF, true);
		sdesmalenUnsubmitted Not Done Reply Inline Actions I'm not sure I understand where the `UserVF` comes into play with this choice, given that this a choice between "tail folding" vs "scalar epilogue". Can you elaborate? sdesmalen: I'm not sure I understand where the `UserVF` comes into play with this choice, given that this…
		david-armAuthorUnsubmitted Done Reply Inline Actions So there is a MVE test that tries to force the use of tail-folding for a loop with masked interleaved accesses, and the test forces a particular VF. So I had two choices here: Either fix the test to reflect the new behaviour, or In cases where the user has explicitly requested tail-folding and forced a VF, then probably the user really wants tail-folding enabled for that VF. Alternatively I could add a brand new flag controlling this behaviour, something like -use-tail-folding-when-interleaving, etc. david-arm: So there is a MVE test that tries to force the use of tail-folding for a loop with masked…
// Avoid tail folding if the trip count is known to be a multiple of any VF		// Avoid tail folding if the trip count is known to be a multiple of any VF
// we chose.		// we chose.
// FIXME: The condition below pessimises the case for fixed-width vectors,		// FIXME: The condition below pessimises the case for fixed-width vectors,
// when scalable VFs are also candidates for vectorization.		// when scalable VFs are also candidates for vectorization.
if (MaxFactors.FixedVF.isVector() && !MaxFactors.ScalableVF) {		if (MaxFactors.FixedVF.isVector() && !MaxFactors.ScalableVF) {
ElementCount MaxFixedVF = MaxFactors.FixedVF;		ElementCount MaxFixedVF = MaxFactors.FixedVF;
assert((UserVF.isNonZero() || isPowerOf2_32(MaxFixedVF.getFixedValue())) &&		assert((UserVF.isNonZero() || isPowerOf2_32(MaxFixedVF.getFixedValue())) &&
"MaxFixedVF must be a power of 2");		"MaxFixedVF must be a power of 2");
[...]
// Determine how to lower the scalar epilogue, which depends on 1) optimising		// Determine how to lower the scalar epilogue, which depends on 1) optimising
// for minimum code-size, 2) predicate compiler options, 3) loop hints forcing		// for minimum code-size, 2) predicate compiler options, 3) loop hints forcing
// predication, and 4) a TTI hook that analyses whether the loop is suitable		// predication, and 4) a TTI hook that analyses whether the loop is suitable
// for predication.		// for predication.
static ScalarEpilogueLowering getScalarEpilogueLowering(		static ScalarEpilogueLowering getScalarEpilogueLowering(
Function *F, Loop *L, LoopVectorizeHints &Hints, ProfileSummaryInfo *PSI,		Function *F, Loop *L, LoopVectorizeHints &Hints, ProfileSummaryInfo *PSI,
BlockFrequencyInfo *BFI, TargetTransformInfo *TTI, TargetLibraryInfo *TLI,		BlockFrequencyInfo *BFI, TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
AssumptionCache *AC, LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,		AssumptionCache *AC, LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
LoopVectorizationLegality &LVL) {		LoopVectorizationLegality &LVL, InterleavedAccessInfo *IAI) {
// 1) OptSize takes precedence over all other options, i.e. if this is set,		// 1) OptSize takes precedence over all other options, i.e. if this is set,
// don't look at hints or options, and don't request a scalar epilogue.		// don't look at hints or options, and don't request a scalar epilogue.
// (For PGSO, as shouldOptimizeForSize isn't currently accessible from		// (For PGSO, as shouldOptimizeForSize isn't currently accessible from
// LoopAccessInfo (due to code dependency and not being able to reliably get		// LoopAccessInfo (due to code dependency and not being able to reliably get
// PSI/BFI from a loop analysis under NPM), we cannot suppress the collection		// PSI/BFI from a loop analysis under NPM), we cannot suppress the collection
// of strides in LoopAccessInfo::analyzeLoop() and vectorize without		// of strides in LoopAccessInfo::analyzeLoop() and vectorize without
// versioning when the vectorization is forced, unlike hasOptSize. So revert		// versioning when the vectorization is forced, unlike hasOptSize. So revert
// back to the old way and vectorize with versioning when forced. See D81345.)		// back to the old way and vectorize with versioning when forced. See D81345.)
[...]
switch (Hints.getPredicate()) {		switch (Hints.getPredicate()) {
case LoopVectorizeHints::FK_Enabled:		case LoopVectorizeHints::FK_Enabled:
return CM_ScalarEpilogueNotNeededUsePredicate;		return CM_ScalarEpilogueNotNeededUsePredicate;
case LoopVectorizeHints::FK_Disabled:		case LoopVectorizeHints::FK_Disabled:
return CM_ScalarEpilogueAllowed;		return CM_ScalarEpilogueAllowed;
};		};

// 4) if the TTI hook indicates this is profitable, request predication.		// 4) if the TTI hook indicates this is profitable, request predication.
if (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, &LVL))		if (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, &LVL, IAI))
return CM_ScalarEpilogueNotNeededUsePredicate;		return CM_ScalarEpilogueNotNeededUsePredicate;

return CM_ScalarEpilogueAllowed;		return CM_ScalarEpilogueAllowed;
}		}
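The precedence described in the comment on getScalarEpilogueLowering can be sketched as a standalone function. This is a simplified model, not the LLVM implementation: `Hint`, `PredicateOption`, `Lowering` and `pickLowering` are hypothetical names, step 2 (the command-line option) is not visible in this excerpt and is modelled loosely, and the TTI hook is reduced to a single boolean.

```cpp
// Hypothetical, simplified model of the four-step precedence.
enum class Hint { Undefined, Enabled, Disabled };
enum class PredicateOption { Unset, ScalarEpilogue, Predicate };
enum class Lowering {
  ScalarEpilogueAllowed,
  ScalarEpilogueNotAllowedOptSize,
  ScalarEpilogueNotNeededUsePredicate
};

Lowering pickLowering(bool OptForSize, PredicateOption Opt, Hint PredicateHint,
                      bool TargetPrefersPredicate) {
  // 1) Optimising for size wins over everything else.
  if (OptForSize)
    return Lowering::ScalarEpilogueNotAllowedOptSize;
  // 2) An explicit compiler option comes next (assumed shape).
  if (Opt == PredicateOption::ScalarEpilogue)
    return Lowering::ScalarEpilogueAllowed;
  if (Opt == PredicateOption::Predicate)
    return Lowering::ScalarEpilogueNotNeededUsePredicate;
  // 3) Loop hints that force or forbid predication.
  switch (PredicateHint) {
  case Hint::Enabled:
    return Lowering::ScalarEpilogueNotNeededUsePredicate;
  case Hint::Disabled:
    return Lowering::ScalarEpilogueAllowed;
  case Hint::Undefined:
    break;
  }
  // 4) Finally, ask the target whether predication looks profitable.
  return TargetPrefersPredicate ? Lowering::ScalarEpilogueNotNeededUsePredicate
                                : Lowering::ScalarEpilogueAllowed;
}
```

In this model the change under review only affects step 4: passing the interleave information to the hook lets a target answer "no" when interleave groups are present, which yields ScalarEpilogueAllowed unless an earlier step has already decided.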

Value *VPTransformState::get(VPValue *Def, unsigned Part) {		Value *VPTransformState::get(VPValue *Def, unsigned Part) {
// If Values have been set for this Def return the one relevant for \p Part.		// If Values have been set for this Def return the one relevant for \p Part.
if (hasVectorValue(Def, Part))		if (hasVectorValue(Def, Part))
[...]
if (isa<SCEVCouldNotCompute>(PSE.getBackedgeTakenCount())) {
LLVM_DEBUG(dbgs() << "LV: cannot compute the outer-loop trip count\n");		LLVM_DEBUG(dbgs() << "LV: cannot compute the outer-loop trip count\n");
return false;		return false;
}		}
assert(EnableVPlanNativePath && "VPlan-native path is disabled.");		assert(EnableVPlanNativePath && "VPlan-native path is disabled.");
Function *F = L->getHeader()->getParent();		Function *F = L->getHeader()->getParent();
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());

ScalarEpilogueLowering SEL = getScalarEpilogueLowering(		ScalarEpilogueLowering SEL = getScalarEpilogueLowering(
F, L, Hints, PSI, BFI, TTI, TLI, AC, LI, PSE.getSE(), DT, *LVL);		F, L, Hints, PSI, BFI, TTI, TLI, AC, LI, PSE.getSE(), DT, *LVL, &IAI);

LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,		LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,
&Hints, IAI);		&Hints, IAI);
// Use the planner for outer loop vectorization.		// Use the planner for outer loop vectorization.
// TODO: CM is not used at this point inside the planner. Turn CM into an		// TODO: CM is not used at this point inside the planner. Turn CM into an
// optional argument if we don't need it in the future.		// optional argument if we don't need it in the future.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM, IAI, PSE, Hints, ORE);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM, IAI, PSE, Hints, ORE);

[...]
#endif /* NDEBUG */
LoopVectorizationLegality LVL(L, PSE, DT, TTI, TLI, AA, F, GetLAA, LI, ORE,		LoopVectorizationLegality LVL(L, PSE, DT, TTI, TLI, AA, F, GetLAA, LI, ORE,
&Requirements, &Hints, DB, AC, BFI, PSI);		&Requirements, &Hints, DB, AC, BFI, PSI);
if (!LVL.canVectorize(EnableVPlanNativePath)) {		if (!LVL.canVectorize(EnableVPlanNativePath)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

// Check the function attributes and profiles to find out if this function
// should be optimized for size.
ScalarEpilogueLowering SEL = getScalarEpilogueLowering(
F, L, Hints, PSI, BFI, TTI, TLI, AC, LI, PSE.getSE(), DT, LVL);

// Entrance to the VPlan-native vectorization path. Outer loops are processed		// Entrance to the VPlan-native vectorization path. Outer loops are processed
// here. They may require CFG and instruction level transformations before		// here. They may require CFG and instruction level transformations before
// even evaluating whether vectorization is profitable. Since we cannot modify		// even evaluating whether vectorization is profitable. Since we cannot modify
// the incoming IR, we need to build VPlan upfront in the vectorization		// the incoming IR, we need to build VPlan upfront in the vectorization
// pipeline.		// pipeline.
if (!L->isInnermost())		if (!L->isInnermost())
return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,		return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
ORE, BFI, PSI, Hints, Requirements);		ORE, BFI, PSI, Hints, Requirements);

assert(L->isInnermost() && "Inner loop expected.");		assert(L->isInnermost() && "Inner loop expected.");

		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL.getLAI());
		bool UseInterleaved = TTI->enableInterleavedAccessVectorization();

		// If an override option has been passed in for interleaved accesses, use it.
		if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)
		UseInterleaved = EnableInterleavedMemAccesses;

		// Analyze interleaved memory accesses.
		if (UseInterleaved)
		IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));

		// Check the function attributes and profiles to find out if this function
		// should be optimized for size.
		ScalarEpilogueLowering SEL = getScalarEpilogueLowering(
		F, L, Hints, PSI, BFI, TTI, TLI, AC, LI, PSE.getSE(), DT, LVL, &IAI);

// Check the loop for a trip count threshold: vectorize loops with a tiny trip		// Check the loop for a trip count threshold: vectorize loops with a tiny trip
// count by optimizing for size, to minimize overheads.		// count by optimizing for size, to minimize overheads.
auto ExpectedTC = getSmallBestKnownTC(*SE, L);		auto ExpectedTC = getSmallBestKnownTC(*SE, L);
if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {		if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {
LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "		LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
<< "This loop is worth vectorizing only if no scalar "		<< "This loop is worth vectorizing only if no scalar "
<< "iteration overheads are incurred.");		<< "iteration overheads are incurred.");
if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)		if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
[...]
ORE->emit([&]() {
"floating-point operations";		"floating-point operations";
});		});
LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "		LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "
"reorder floating-point operations\n");		"reorder floating-point operations\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

bool UseInterleaved = TTI->enableInterleavedAccessVectorization();
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL.getLAI());

// If an override option has been passed in for interleaved accesses, use it.
if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)
UseInterleaved = EnableInterleavedMemAccesses;

// Analyze interleaved memory accesses.
if (UseInterleaved) {
IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));
}

// Use the cost model.		// Use the cost model.
LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI, DB, AC, ORE,		LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI, DB, AC, ORE,
F, &Hints, IAI);		F, &Hints, IAI);
CM.collectValuesToIgnore();		CM.collectValuesToIgnore();
CM.collectElementTypesForWidening();		CM.collectElementTypesForWidening();

// Use the planner for vectorization.		// Use the planner for vectorization.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM, IAI, PSE, Hints, ORE);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM, IAI, PSE, Hints, ORE);
[...]

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

[...]
for.body: ; preds = %entry, %for.body
store i32 %add2, i32* %arrayidx3, align 4		store i32 %add2, i32* %arrayidx3, align 4
%exitcond.not = icmp eq i64 %add, %n		%exitcond.not = icmp eq i64 %add, %n
br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

for.end: ; preds = %for.body		for.end: ; preds = %for.body
ret void		ret void
}		}

		define void @interleave(float* noalias %dst, float* noalias %src, i64 %n) #0 {
		; CHECK-NOTF-LABEL: @interleave(
		; CHECK-NOTF: vector.body:
		; CHECK-NOTF: %[[LOAD:.*]] = load <8 x float>, <8 x float>
		; CHECK-NOTF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK-NOTF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

		; CHECK-TF-LABEL: @interleave(
		; CHECK-TF: vector.body:
		; CHECK-TF: %[[LOAD:.*]] = load <8 x float>, <8 x float>
		; CHECK-TF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK-TF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

		; CHECK-TF-NORED-LABEL: @interleave(
		; CHECK-TF-NORED: vector.body:
		; CHECK-TF-NORED: %[[LOAD:.*]] = load <8 x float>, <8 x float>
		; CHECK-TF-NORED: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK-TF-NORED: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

		; CHECK-TF-NOREC-LABEL: @interleave(
		; CHECK-TF-NOREC: vector.body:
		; CHECK-TF-NOREC: %[[LOAD:.*]] = load <8 x float>, <8 x float>
		; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

		entry:
		br label %for.body

		for.body: ; preds = %entry, %for.body
		%i.021 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
		%mul = shl nuw nsw i64 %i.021, 1
		%arrayidx = getelementptr inbounds float, float* %src, i64 %mul
		%0 = load float, float* %arrayidx, align 4
		%mul1 = mul nuw nsw i64 %i.021, 3
		%arrayidx2 = getelementptr inbounds float, float* %dst, i64 %mul1
		store float %0, float* %arrayidx2, align 4
		%add = or i64 %mul, 1
		%arrayidx4 = getelementptr inbounds float, float* %src, i64 %add
		%1 = load float, float* %arrayidx4, align 4
		%add6 = add nuw nsw i64 %mul1, 1
		%arrayidx7 = getelementptr inbounds float, float* %dst, i64 %add6
		store float %1, float* %arrayidx7, align 4
		%add9 = add nuw nsw i64 %mul1, 2
		%arrayidx10 = getelementptr inbounds float, float* %dst, i64 %add9
		store float 3.000000e+00, float* %arrayidx10, align 4
		%inc = add nuw nsw i64 %i.021, 1
		%exitcond.not = icmp eq i64 %inc, %n
		br i1 %exitcond.not, label %for.end, label %for.body

		for.end: ; preds = %for.body, %entry
		ret void
		}

attributes #0 = { "target-features"="+sve" }		attributes #0 = { "target-features"="+sve" }

!0 = distinct !{!0, !1, !2, !3, !4}		!0 = distinct !{!0, !1, !2, !3, !4}
!1 = !{!"llvm.loop.vectorize.width", i32 4}		!1 = !{!"llvm.loop.vectorize.width", i32 4}
!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}		!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
!3 = !{!"llvm.loop.interleave.count", i32 1}		!3 = !{!"llvm.loop.interleave.count", i32 1}
!4 = !{!"llvm.loop.vectorize.enable", i1 true}		!4 = !{!"llvm.loop.vectorize.enable", i1 true}
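For readers less used to IR, the new @interleave test corresponds roughly to the C++ loop below. This is a hand-written reconstruction from the IR, not the source the test was generated from. It reads `src` with stride 2 (a factor-2 load group) and writes `dst` with stride 3 (a factor-3 store group). The do-while form mirrors the IR, which branches straight into the loop body, so the sketch assumes `n >= 1`; it also assumes `dst` and `src` do not overlap, matching the `noalias` attributes.

```cpp
#include <cstdint>

// Assumed equivalent of the @interleave test function; requires n >= 1
// and non-overlapping dst/src.
void interleave(float *dst, const float *src, std::int64_t n) {
  std::int64_t i = 0;
  do {
    dst[3 * i] = src[2 * i];         // first member of both groups
    dst[3 * i + 1] = src[2 * i + 1]; // second member of both groups
    dst[3 * i + 2] = 3.0f;           // third store completes the group
    ++i;
  } while (i != n);
}
```

The CHECK lines expect the same wide `<8 x float>` load plus two `shufflevector` de-interleaves under every prefix, i.e. an unpredicated vector body even in the runs where tail-folding is requested.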

Revision Contents

Diff 449227