This is an archive of the discontinued LLVM Phabricator instance.

Differential D61764

[LV] Suppress vectorization in some nontemporal cases
ClosedPublic

Authored by wristow on May 9 2019, 3:21 PM.

Details

Reviewers

mkuper
fhahn
ABataev
RKSimon
andreadb

Commits

rG6452bdd29b5a: [LV] Suppress vectorization in some nontemporal cases
rL363581: [LV] Suppress vectorization in some nontemporal cases

Summary

When considering a loop containing nontemporal stores or loads for
vectorization, suppress the vectorization if the corresponding
vectorized store or load with the alignment of the original scalar
memory op is not supported with the nontemporal hint on the target.

This adds two new functions:
bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;

to TTI, leaving the target independent default implementation as
returning true, but with overriding implementations for X86 that
check the legality based on available Subtarget features.

This fixes https://llvm.org/PR40759
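
As a rough illustration of the check described in the summary, the sketch below walks a loop's memory operations and asks a target hook whether each nontemporal access would still be legal once widened. It is a standalone model, not LLVM code: `MemOp`, `PermissiveHooks`, `RejectingHooks`, and `canVectorizeMemOps` are hypothetical names, and sizes are plain byte counts rather than `Type *`.

```cpp
// Illustrative model only; none of these types are the LLVM classes.
struct MemOp {
  bool IsStore;       // store if true, load otherwise
  bool IsNonTemporal; // access carries the nontemporal hint
  unsigned ElemSize;  // scalar size in bytes
  unsigned Alignment; // alignment of the original scalar access
};

// Models a default that reports every nontemporal access as legal.
struct PermissiveHooks {
  constexpr bool isLegalNTStore(unsigned, unsigned) const { return true; }
  constexpr bool isLegalNTLoad(unsigned, unsigned) const { return true; }
};

// Hypothetical target that supports no nontemporal vector accesses at all.
struct RejectingHooks {
  constexpr bool isLegalNTStore(unsigned, unsigned) const { return false; }
  constexpr bool isLegalNTLoad(unsigned, unsigned) const { return false; }
};

// Returns false as soon as one nontemporal access, widened by VF and keeping
// the alignment of the original scalar access, is rejected by the target.
template <typename Hooks, unsigned N>
constexpr bool canVectorizeMemOps(const Hooks &TTI, const MemOp (&Ops)[N],
                                  unsigned VF) {
  for (unsigned I = 0; I < N; ++I) {
    const MemOp &Op = Ops[I];
    if (!Op.IsNonTemporal)
      continue; // ordinary accesses are not affected by this check
    const unsigned VecSize = Op.ElemSize * VF;
    const bool Legal = Op.IsStore ? TTI.isLegalNTStore(VecSize, Op.Alignment)
                                  : TTI.isLegalNTLoad(VecSize, Op.Alignment);
    if (!Legal)
      return false;
  }
  return true;
}
```

Loops without nontemporal accesses are unaffected in this model, which is why the change is described as only suppressing vectorization "in some nontemporal cases".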

Event Timeline

wristow created this revision. May 9 2019, 3:21 PM

Herald added a subscriber: rkruppe. May 9 2019, 3:21 PM

One point I want to raise explicitly is that, prior to this proposed change, the vectorization implementation ignores the nontemporal hint when checking whether a memory op is vectorizable. So conceptually, it's as though the two new routines:
bool isLegalNTStore(Type *DataType, unsigned Alignment)
bool isLegalNTLoad(Type *DataType, unsigned Alignment)
return true for all types and alignments. With that in mind, I've made the (default) target-independent implementations of these functions return true. Only the overriding functions on X86 (the target PR40759 was reported against) will ever return false here, so on non-X86 targets this would be an NFC commit. But in looking through the other backends, I believe that none of them support misaligned nontemporal mem ops. So possibly a better default would be to have these new functions return true only if the specified DataType is minimally aligned at Alignment.
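
The alternative default floated in this comment can be sketched as follows. This is a minimal model under the assumption that "minimally aligned at Alignment" means the alignment is at least the store size of the type; `defaultIsLegalNT` is a hypothetical name, and sizes are plain byte counts rather than `Type *`.

```cpp
// Hypothetical stand-in for the proposed target-independent default: an access
// is legal only when its alignment is at least its store size in bytes.
constexpr bool defaultIsLegalNT(unsigned DataSizeInBytes, unsigned Alignment) {
  return Alignment >= DataSizeInBytes;
}

// Example sizes: float is 4 bytes, <4 x float> is 16 bytes.
constexpr unsigned FloatSize = 4;
constexpr unsigned V4FloatSize = 16;

// A scalar float access at its natural alignment passes the default...
constexpr bool ScalarAtNaturalAlign = defaultIsLegalNT(FloatSize, 4);
// ...but the widened access at that same alignment does not...
constexpr bool WidenedAtScalarAlign = defaultIsLegalNT(V4FloatSize, 4);
// ...unless the original access was already known to be 16-byte aligned.
constexpr bool WidenedAt16 = defaultIsLegalNT(V4FloatSize, 16);
```

Under such a rule, a vectorization query that passes the scalar access's alignment is rejected whenever that alignment is merely natural.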

RKSimon added reviewers: fhahn, ABataev, RKSimon. May 10 2019, 3:46 AM

ping

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

lib/Target/X86/X86TargetTransformInfo.cpp
3164	SSE4A nt-stores can happen with any alignment, and AFAICT without any perf penalty.

In D61764#1517571, @RKSimon wrote:

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

Glad to see your comment about SSE4A supporting nt-stores at any alignment. With that, I can make an X86 test-case that does vectorize.

lib/Target/X86/X86TargetTransformInfo.cpp
3164	I didn't realize that. I'll update the patch, and include a test for it.

wristow marked an inline comment as done. May 28 2019, 4:47 PM

wristow added inline comments.

lib/Target/X86/X86TargetTransformInfo.cpp
3164	Looking into this, I'm confused... Are you saying (for example) that with SSE4A, `vmovntps` can do an nt-store with a misaligned address? Looking through the docs, I'm seeing a requirement for the address to be aligned. Or are you saying (for example) the SSE4A instruction `movntss` (which takes a vector-register operand containing the value to be stored) can take an arbitrary alignment for the memory address? If that's what your point is, then yes I should change the above to allow misaligned `float` and `double` nt-stores. But `movntss` is only storing one `float` element of the vector register (ignoring the other elems), and so it doesn't allow us to vectorize the case. In short, yes I should change that for `float` and `double` nt-stores on SSE4A, but since it doesn't allow us to vectorize, I wonder if I'm misunderstanding your point. Or are you saying something else? (Like I said, I'm confused.)

Updated the patch to allow arbitrary alignment of float and double nt-stores for SSE4A.
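
For readers following the thread, the store-side rule in the diff under review can be modelled as a standalone function. This is a sketch that mirrors the feature checks as they appear in this revision of `X86TTIImpl::isLegalNTStore`; `X86Features`, `isPowerOf2`, and `isLegalNTStoreModel` are hypothetical names, and the scalar `float`/`double` test is reduced to a boolean parameter.

```cpp
// Standalone model of the store-side check in this revision; not LLVM code.
struct X86Features {
  bool HasSSE1;
  bool HasAVX;
  bool HasSSE4A;
};

constexpr bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

constexpr bool isLegalNTStoreModel(const X86Features &ST, unsigned DataSize,
                                   unsigned Alignment,
                                   bool IsScalarFloatOrDouble) {
  // SSE4A exception: scalar float/double stores are accepted at any alignment.
  if (ST.HasSSE4A && IsScalarFloatOrDouble)
    return true;

  // Otherwise only aligned stores of 4, 8, 16 or 32 bytes are accepted.
  if (Alignment < DataSize || DataSize < 4 || DataSize > 32 ||
      !isPowerOf2(DataSize))
    return false;

  // 32-byte stores are gated on AVX, 16-byte stores on SSE1.
  if (DataSize == 32)
    return ST.HasAVX;
  if (DataSize == 16)
    return ST.HasSSE1;
  return true;
}
```

Note that the SSE4A exception only covers scalar types, so by itself it does not make any vector store legal.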

In D61764#1519660, @wristow wrote:

In D61764#1517571, @RKSimon wrote:

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

Glad to see your comment about SSE4A supporting nt-stores at any alignment. With that, I can make an X86 test-case that does vectorize.

Actually, if I understand your SSE4A point correctly, then that doesn't allow vector nt-stores at arbitrary alignment. So AFAIK, there aren't any misaligned vector nt mem-ops on X86, and so for X86, I cannot make a test that successfully vectorizes.

lib/Target/X86/X86TargetTransformInfo.cpp
3164	I'm thinking your point must be my second guess above (that `movntss` and `movntsd` can store `float`/`double` non-temporally at an arbitrary boundary). So I've updated the patch to do that.

RKSimon added inline comments. May 31 2019, 1:18 AM

lib/Target/X86/X86TargetTransformInfo.cpp
3164	Sorry @wristow, I missed your previous question. Yes, I was referring to SSE4A allowing unaligned scalar float/double nt-stores. Regular (v)movntps still has natural alignment requirements. I also raised PR42026 about using movntss/movntsd/movnti to scalarize unaligned vectors, which I think with suitable costs would still allow us to vectorize everything else: the vectorizer would create an unaligned vector nt-store IR instruction and we'd scalarize the store in the backend.

wristow marked an inline comment as done. May 31 2019, 9:23 AM

wristow added inline comments.

lib/Target/X86/X86TargetTransformInfo.cpp
3164	Thanks for that explanation, @RKSimon . I understand much better now. With that, is there still an additional test (for X86) where non-temporal loads/stores are vectorized that is possible? I think PR42026 is essentially attacking that problem from the other end. Thinking about it, would a fix for PR42026 obviate the change here?

RKSimon added inline comments. Jun 4 2019, 9:48 AM

include/llvm/Analysis/TargetTransformInfoImpl.h
227	I realise this is the current default but it's almost certainly better to have this return false - @fhahn any comments?
lib/Target/X86/X86TargetTransformInfo.cpp
3157	typo
3164	I think getting this patch in first makes sense. I see PR42026 more as a failsafe if anything has managed to create unaligned nt-stores - it doesn't help with nt-loads either.
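
For comparison with the store side, the load-side hook in this revision is narrower: only aligned 16- or 32-byte vectors qualify, and there is no scalar exception. The sketch below is a minimal standalone model that mirrors the checks as they appear in the diff under review; `X86LoadFeatures` and `isLegalNTLoadModel` are hypothetical names.

```cpp
// Standalone model of the load-side check in this revision; not LLVM code.
struct X86LoadFeatures {
  bool HasSSE1;
  bool HasAVX2;
};

constexpr bool isLegalNTLoadModel(const X86LoadFeatures &ST, unsigned DataSize,
                                  unsigned Alignment) {
  // Only aligned 16- or 32-byte accesses are considered; the 16-byte case is
  // gated on SSE1 and the 32-byte case on AVX2, as written in the diff.
  if (Alignment >= DataSize && (DataSize == 16 || DataSize == 32))
    return DataSize == 16 ? ST.HasSSE1 : ST.HasAVX2;
  return false;
}
```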

I think this patch looks good.
The new TTI hooks look good, and the change seems conservative enough. But more importantly it fixes the perf issue reported as PR40759.

I'll leave the final decision to Simon. However, from my point of view this patch looks good.

lib/Target/X86/X86TargetTransformInfo.cpp
3164	My understanding is that we still want this change regardless of PR42026. I was chatting with Simon about this issue. A fix for PR42026 can be seen as some sort of "last resort" if badly aligned NT instructions reach ISel.

This revision is now accepted and ready to land. Jun 4 2019, 9:50 AM

wristow marked 3 inline comments as done. Jun 4 2019, 10:52 AM

wristow added inline comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
227	FWIW, I have a similar view. I was thinking of rather than returning simply true (or false), returning true only if the specified `DataType` is minimally aligned at `Alignment`. In practice for now (since this is currently only called when checking for vectorization), this will effectively mean it defaults to returning false. But if these routines were ever used more generally, checking the alignment of `DataType` seems like a sensible default. I'll wait a bit to see whether @fhahn has a preference.
lib/Target/X86/X86TargetTransformInfo.cpp
3157	Thanks. Will fix before committing.
3164	OK. Sounds good.

This patch was accepted by @andreadb about a week ago, but Andrea said he'd leave the final decision to @RKSimon. With Simon's comments from a week ago asking @fhahn if he had any preferences for keeping the current default as-is, or changing it to something that seems to reflect existing hardware better, I've been hesitant to actually commit it, even though it's marked as Accepted. With a week now gone by, I'll plan on committing this tomorrow, unless Simon or Florian raise any concerns.

Sorry, I totally missed this review. Overall LGTM, just a few minor comments inline.

include/llvm/Analysis/TargetTransformInfoImpl.h
227	I agree, I think we should make non-temporal support opt in.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
755	Can we drop the } else here, as we return in the if () case?
757	I guess we also want to extend the cost model to ensure the vectorization factor we pick supports NT loads/stores, as a follow-up?

Thanks @fhahn. I'll update the patch to address those comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
227	Sounds good. I suspect that change will require making some test-changes to be compatible with the more conservative default. Assuming it does, I'll update the patch here for any further comments, rather than directly committing. FTR, I'm going to use my intermediate suggestion of checking whether `DataType` is minimally aligned at `Alignment` (rather than returning `false`), under the assumption that many architectures do support aligned non-temporal references, and if these new utilities were ever used more generically (outside of vectorization checks), then it would be more meaningful to return `true` in those aligned cases.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
755	Good point. Will do.
757	That seems like a good idea, but I'm not sure how we'll exercise that code with current architectures we support. That is, I think none of our architectures have support for any misaligned non-temporal vector memory references. So in practice, vectorization in these cases will always be suppressed (and so the arbitrary selection of 2 vector elements would behave identically if, for example, we arbitrarily picked 4 or 8).
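
The point about the probe width can be checked with a small model. It assumes a hypothetical aligned-only rule (`alignedOnly`, not an LLVM function) and treats the widened access as `ElemSize * VF` bytes at the alignment of the original scalar access; `widenedIsLegal` is likewise an illustrative name.

```cpp
// Hypothetical aligned-only legality rule, used purely for illustration.
constexpr bool alignedOnly(unsigned DataSize, unsigned Alignment) {
  return Alignment >= DataSize;
}

// Legality of a scalar access widened by VF, keeping the scalar's alignment.
constexpr bool widenedIsLegal(unsigned ElemSize, unsigned ElemAlign,
                              unsigned VF) {
  return alignedOnly(ElemSize * VF, ElemAlign);
}

// Naturally aligned float (4 bytes, align 4): rejected for every factor, so
// probing with 2, 4 or 8 elements gives the same answer.
constexpr bool NaturalVF2 = widenedIsLegal(4, 4, 2);
constexpr bool NaturalVF4 = widenedIsLegal(4, 4, 4);
constexpr bool NaturalVF8 = widenedIsLegal(4, 4, 8);

// Over-aligned float (align 16): the answer now depends on the factor.
constexpr bool Over16VF2 = widenedIsLegal(4, 16, 2);
constexpr bool Over16VF4 = widenedIsLegal(4, 16, 4);
constexpr bool Over16VF8 = widenedIsLegal(4, 16, 8);
```

In this model the choice of factor only matters when the scalar access is over-aligned, which is the situation a cost-model follow-up would need to handle.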

Updated patch to address comments from @RKSimon and @fhahn -- primarily change the default so that nontemporal misaligned mem-ops are assumed to not exist.

wristow marked 4 inline comments as done. Jun 12 2019, 1:46 AM

Thanks @wristow LGTM with one minor

include/llvm/Analysis/TargetTransformInfoImpl.h
233	Might be safer to duplicate the isLegalNTStore code depending on how well targets override both/either of the calls?

wristow marked an inline comment as done. Jun 13 2019, 7:35 PM

wristow added inline comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
233	Will do. (I'm on vacation right now, but will wrap this up first thing next week.)

Closed by commit rL363581: [LV] Suppress vectorization in some nontemporal cases (authored by wristow). Jun 17 2019, 10:16 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Jun 17 2019, 10:16 AM

Revision Contents

Path                                                           Size
include/llvm/Analysis/TargetTransformInfo.h                    13 lines
include/llvm/Analysis/TargetTransformInfoImpl.h                4 lines
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h  16 lines
lib/Analysis/TargetTransformInfo.cpp                           10 lines
lib/Target/X86/X86TargetTransformInfo.h                        2 lines
lib/Target/X86/X86TargetTransformInfo.cpp                      35 lines
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp         29 lines
lib/Transforms/Vectorize/LoopVectorize.cpp                     2 lines
test/Transforms/LoopVectorize/X86/nontemporal.ll               112 lines

Diff 202378

include/llvm/Analysis/TargetTransformInfo.h

[498 lines not shown; context: public:]
   /// modes that operate across loop iterations.
   bool shouldFavorBackedgeIndex(const Loop *L) const;

   /// Return true if the target supports masked load.
   bool isLegalMaskedStore(Type *DataType) const;
   /// Return true if the target supports masked store.
   bool isLegalMaskedLoad(Type *DataType) const;

+  /// Return true if the target supports nontemporal store.
+  bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
+  /// Return true if the target supports nontemporal load.
+  bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;
+
   /// Return true if the target supports masked scatter.
   bool isLegalMaskedScatter(Type *DataType) const;
   /// Return true if the target supports masked gather.
   bool isLegalMaskedGather(Type *DataType) const;

   /// Return true if the target supports masked compress store.
   bool isLegalMaskedCompressStore(Type *DataType) const;
   /// Return true if the target supports masked expand load.
[567 lines not shown; context: virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,]
                                      Instruction *I) = 0;
   virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                              TargetTransformInfo::LSRCost &C2) = 0;
   virtual bool canMacroFuseCmp() = 0;
   virtual bool shouldFavorPostInc() const = 0;
   virtual bool shouldFavorBackedgeIndex(const Loop *L) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType) = 0;
+  virtual bool isLegalNTStore(Type *DataType, unsigned Alignment) = 0;
+  virtual bool isLegalNTLoad(Type *DataType, unsigned Alignment) = 0;
   virtual bool isLegalMaskedScatter(Type *DataType) = 0;
   virtual bool isLegalMaskedGather(Type *DataType) = 0;
   virtual bool isLegalMaskedCompressStore(Type *DataType) = 0;
   virtual bool isLegalMaskedExpandLoad(Type *DataType) = 0;
   virtual bool hasDivRemOp(Type *DataType, bool IsSigned) = 0;
   virtual bool hasVolatileVariant(Instruction *I, unsigned AddrSpace) = 0;
   virtual bool prefersVectorizedAddressing() = 0;
   virtual int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
[233 lines not shown; context: bool shouldFavorBackedgeIndex(const Loop *L) const override {]
     return Impl.shouldFavorBackedgeIndex(L);
   }
   bool isLegalMaskedStore(Type *DataType) override {
     return Impl.isLegalMaskedStore(DataType);
   }
   bool isLegalMaskedLoad(Type *DataType) override {
     return Impl.isLegalMaskedLoad(DataType);
   }
+  bool isLegalNTStore(Type *DataType, unsigned Alignment) override {
+    return Impl.isLegalNTStore(DataType, Alignment);
+  }
+  bool isLegalNTLoad(Type *DataType, unsigned Alignment) override {
+    return Impl.isLegalNTLoad(DataType, Alignment);
+  }
   bool isLegalMaskedScatter(Type *DataType) override {
     return Impl.isLegalMaskedScatter(DataType);
   }
   bool isLegalMaskedGather(Type *DataType) override {
     return Impl.isLegalMaskedGather(DataType);
   }
   bool isLegalMaskedCompressStore(Type *DataType) override {
     return Impl.isLegalMaskedCompressStore(DataType);
[412 lines not shown]

include/llvm/Analysis/TargetTransformInfoImpl.h

[216 lines not shown; context: public:]
   bool shouldFavorPostInc() const { return false; }

   bool shouldFavorBackedgeIndex(const Loop *L) const { return false; }

   bool isLegalMaskedStore(Type *DataType) { return false; }

   bool isLegalMaskedLoad(Type *DataType) { return false; }

+  bool isLegalNTStore(Type *DataType, unsigned Alignment) { return true; }
+
+  bool isLegalNTLoad(Type *DataType, unsigned Alignment) { return true; }
+
   bool isLegalMaskedScatter(Type *DataType) { return false; }

   bool isLegalMaskedGather(Type *DataType) { return false; }

   bool isLegalMaskedCompressStore(Type *DataType) { return false; }

   bool isLegalMaskedExpandLoad(Type *DataType) { return false; }

   bool hasDivRemOp(Type *DataType, bool IsSigned) { return false; }

   bool hasVolatileVariant(Instruction *I, unsigned AddrSpace) { return false; }

   bool prefersVectorizedAddressing() { return true; }
[637 lines not shown]

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

[199 lines not shown]
 /// single induction variable, that all types are supported and vectorize-able,
 /// etc. This code reflects the capabilities of InnerLoopVectorizer.
 /// This class is also used by InnerLoopVectorizer for identifying
 /// induction variable and the different reduction variables.
 class LoopVectorizationLegality {
 public:
   LoopVectorizationLegality(
       Loop *L, PredicatedScalarEvolution &PSE, DominatorTree *DT,
-      TargetLibraryInfo *TLI, AliasAnalysis *AA, Function *F,
-      std::function<const LoopAccessInfo &(Loop &)> *GetLAA, LoopInfo *LI,
-      OptimizationRemarkEmitter *ORE, LoopVectorizationRequirements *R,
-      LoopVectorizeHints *H, DemandedBits *DB, AssumptionCache *AC)
-      : TheLoop(L), LI(LI), PSE(PSE), TLI(TLI), DT(DT), GetLAA(GetLAA),
-        ORE(ORE), Requirements(R), Hints(H), DB(DB), AC(AC) {}
+      TargetTransformInfo *TTI, TargetLibraryInfo *TLI, AliasAnalysis *AA,
+      Function *F, std::function<const LoopAccessInfo &(Loop &)> *GetLAA,
+      LoopInfo *LI, OptimizationRemarkEmitter *ORE,
+      LoopVectorizationRequirements *R, LoopVectorizeHints *H, DemandedBits *DB,
+      AssumptionCache *AC)
+      : TheLoop(L), LI(LI), PSE(PSE), TTI(TTI), TLI(TLI), DT(DT),
+        GetLAA(GetLAA), ORE(ORE), Requirements(R), Hints(H), DB(DB), AC(AC) {}

   /// ReductionList contains the reduction descriptors for all
   /// of the reductions that were found in the loop.
   using ReductionList = DenseMap<PHINode *, RecurrenceDescriptor>;

   /// InductionList saves induction variables and maps them to the
   /// induction descriptor.
   using InductionList = MapVector<PHINode *, InductionDescriptor>;
[179 lines not shown; context: private:]

   /// A wrapper around ScalarEvolution used to add runtime SCEV checks.
   /// Applies dynamic knowledge to simplify SCEV expressions in the context
   /// of existing SCEV assumptions. The analysis will also add a minimal set
   /// of new predicates if this is required to enable vectorization and
   /// unrolling.
   PredicatedScalarEvolution &PSE;

+  /// Target Transform Info.
+  TargetTransformInfo *TTI;
+
   /// Target Library Info.
   TargetLibraryInfo *TLI;

   /// Dominator Tree.
   DominatorTree *DT;

   // LoopAccess analysis.
   std::function<const LoopAccessInfo &(Loop &)> *GetLAA;
[67 lines not shown]

lib/Analysis/TargetTransformInfo.cpp

[171 lines not shown]
 bool TargetTransformInfo::isLegalMaskedStore(Type *DataType) const {
   return TTIImpl->isLegalMaskedStore(DataType);
 }

 bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType) const {
   return TTIImpl->isLegalMaskedLoad(DataType);
 }

+bool TargetTransformInfo::isLegalNTStore(Type *DataType,
+                                         unsigned Alignment) const {
+  return TTIImpl->isLegalNTStore(DataType, Alignment);
+}
+
+bool TargetTransformInfo::isLegalNTLoad(Type *DataType,
+                                        unsigned Alignment) const {
+  return TTIImpl->isLegalNTLoad(DataType, Alignment);
+}
+
 bool TargetTransformInfo::isLegalMaskedGather(Type *DataType) const {
   return TTIImpl->isLegalMaskedGather(DataType);
 }

 bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType) const {
   return TTIImpl->isLegalMaskedScatter(DataType);
 }

[1,063 lines not shown]

lib/Target/X86/X86TargetTransformInfo.h

[180 lines not shown; context: public:]
   int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
   int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                     Type *Ty);
   bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                      TargetTransformInfo::LSRCost &C2);
   bool canMacroFuseCmp();
   bool isLegalMaskedLoad(Type *DataType);
   bool isLegalMaskedStore(Type *DataType);
+  bool isLegalNTLoad(Type *DataType, unsigned Alignment);
+  bool isLegalNTStore(Type *DataType, unsigned Alignment);
   bool isLegalMaskedGather(Type *DataType);
   bool isLegalMaskedScatter(Type *DataType);
   bool isLegalMaskedExpandLoad(Type *DataType);
   bool isLegalMaskedCompressStore(Type *DataType);
   bool hasDivRemOp(Type *DataType, bool IsSigned);
   bool isFCmpOrdCheaperThanFCmpZero(Type *Ty);
   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;
[18 lines not shown]

lib/Target/X86/X86TargetTransformInfo.cpp

Show First 20 Lines • Show All 3,134 Lines • ▼ Show 20 Lines	bool X86TTIImpl::isLegalMaskedLoad(Type *DataTy) {
return IntWidth == 32 \|\| IntWidth == 64 \|\|		return IntWidth == 32 \|\| IntWidth == 64 \|\|
((IntWidth == 8 \|\| IntWidth == 16) && ST->hasBWI());		((IntWidth == 8 \|\| IntWidth == 16) && ST->hasBWI());
}		}

bool X86TTIImpl::isLegalMaskedStore(Type *DataType) {		bool X86TTIImpl::isLegalMaskedStore(Type *DataType) {
return isLegalMaskedLoad(DataType);		return isLegalMaskedLoad(DataType);
}		}

		bool X86TTIImpl::isLegalNTLoad(Type *DataType, unsigned Alignment) {
		unsigned DataSize = DL.getTypeStoreSize(DataType);
		// The only supported nontemporal loads are for aligned vectors of 16 or 32
		// bytes. Note that 32-byte nontemporal vector loads are supported by AVX2
		// (the equivalent stores only require AVX).
		if (Alignment >= DataSize && (DataSize == 16 \|\| DataSize == 32))
		return DataSize == 16 ? ST->hasSSE1() : ST->hasAVX2();

		return false;
		}

		bool X86TTIImpl::isLegalNTStore(Type *DataType, unsigned Alignment) {
		unsigned DataSize = DL.getTypeStoreSize(DataType);

		// SSE4A supports nontermporal stores of float and double at arbitrary
		RKSimonUnsubmitted Done Reply Inline Actions typo RKSimon: typo
		wristowAuthorUnsubmitted Done Reply Inline Actions Thanks. Will fix before committing. wristow: Thanks. Will fix before committing.
		// alignment.
		if (ST->hasSSE4A() && (DataType->isFloatTy() \|\| DataType->isDoubleTy()))
		return true;

		// Besides the SSE4A subtarget exception above, only aligned stores are
		// available nontemporaly on any other subtarget. And only stores with a size
		// of 4..32 bytes (powers of 2, only) are permitted.
		RKSimonUnsubmitted Done Reply Inline Actions SSE4A nt-stores can happen with any alignment, and AFAICT without any perf penalty. RKSimon: SSE4A nt-stores can happen with any alignment, and AFAICT without any perf penalty.
		wristowAuthorUnsubmitted Done Reply Inline Actions I didn't realize that. I'll update the patch, and include a test for it. wristow: I didn't realize that. I'll update the patch, and include a test for it.
		wristow (Author): Looking into this, I'm confused. Are you saying (for example) that with SSE4A, `vmovntps` can do an nt-store with a misaligned address? Looking through the docs, I'm seeing a requirement for the address to be aligned. Or are you saying (for example) that the SSE4A instruction `movntss` (which takes a vector-register operand containing the value to be stored) can take an arbitrary alignment for the memory address? If that's your point, then yes, I should change the above to allow misaligned `float` and `double` nt-stores. But `movntss` only stores one `float` element of the vector register (ignoring the other elements), so it doesn't allow us to vectorize the case. In short, yes, I should change that for `float` and `double` nt-stores on SSE4A, but since it doesn't allow us to vectorize, I wonder if I'm misunderstanding your point. Or are you saying something else?
		wristow (Author): I'm thinking your point must be my second guess above (that `movntss` and `movntsd` can store `float`/`double` non-temporally at an arbitrary boundary). So I've updated the patch to do that.
		RKSimon: Sorry @wristow, I missed your previous question. Yes, I was referring to SSE4A allowing unaligned scalar float/double nt-stores. Regular (v)movntps still has natural alignment requirements. I also raised PR42026 about using movntss/movntsd/movnti to scalarize unaligned vectors, which I think with suitable costs would still allow us to vectorize everything else: the vectorizer would create an unaligned vector nt-store IR instruction and we'd scalarize the store in the backend.
		wristow (Author): Thanks for that explanation, @RKSimon. I understand much better now. With that, is there still an additional test (for X86) that is possible, where non-temporal loads/stores are vectorized? I think PR42026 is essentially attacking that problem from the other end. Thinking about it, would a fix for PR42026 obviate the change here?
		andreadb: My understanding is that we still want this change regardless of PR42026. I was chatting with Simon about this issue. A fix for PR42026 can be seen as some sort of "last resort" if badly aligned NT instructions reach ISel.
		RKSimon: I think getting this patch in first makes sense. I see PR42026 more as a failsafe if anything has managed to create unaligned nt-stores. It doesn't help with nt-loads either.
		wristow (Author): OK. Sounds good.
		if (Alignment < DataSize || DataSize < 4 || DataSize > 32 ||
		!isPowerOf2_32(DataSize))
		return false;

		// 32-byte vector nontemporal stores are supported by AVX (the equivalent
		// loads require AVX2).
		if (DataSize == 32)
		return ST->hasAVX();
		else if (DataSize == 16)
		return ST->hasSSE1();
		return true;
		}

bool X86TTIImpl::isLegalMaskedExpandLoad(Type *DataTy) {		bool X86TTIImpl::isLegalMaskedExpandLoad(Type *DataTy) {
if (!isa<VectorType>(DataTy))		if (!isa<VectorType>(DataTy))
return false;		return false;

if (!ST->hasAVX512())		if (!ST->hasAVX512())
return false;		return false;

// The backend can't handle a single element vector.		// The backend can't handle a single element vector.
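The size and alignment rule visible in the tail of `X86TTIImpl::isLegalNTStore` above can be modeled outside LLVM. The following is a minimal standalone sketch, not the patch's code: `SubtargetSketch`, `isPowerOf2`, and `isLegalNTStoreSketch` are hypothetical names, sizes are passed directly in bytes rather than derived from a `Type`, and the SSE4A scalar `float`/`double` exception discussed in the inline comments is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical stand-in for the X86 subtarget feature queries (ST->hasSSE1(),
// ST->hasAVX()) used by the patch.
struct SubtargetSketch {
  bool HasSSE1;
  bool HasAVX;
};

// True if V is a nonzero power of two.
static bool isPowerOf2(uint32_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Models only the size/alignment portion of the check: the store must be
// aligned to at least its own size and be 4, 8, 16, or 32 bytes wide.
bool isLegalNTStoreSketch(uint32_t DataSize, uint32_t Alignment,
                          const SubtargetSketch &ST) {
  if (Alignment < DataSize || DataSize < 4 || DataSize > 32 ||
      !isPowerOf2(DataSize))
    return false;
  if (DataSize == 32)
    return ST.HasAVX;  // 32-byte nontemporal stores need AVX.
  if (DataSize == 16)
    return ST.HasSSE1; // 16-byte nontemporal stores need SSE1.
  return true;         // Aligned 4- and 8-byte stores.
}
```

Under this model, a 2-element vector of 4-byte elements whose scalar store carries `align 4` is rejected (`isLegalNTStoreSketch(8, 4, ...)` is false), which is the situation the new test exercises.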

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

		for (Instruction &I : *BB) {
!I.getType()->isVoidTy()) ||		!I.getType()->isVoidTy()) ||
isa<ExtractElementInst>(I)) {		isa<ExtractElementInst>(I)) {
ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I)		ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I)
<< "instruction return type cannot be vectorized");		<< "instruction return type cannot be vectorized");
LLVM_DEBUG(dbgs() << "LV: Found unvectorizable type.\n");		LLVM_DEBUG(dbgs() << "LV: Found unvectorizable type.\n");
return false;		return false;
}		}

// Check that the stored type is vectorizable.
if (auto *ST = dyn_cast<StoreInst>(&I)) {		if (auto *ST = dyn_cast<StoreInst>(&I)) {
Type *T = ST->getValueOperand()->getType();		Type *T = ST->getValueOperand()->getType();
		// Check that the stored type is vectorizable.
if (!VectorType::isValidElementType(T)) {		if (!VectorType::isValidElementType(T)) {
ORE->emit(createMissedAnalysis("CantVectorizeStore", ST)		ORE->emit(createMissedAnalysis("CantVectorizeStore", ST)
<< "store instruction cannot be vectorized");		<< "store instruction cannot be vectorized");
return false;		return false;

		// For nontemporal stores, check that a nontemporal vector version is
		// supported on the target.
		} else if (ST->getMetadata(LLVMContext::MD_nontemporal)) {
		fhahn: Can we drop the `} else` here, as we return in the `if ()` case?
		wristow (Author): Good point. Will do.
		// Arbitrarily try a vector of 2 elements.
		Type *VecTy = VectorType::get(T, /*NumElements=*/2);
		fhahn: I guess we also want to extend the cost model to ensure the vectorization factor we pick supports NT loads/stores, as a follow-up?
		wristow (Author): That seems like a good idea, but I'm not sure how we'll exercise that code with the architectures we currently support. That is, I think none of our architectures support any misaligned non-temporal vector memory references. So in practice, vectorization in these cases will always be suppressed (and so the arbitrary selection of 2 vector elements would behave identically if, for example, we arbitrarily picked 4 or 8).
		assert(VecTy && "did not find vectorized version of stored type");
		unsigned Alignment = getLoadStoreAlignment(ST);
		if (!TTI->isLegalNTStore(VecTy, Alignment)) {
		ORE->emit(createMissedAnalysis("CantVectorizeNontemporalStore", ST)
		<< "nontemporal store instruction cannot be vectorized");
		return false;
		}
		}

		} else if (auto *LD = dyn_cast<LoadInst>(&I)) {
		if (LD->getMetadata(LLVMContext::MD_nontemporal)) {
		// For nontemporal loads, check that a nontemporal vector version is
		// supported on the target (arbitrarily try a vector of 2 elements).
		Type *VecTy = VectorType::get(I.getType(), /*NumElements=*/2);
		assert(VecTy && "did not find vectorized version of load type");
		unsigned Alignment = getLoadStoreAlignment(LD);
		if (!TTI->isLegalNTLoad(VecTy, Alignment)) {
		ORE->emit(createMissedAnalysis("CantVectorizeNontemporalLoad", LD)
		<< "nontemporal load instruction cannot be vectorized");
		return false;
		}
}		}

// FP instructions can allow unsafe algebra, thus vectorizable by		// FP instructions can allow unsafe algebra, thus vectorizable by
// non-IEEE-754 compliant SIMD units.		// non-IEEE-754 compliant SIMD units.
// This applies to floating-point math operations and calls, not memory		// This applies to floating-point math operations and calls, not memory
// operations, shuffles, or casts, as they don't change precision or		// operations, shuffles, or casts, as they don't change precision or
// semantics.		// semantics.
} else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) &&		} else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) &&
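The legality check added above walks the loop body and, for each load or store carrying the nontemporal hint, asks the target whether a 2-element vector version is legal at the scalar op's alignment. A minimal sketch of that control flow, detached from LLVM, might look like the following. `MemOpSketch`, `NTLegalityHook`, and `canVectorizeMemOpsSketch` are hypothetical names introduced for illustration, and the hook stands in for `TTI->isLegalNTStore` / `TTI->isLegalNTLoad`.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical, simplified stand-in for a scalar load or store in the loop.
struct MemOpSketch {
  bool IsStore;
  bool IsNontemporal;
  uint32_t ElemSize;  // Size of the scalar element, in bytes.
  uint32_t Alignment; // Alignment recorded on the scalar op, in bytes.
};

// Target hook: (vector size in bytes, alignment, is-store) -> legal?
using NTLegalityHook = std::function<bool(uint32_t, uint32_t, bool)>;

bool canVectorizeMemOpsSketch(const std::vector<MemOpSketch> &Ops,
                              const NTLegalityHook &IsLegalNT) {
  for (const MemOpSketch &Op : Ops) {
    if (!Op.IsNontemporal)
      continue; // Accesses without the hint are not affected by this check.
    // Arbitrarily probe a 2-element vector, as the patch does.
    uint32_t VecSize = Op.ElemSize * 2;
    if (!IsLegalNT(VecSize, Op.Alignment, Op.IsStore))
      return false; // Vectorizing would lose the nontemporal hint.
  }
  return true;
}
```

With a hook that requires the alignment to be at least the vector size, a nontemporal 4-byte store with 4-byte alignment blocks vectorization whether the probe uses 2, 4, or 8 elements, which matches the author's remark that the choice of 2 is arbitrary on current targets.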

lib/Transforms/Vectorize/LoopVectorize.cpp


		if (!Hints.allowVectorization(F, L, VectorizeOnlyWhenForced)) {
LLVM_DEBUG(dbgs() << "LV: Loop hints prevent vectorization.\n");		LLVM_DEBUG(dbgs() << "LV: Loop hints prevent vectorization.\n");
return false;		return false;
}		}

PredicatedScalarEvolution PSE(SE, L);		PredicatedScalarEvolution PSE(SE, L);

// Check if it is legal to vectorize the loop.		// Check if it is legal to vectorize the loop.
LoopVectorizationRequirements Requirements(*ORE);		LoopVectorizationRequirements Requirements(*ORE);
LoopVectorizationLegality LVL(L, PSE, DT, TLI, AA, F, GetLAA, LI, ORE,		LoopVectorizationLegality LVL(L, PSE, DT, TTI, TLI, AA, F, GetLAA, LI, ORE,
&Requirements, &Hints, DB, AC);		&Requirements, &Hints, DB, AC);
if (!LVL.canVectorize(EnableVPlanNativePath)) {		if (!LVL.canVectorize(EnableVPlanNativePath)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

// Check the function attributes and profiles to find out if this function		// Check the function attributes and profiles to find out if this function

test/Transforms/LoopVectorize/X86/nontemporal.ll

				; RUN: opt < %s -loop-vectorize -force-vector-width=4 -S | FileCheck %s

				; The three test-cases below are all based on modified versions of a simple copy-loop:
				;
				; void foo(unsigned *src, unsigned *dst, unsigned nElts) {
				; for (unsigned i = 0; i < nElts; ++i) {
				; unsigned tmp = src[i];
				; dst[i] = tmp;
				; }
				; }
				;
				; In the first version, there are no nontemporal stores or loads, and so vectorization
				; is safely done.
				;
				; In the second version, the store into dst[i] has the nontemporal hint. The alignment
				; on X86_64 for 'unsigned' is 4, so the vector store generally will not be aligned to the
				; vector size (of 16 here). Unaligned nontemporal vector stores are not supported on X86_64,
				; and so the vectorization is suppressed (because when vectorizing it, the nontemporal hint
				; would not be honored in the final code-gen).
				;
				; The third version is analogous to the second, except rather than the store, it is the
				; load from 'src[i]' that has the nontemporal hint. Vectorization is suppressed in this
				; case because (like stores) unaligned nontemporal vector loads are not supported on X86_64.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64"

				; CHECK-LABEL: @vectorTest(
				define void @vectorTest(i32* noalias readonly %src, i32* noalias %dst, i32 %nElts) {
				entry:
				%cmp8 = icmp eq i32 %nElts, 0
				br i1 %cmp8, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %nElts to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				; Check that we vectorized the load, and that there is no nontemporal hint.
				; CHECK: %wide.load = load <4 x i32>, <4 x i32>* %{{[0-9]+}}, align 4{{$}}
				%arrayidx = getelementptr inbounds i32, i32* %src, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				; Check that we vectorized the store, and that there is no nontemporal hint.
				; CHECK: store <4 x i32> %wide.load, <4 x i32>* %{{[0-9]+}}, align 4{{$}}
				%arrayidx2 = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv
				store i32 %0, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @vectorNTStoreTest(
				; Check that the vectorized type of the store does not appear.
				; CHECK-NOT: 4 x i32
				define void @vectorNTStoreTest(i32* noalias readonly %src, i32* noalias %dst, i32 %nElts) {
				entry:
				%cmp8 = icmp eq i32 %nElts, 0
				br i1 %cmp8, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %nElts to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %src, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv
				; Check that the store is not vectorized and that we don't lose the !nontemporal hint in it.
				; CHECK: store i32 %{{[0-9]+}}, i32* %arrayidx2, align 4, !nontemporal !4
				store i32 %0, i32* %arrayidx2, align 4, !nontemporal !0
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @vectorNTLoadTest(
				; Check that the vectorized type of the load does not appear.
				; CHECK-NOT: 4 x i32
				define void @vectorNTLoadTest(i32* noalias readonly %src, i32* noalias %dst, i32 %nElts) {
				entry:
				%cmp8 = icmp eq i32 %nElts, 0
				br i1 %cmp8, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %nElts to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %src, i64 %indvars.iv
				; Check that the load is not vectorized and that we don't lose the !nontemporal hint in it.
				; CHECK: load i32, i32* %arrayidx, align 4, !nontemporal !4
				%0 = load i32, i32* %arrayidx, align 4, !nontemporal !0
				%arrayidx2 = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv
				store i32 %0, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				!0 = !{i32 1}
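For context, source code along the following lines is one way a loop like `@vectorNTStoreTest` could arise. This is a hedged sketch rather than the source the test was generated from: it assumes Clang's `__builtin_nontemporal_store` builtin, which attaches the nontemporal hint to the store, and falls back to an ordinary store on compilers that do not provide it. The function name `copyWithNTStore` is hypothetical.

```cpp
#include <cstddef>

// Older compilers may not define __has_builtin at all.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Copies NElts elements from Src to Dst, requesting a nontemporal store for
// each element where the compiler offers Clang's builtin, and using a plain
// store otherwise.
void copyWithNTStore(const unsigned *Src, unsigned *Dst, std::size_t NElts) {
  for (std::size_t I = 0; I < NElts; ++I) {
    unsigned Tmp = Src[I];
#if __has_builtin(__builtin_nontemporal_store)
    __builtin_nontemporal_store(Tmp, &Dst[I]);
#else
    Dst[I] = Tmp;
#endif
  }
}
```

Because each element of `Dst` is only guaranteed 4-byte alignment here, a 16-byte vector store of four elements would not satisfy the alignment requirement that the test comments describe, so with this patch the loop is expected to stay scalar on X86_64.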

Revision Contents

Diff 202378