This is an archive of the discontinued LLVM Phabricator instance.

Make SLP vectorizer consider the cost that vectorized instruction cannot use memory operand as destination on X86
Needs Review · Public

Authored by wmi on Jun 9 2015, 5:33 PM.

Details

Reviewers

nadav
aschwaighofer

Summary

This is the patch to fix the performance problem reported in https://llvm.org/bugs/show_bug.cgi?id=23510.

Many X86 scalar instructions support using a memory operand as the destination, but most vector instructions do not. Consider how the SLP cost evaluation sees the following case.

scalar version:

t1 = load [mem1]
t1 = shift 5, t1
store t1, [mem1]
...
t4 = load [mem4]
t4 = shift 5, t4
store t4, [mem4]

slp vectorized version:

v1 = vload [mem1]
v1 = vshift 5, v1
store v1, [mem1]

The SLP cost model estimates a saving of 12 - 3 = 9 instructions. But on x86 the scalar version can be folded into the following form, while the vectorized version cannot:

[mem1] = shift 5, [mem1]
[mem2] = shift 5, [mem2]
[mem3] = shift 5, [mem3]
[mem4] = shift 5, [mem4]

We add an extra cost of VL * 2 to the SLP cost evaluation to handle such cases (VL is the vector length), i.e. one load and one store per scalar lane that the folded scalar form no longer needs.

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 27417.Jun 9 2015, 5:33 PM

wmi retitled this revision from to Make SLP vectorizer consider the cost that vectorized instruction cannot use memory operand as destination on X86.

wmi updated this object.

wmi edited the test plan for this revision.

wmi added reviewers: nadav, aschwaighofer.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: Unknown Object (MLST), davidxl.

Hi Wei,

Thank you for working on this. The SLP and Loop vectorizers (and the target transform info) can’t model the entire complexity of the different targets. The vectorizers do not model the cost of addressing modes and assume that folding memory operations into instructions is only a code size reduction that does not influence runtime performance because out-of-order processors translate these instructions into multiple uops that perform the load/store and the arithmetic operation. This is only an approximation of the way out-of-order processors work, but we need to make such assumptions if we want to limit the complexity of the vectorizer. One exception to this rule is the special handling of vector geps, and scatter/gather instructions.

Your example demonstrates that the assumption that we make about how out-of-order processors work is not always true. However, I still believe that the vectorizers should not model things like instruction encodings because it can complicate the vectorizer and TTI significantly. I believe that a better place to make a decision about the optimal code sequence in your example would be SelectionDAG. The codegen has more information to make such a decision. We don’t want TTI to expose an API that will be the superset of all target specific information that any pass may care about.

I suggest that we keep the current vectorizer cost model and implement a peephole to reverse vectorization if needed, in the x86 backend. We already do something similar for wide AVX loads. On Sandybridge it is often beneficial to split 256bit loads/stores, and the decision to split such loads is done in the codegen and not inside the vectorizer.

Thanks,
Nadav

Revision Contents

Path                                              Size
include/llvm/Analysis/TargetTransformInfo.h       8 lines
include/llvm/Analysis/TargetTransformInfoImpl.h   2 lines
lib/Analysis/TargetTransformInfo.cpp              5 lines
lib/Target/X86/X86TargetTransformInfo.h           2 lines
lib/Target/X86/X86TargetTransformInfo.cpp         20 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp        41 lines
test/Transforms/SLPVectorizer/X86/pr23510.ll      36 lines

Diff 27417

include/llvm/Analysis/TargetTransformInfo.h

@@ 482 lines hidden @@ public:

   /// \returns The cost, if any, of keeping values of the given types alive
   /// over a callsite.
   ///
   /// Some types may require the use of register classes that do not have
   /// any callee-saved registers, so would require a spill and fill.
   unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const;

+  /// \return whether the target supports using memory operand as the
+  /// destination for the opcode and type.
+  bool isLegalMemDestOperand(unsigned Opcode, Type *Ty) const;
+
   /// \returns True if the intrinsic is a supported memory intrinsic. Info
   /// will contain additional information - whether the intrinsic may write
   /// or read to memory, volatility and the pointer. Info is undefined
   /// if false is returned.
   bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info) const;

   /// \returns A value which is the result of the given memory intrinsic. New
   /// instructions may be created to extract the result from the given intrinsic
@@ 87 lines hidden @@ virtual unsigned getReductionCost(unsigned Opcode, Type *Ty,
       bool IsPairwiseForm) = 0;
   virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
       ArrayRef<Type *> Tys) = 0;
   virtual unsigned getCallInstrCost(Function *F, Type *RetTy,
       ArrayRef<Type *> Tys) = 0;
   virtual unsigned getNumberOfParts(Type *Tp) = 0;
   virtual unsigned getAddressComputationCost(Type *Ty, bool IsComplex) = 0;
   virtual unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) = 0;
+  virtual bool isLegalMemDestOperand(unsigned Opcode, Type *Ty) = 0;
   virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,
       MemIntrinsicInfo &Info) = 0;
   virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
       Type *ExpectedType) = 0;
 };

 template <typename T>
 class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
@@ 154 lines hidden @@ unsigned getNumberOfParts(Type *Tp) override {
     return Impl.getNumberOfParts(Tp);
   }
   unsigned getAddressComputationCost(Type *Ty, bool IsComplex) override {
     return Impl.getAddressComputationCost(Ty, IsComplex);
   }
   unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) override {
     return Impl.getCostOfKeepingLiveOverCall(Tys);
   }
+  bool isLegalMemDestOperand(unsigned Opcode, Type *Ty) override {
+    return Impl.isLegalMemDestOperand(Opcode, Ty);
+  }
   bool getTgtMemIntrinsic(IntrinsicInst *Inst,
       MemIntrinsicInfo &Info) override {
     return Impl.getTgtMemIntrinsic(Inst, Info);
   }
   Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
       Type *ExpectedType) override {
     return Impl.getOrCreateResultFromMemIntrinsic(Inst, ExpectedType);
   }
@@ 107 lines hidden @@

include/llvm/Analysis/TargetTransformInfoImpl.h

@@ 311 lines hidden @@ public:
   unsigned getNumberOfParts(Type *Tp) { return 0; }

   unsigned getAddressComputationCost(Type *Tp, bool) { return 0; }

   unsigned getReductionCost(unsigned, Type *, bool) { return 1; }

   unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) { return 0; }

+  bool isLegalMemDestOperand(unsigned, Type *) { return 0; }
+
   bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info) {
     return false;
   }

   Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
       Type *ExpectedType) {
     return nullptr;
   }
@@ 114 lines hidden @@

lib/Analysis/TargetTransformInfo.cpp

@@ 259 lines hidden @@ unsigned TargetTransformInfo::getReductionCost(unsigned Opcode, Type *Ty,
   return TTIImpl->getReductionCost(Opcode, Ty, IsPairwiseForm);
 }

 unsigned
 TargetTransformInfo::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {
   return TTIImpl->getCostOfKeepingLiveOverCall(Tys);
 }

+bool TargetTransformInfo::isLegalMemDestOperand(unsigned Opcode,
+    Type *Ty) const {
+  return TTIImpl->isLegalMemDestOperand(Opcode, Ty);
+}
+
 bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,
     MemIntrinsicInfo &Info) const {
   return TTIImpl->getTgtMemIntrinsic(Inst, Info);
 }

 Value *TargetTransformInfo::getOrCreateResultFromMemIntrinsic(
     IntrinsicInst *Inst, Type *ExpectedType) const {
   return TTIImpl->getOrCreateResultFromMemIntrinsic(Inst, ExpectedType);
@@ 49 lines hidden @@

lib/Target/X86/X86TargetTransformInfo.h

@@ 95 lines hidden @@ public:
   unsigned getIntImmCost(int64_t);

   unsigned getIntImmCost(const APInt &Imm, Type *Ty);

   unsigned getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm,
       Type *Ty);
   unsigned getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
       Type *Ty);
+  bool isLegalMemDestOperand(unsigned Opcode, Type *Ty);
   bool isLegalMaskedLoad(Type *DataType, int Consecutive);
   bool isLegalMaskedStore(Type *DataType, int Consecutive);

   /// @}
 };

 } // end namespace llvm

 #endif

lib/Target/X86/X86TargetTransformInfo.cpp

@@ 60 lines hidden @@ unsigned X86TTIImpl::getRegisterBitWidth(bool Vector) {
   }

   if (ST->is64Bit())
     return 64;
   return 32;

 }

+bool X86TTIImpl::isLegalMemDestOperand(unsigned Opcode, Type *Ty) {
+  if (Ty->isVectorTy())
+    return false;
+  switch (Opcode) {
+  case Instruction::Add:
+  case Instruction::Sub:
+  case Instruction::Mul:
+  case Instruction::UDiv:
+  case Instruction::SDiv:
+  case Instruction::And:
+  case Instruction::Or:
+  case Instruction::Xor:
+  case Instruction::Shl:
+  case Instruction::LShr:
+  case Instruction::AShr:
+    return true;
+  }
+  return false;
+}
+
 unsigned X86TTIImpl::getMaxInterleaveFactor(unsigned VF) {
   // If the loop will not be vectorized, don't interleave the loop.
   // Let regular unroll to unroll the loop, which saves the overflow
   // check and memory check cost.
   if (VF == 1)
     return 1;

   if (ST->isAtom())
@@ 1,056 lines hidden @@

lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ 353 lines hidden @@ public:
   /// \brief Vectorize the tree that starts with the elements in \p VL.
   /// Returns the vectorized root.
   Value *vectorizeTree();

   /// \returns the cost incurred by unwanted spills and fills, caused by
   /// holding live values over call sites.
   int getSpillCost();

+  /// \returns the cost incurred by other side effect like failing to
+  /// combine insns after vectorization.
+  int getOtherCost();
+
   /// \returns the vectorization cost of the subtree that starts at \p VL.
   /// A negative number means that this is profitable.
   int getTreeCost();

   /// Construct a vectorizable tree that starts at \p Roots, ignoring users for
   /// the purpose of scheduling and extraction in the \p UserIgnoreLst.
   void buildTree(ArrayRef<Value *> Roots,
       ArrayRef<Value *> UserIgnoreLst = None);
@@ 1,335 lines hidden @@ for (unsigned N = 0; N < VectorizableTree.size(); ++N) {

     PrevInst = Inst;
   }

   DEBUG(dbgs() << "SLP: SpillCost=" << Cost << "\n");
   return Cost;
 }

+int BoUpSLP::getOtherCost() {
+  // Many X86 scalar instructions support using memory operand as destination
+  // but most vector instructions do not support it. Like:
+  //   shrq $5, (%rdx)
+  //   shrq $5, 8(%rdx)
+  // is often better than:
+  //   movdqu (%rdx), %xmm0
+  //   psrlq $5, %xmm0
+  //   movdqu %xmm0, (%rdx)
+  if (VectorizableTree.size() >= 3) {
+    StoreInst *SI = dyn_cast<StoreInst>(VectorizableTree[0].Scalars[0]);
+    if (!SI)
+      return 0;
+
+    LoadInst *LI = dyn_cast<LoadInst>(VectorizableTree[2].Scalars[0]);
+    if (!LI)
+      return 0;
+
+    Instruction *IT = cast<Instruction>(VectorizableTree[1].Scalars[0]);
+    ArrayRef<Value *> VL = VectorizableTree[0].Scalars;
+    VectorType *VecTy = VectorType::get(IT->getType(), VL.size());
+
+    // If scalar version of IT instruction cannot use memory operand as
+    // destination or vector version can, no extra cost.
+    // If LI and SI have different memory addresses, no extra cost.
+    if (!TTI->isLegalMemDestOperand(IT->getOpcode(), IT->getType()) ||
+        TTI->isLegalMemDestOperand(IT->getOpcode(), VecTy) ||
+        SI->getOperand(1) != LI->getOperand(0))
+      return 0;
+
+    return VL.size() * 2;
+  }
+  return 0;
+}
+
 int BoUpSLP::getTreeCost() {
   int Cost = 0;
   DEBUG(dbgs() << "SLP: Calculating cost for tree of size " <<
       VectorizableTree.size() << ".\n");

   // We only vectorize tiny trees if it is fully vectorizable.
   if (VectorizableTree.size() < 3 && !isFullyVectorizableTinyTree()) {
     if (VectorizableTree.empty()) {
@@ 27 lines hidden @@ for (UserList::iterator I = ExternalUses.begin(), E = ExternalUses.end();

     VectorType *VecTy = VectorType::get(I->Scalar->getType(), BundleWidth);
     ExtractCost += TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
         I->Lane);
   }

   Cost += getSpillCost();

+  Cost += getOtherCost();
+
   DEBUG(dbgs() << "SLP: Total Cost " << Cost + ExtractCost<< ".\n");
   return Cost + ExtractCost;
 }

 int BoUpSLP::getGatherCost(Type *Ty) {
   int Cost = 0;
   for (unsigned i = 0, e = cast<VectorType>(Ty)->getNumElements(); i < e; ++i)
     Cost += TTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);
@@ 2,258 lines hidden @@

test/Transforms/SLPVectorizer/X86/pr23510.ll

+; PR23510
+; RUN: opt < %s -mtriple=x86_64-linux-gnu -basicaa -slp-vectorizer -S | FileCheck %s
+; Check that slp does not generate vectorized lshr.
+; CHECK-LABEL: @foo(
+; CHECK-NOT: lshr <2 x i64>
+
+define void @foo(float* nocapture readonly %p1, i32 %p2, i64* nocapture %p3, float* nocapture %p4) {
+entry:
+  %idx.ext = sext i32 %p2 to i64
+  %add.ptr = getelementptr inbounds float, float* %p1, i64 %idx.ext
+  %arrayidx1 = getelementptr inbounds float, float* %add.ptr, i64 5
+  %tmp = load float, float* %arrayidx1, align 4
+  %arrayidx2 = getelementptr inbounds float, float* %p4, i64 3
+  %tmp1 = load float, float* %arrayidx2, align 4
+  %add = fadd float %tmp, %tmp1
+  store float %add, float* %arrayidx2, align 4
+  store i64 0, i64* %p3, align 8
+  %arrayidx4 = getelementptr inbounds i64, i64* %p3, i64 1
+  %tmp2 = load i64, i64* %arrayidx4, align 8
+  %shr5 = lshr i64 %tmp2, 5
+  store i64 %shr5, i64* %arrayidx4, align 8
+  %arrayidx6 = getelementptr inbounds i64, i64* %p3, i64 2
+  %tmp3 = load i64, i64* %arrayidx6, align 8
+  %shr7 = lshr i64 %tmp3, 5
+  store i64 %shr7, i64* %arrayidx6, align 8
+  %arrayidx8 = getelementptr inbounds i64, i64* %p3, i64 3
+  %tmp4 = load i64, i64* %arrayidx8, align 8
+  %shr9 = lshr i64 %tmp4, 5
+  store i64 %shr9, i64* %arrayidx8, align 8
+  %add.ptr11 = getelementptr inbounds float, float* %add.ptr, i64 %idx.ext
+  %tmp5 = load float, float* %add.ptr11, align 4
+  %tmp6 = load float, float* %p4, align 4
+  %add15 = fadd float %tmp5, %tmp6
+  store float %add15, float* %p4, align 4
+  ret void
+}