This is an archive of the discontinued LLVM Phabricator instance.

new method TargetTransformInfo::supportsVectorElementLoadStore() for LoopVectorizer
Closed · Public

Authored by jonpa on Mar 6 2017, 11:49 PM.

Details

Reviewers

rengolin
anemet
jmolloy
hfinkel

Summary

Since SystemZ supports vector element load / store instructions, there is no need for extracts / inserts if a vector load / store gets scalarized. This patch lets Target specify that it supports such instructions by means of a new TTI virtual method that defaults to false.

The use for this is in the LoopVectorizer getScalarizationOverhead() method. As a side note, I am thinking perhaps this function should go into BasicTTIImpl.h, but since there are at least two changes in review in this area, I am holding off on that.
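To make the idea in the summary concrete, here is a minimal, self-contained sketch of a hook that defaults to false and is overridden by one target. It is not LLVM code: the types GenericTTISketch and SystemZTTISketch, the OpKind enum, and the function scalarizationOverheadSketch are hypothetical stand-ins, and the unit cost of one insert or extract per lane is an assumption made only for illustration.

```cpp
#include <cassert>

// Simplified stand-ins for the TTI layers; none of these types are LLVM's own.
struct GenericTTISketch {
  virtual ~GenericTTISketch() = default;
  // Default answer: element loads/stores are not known to be available.
  virtual bool supportsVectorElementLoadStore() const { return false; }
};

// Hypothetical target that opts in, as the summary describes for SystemZ.
struct SystemZTTISketch : GenericTTISketch {
  bool supportsVectorElementLoadStore() const override { return true; }
};

enum class OpKind { Load, Store, Other };

// Insert/extract overhead of scalarizing one instruction at factor VF.
// Assumption: each lane costs one unit to insert or extract.
unsigned scalarizationOverheadSketch(const GenericTTISketch &TTI, OpKind Kind,
                                     unsigned VF) {
  assert(VF >= 1 && "vectorization factor must be positive");
  if (VF == 1)
    return 0; // Nothing is scalarized at VF 1.
  bool IsMemOp = Kind == OpKind::Load || Kind == OpKind::Store;
  if (IsMemOp && TTI.supportsVectorElementLoadStore())
    return 0; // Element load/store accesses the lane directly.
  return VF;  // One insert or extract per lane.
}
```

With these assumptions, a scalarized load at VF 4 carries an overhead of 4 on the generic target and 0 on the target that overrides the hook, while non-memory instructions are unaffected.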

Event Timeline

jonpa created this revision. Mar 6 2017, 11:49 PM

Herald added a subscriber: mzolotukhin. Mar 6 2017, 11:49 PM

ping.

This patch does not have any tests, because it will go in with https://reviews.llvm.org/D29631.

I was hoping to make review simpler by creating a separate diff for this issue.

jonpa added reviewers: anemet, jmolloy, hfinkel. Mar 14 2017, 3:10 AM

Why aren't you putting this in TTI.getOperandsScalarizationOverhead?

In D30680#700694, @anemet wrote:

Why aren't you putting this in TTI.getOperandsScalarizationOverhead?

I don't see how that would work, because this is dependent on the instruction opcode. It is only relevant if it is a store or a load, and only for the store case are the operands checked for extracts. Are you suggesting adding an extra parameter to that function (opcode)?

ping.

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

In D30680#713268, @anemet wrote:

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Is this also in the context of scalarizing a load / store?

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

I am not sure whether you mean that there is no scalarization cost and a lowered (halved) memoryOpCost, or that you could extract / insert two elements at a time.

In the latter case, I suppose we could use something like getLoadStoreScalarizationDiscount(VF), instead of the current patch.

In the first case, I suppose you should use this new hook, along with adjusting the getMemoryOpCost() method.

ping. Adam - any comments?

In D30680#713835, @jonpa wrote:

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

We have something like this on ARM too. ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element. That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular store which is a full write. I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate. Doesn't this match the situation on SystemZ?

In D30680#718786, @anemet wrote:

We have something like this on ARM too. ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element. That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular store which is a full write. I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate. Doesn't this match the situation on SystemZ?

As far as I know, there is no extra penalty on SystemZ for using a vector element load, so scalarizing a vector load will really cost e.g. 4 loads at VF 4. This should be better than doing 4 scalar loads and 4 inserts.

Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

What about the handling of two registers at a time you mentioned earlier?
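The arithmetic behind the "4 loads at VF 4" argument above can be sketched as follows. The function scalarizedLoadCostSketch and its parameters are hypothetical, and the unit costs used in the usage note are assumptions, not measured SystemZ or ARM numbers.

```cpp
#include <cassert>

// Cost of a scalarized vector load at factor VF.
// LoadCost and InsertCost are assumed per-lane costs supplied by the caller.
unsigned scalarizedLoadCostSketch(unsigned VF, unsigned LoadCost,
                                  unsigned InsertCost,
                                  bool EfficientElementLoad) {
  assert(VF >= 1 && "vectorization factor must be positive");
  // VF memory operations are needed in either case.
  unsigned MemCost = VF * LoadCost;
  // Inserts are only needed when the loaded scalar has to be moved into
  // the vector register by a separate instruction.
  unsigned InsertOverhead = EfficientElementLoad ? 0 : VF * InsertCost;
  return MemCost + InsertOverhead;
}
```

Assuming unit costs for both a load and an insert, VF 4 gives 4 with element loads and 8 with scalar loads plus inserts.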

bjope added a subscriber: bjope. Apr 4 2017, 11:33 PM

In D30680#718817, @jonpa wrote:

As far as I know, there is no extra penalty on SystemZ for using a vector element load, so scalarizing a vector load will really cost e.g. 4 loads at VF 4. This should be better than doing 4 scalar loads and 4 inserts.

My point is not whether it's better or not (it certainly is shorter) but whether 4 scalar loads have the same cost as four vector-element loads. The hook would state the latter. Anyhow, for in-order processors I could see how this could be true.

Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

Yes, I think so.

What about the handling of two registers at a time you mentioned earlier?

I convinced myself that that is a separate issue. There, we want to communicate that loading or storing a pair of registers (<= 64 bits) only takes one instruction in scalar mode.

include/llvm/Analysis/TargetTransformInfo.h
437	Needs comment.
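The IsStore idea discussed above was not part of this patch. As a rough sketch of what such an extension might look like, the following uses a hypothetical TargetSketch type and a hypothetical hasEfficientElementAccess(bool IsStore) query, again assuming one unit of insert/extract cost per lane.

```cpp
#include <cassert>

// Hypothetical per-target description; not an LLVM type.
struct TargetSketch {
  bool CheapElementLoad;
  bool CheapElementStore;

  // Hypothetical hook shape with the IsStore flag suggested in the review.
  bool hasEfficientElementAccess(bool IsStore) const {
    return IsStore ? CheapElementStore : CheapElementLoad;
  }
};

// Insert/extract overhead for a scalarized load or store at factor VF.
// Assumption: one unit per lane when no efficient element access exists.
unsigned laneOverheadSketch(const TargetSketch &T, bool IsStore, unsigned VF) {
  assert(VF >= 1 && "vectorization factor must be positive");
  if (VF == 1 || T.hasEfficientElementAccess(IsStore))
    return 0;
  return VF;
}
```

A target that only treats element stores as efficient would then keep the insert cost for scalarized loads while dropping the extract cost for scalarized stores.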

Comment added.
Test case for SystemZ.

Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

Yes, I think so.

OK - well, in that case I guess you are fine with this patch, since it will be easy for you to extend it?

Added 'REQUIRES: asserts' to test case.

anemet added inline comments. Apr 7 2017, 8:52 AM

include/llvm/Analysis/TargetTransformInfo.h
437–439	Again, this should not state the presence of these instructions but whether they are as efficient as their scalar counterparts. Please rephrase. It's probably even better to express this in the name of the hook, e.g. supportsEfficientVectorElementLoadStore.

Updated per review with new method name.

LGTM.

This revision is now accepted and ready to land. Apr 11 2017, 9:05 AM

Thanks for review!
r300056

jonpa closed this revision. Apr 12 2017, 5:54 AM

Revision Contents

Path	Size
include/llvm/Analysis/TargetTransformInfo.h	10 lines
include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
lib/Analysis/TargetTransformInfo.cpp	4 lines
lib/Target/SystemZ/SystemZTargetTransformInfo.h	1 line
lib/Transforms/Vectorize/LoopVectorize.cpp	8 lines
test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll	33 lines

Diff 94649

include/llvm/Analysis/TargetTransformInfo.h

[428 lines hidden] public:
   /// containing this constant value for the target.
   bool shouldBuildLookupTablesForConstant(Constant *C) const;

   unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) const;

   unsigned getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                             unsigned VF) const;

+  /// If target has efficient vector element load/store instructions, it can
+  /// return true here so that insertion/extraction costs are not added to
+  /// the scalarization cost of a load/store.
+  bool supportsEfficientVectorElementLoadStore() const;
+
   /// \brief Don't restrict interleaved unrolling to small loops.
   bool enableAggressiveInterleaving(bool LoopHasReductions) const;

   /// \brief Enable matching of interleaved access groups.
   bool enableInterleavedAccessVectorization() const;

   /// \brief Indicate that it is potentially unsafe to automatically vectorize
   /// floating-point operations because the semantics of vector and scalar
[331 lines hidden] public:
   virtual unsigned getJumpBufAlignment() = 0;
   virtual unsigned getJumpBufSize() = 0;
   virtual bool shouldBuildLookupTables() = 0;
   virtual bool shouldBuildLookupTablesForConstant(Constant *C) = 0;
   virtual unsigned
   getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) = 0;
   virtual unsigned getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                                     unsigned VF) = 0;
+  virtual bool supportsEfficientVectorElementLoadStore() = 0;
   virtual bool enableAggressiveInterleaving(bool LoopHasReductions) = 0;
   virtual bool enableInterleavedAccessVectorization() = 0;
   virtual bool isFPVectorizationPotentiallyUnsafe() = 0;
   virtual bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                               unsigned BitWidth,
                                               unsigned AddressSpace,
                                               unsigned Alignment,
                                               bool *Fast) = 0;
[188 lines hidden]
   unsigned getScalarizationOverhead(Type *Ty, bool Insert,
                                     bool Extract) override {
     return Impl.getScalarizationOverhead(Ty, Insert, Extract);
   }
   unsigned getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                             unsigned VF) override {
     return Impl.getOperandsScalarizationOverhead(Args, VF);
   }

+  bool supportsEfficientVectorElementLoadStore() override {
+    return Impl.supportsEfficientVectorElementLoadStore();
+  }
+
   bool enableAggressiveInterleaving(bool LoopHasReductions) override {
     return Impl.enableAggressiveInterleaving(LoopHasReductions);
   }
   bool enableInterleavedAccessVectorization() override {
     return Impl.enableInterleavedAccessVectorization();
   }
   bool isFPVectorizationPotentiallyUnsafe() override {
     return Impl.isFPVectorizationPotentiallyUnsafe();
[272 lines hidden]

include/llvm/Analysis/TargetTransformInfoImpl.h

[256 lines hidden] public:

   unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) {
     return 0;
   }

   unsigned getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                             unsigned VF) { return 0; }

+  bool supportsEfficientVectorElementLoadStore() { return false; }
+
   bool enableAggressiveInterleaving(bool LoopHasReductions) { return false; }

   bool enableInterleavedAccessVectorization() { return false; }

   bool isFPVectorizationPotentiallyUnsafe() { return false; }

   bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                       unsigned BitWidth,
[404 lines hidden]

lib/Analysis/TargetTransformInfo.cpp

[191 lines hidden]
 }

 unsigned TargetTransformInfo::
 getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                  unsigned VF) const {
   return TTIImpl->getOperandsScalarizationOverhead(Args, VF);
 }

+bool TargetTransformInfo::supportsEfficientVectorElementLoadStore() const {
+  return TTIImpl->supportsEfficientVectorElementLoadStore();
+}
+
 bool TargetTransformInfo::enableAggressiveInterleaving(bool LoopHasReductions) const {
   return TTIImpl->enableAggressiveInterleaving(LoopHasReductions);
 }

 bool TargetTransformInfo::enableInterleavedAccessVectorization() const {
   return TTIImpl->enableInterleavedAccessVectorization();
 }

[328 lines hidden]

lib/Target/SystemZ/SystemZTargetTransformInfo.h

[49 lines hidden] public:
   /// @}

   /// \name Vector TTI Implementations
   /// @{

   unsigned getNumberOfRegisters(bool Vector);
   unsigned getRegisterBitWidth(bool Vector);

+  bool supportsEfficientVectorElementLoadStore() { return true; }
   bool enableInterleavedAccessVectorization() { return true; }

   int getArithmeticInstrCost(
       unsigned Opcode, Type *Ty,
       TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
       TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
       TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,
       TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
[23 lines hidden]

lib/Transforms/Vectorize/LoopVectorize.cpp

[3,755 lines hidden]
 /// convenience wrapper for the type-based getScalarizationOverhead API.
 static unsigned getScalarizationOverhead(Instruction *I, unsigned VF,
                                          const TargetTransformInfo &TTI) {
   if (VF == 1)
     return 0;

   unsigned Cost = 0;
   Type *RetTy = ToVectorTy(I->getType(), VF);
-  if (!RetTy->isVoidTy())
+  if (!RetTy->isVoidTy() &&
+      (!isa<LoadInst>(I) ||
+       !TTI.supportsEfficientVectorElementLoadStore()))
     Cost += TTI.getScalarizationOverhead(RetTy, true, false);

   if (CallInst *CI = dyn_cast<CallInst>(I)) {
     SmallVector<const Value *, 4> Operands(CI->arg_operands());
     Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
-  } else {
+  }
+  else if (!isa<StoreInst>(I) ||
+           !TTI.supportsEfficientVectorElementLoadStore()) {
     SmallVector<const Value *, 4> Operands(I->operand_values());
     Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
   }

   return Cost;
 }

 // Estimate cost of a call instruction CI if it were vectorized with factor VF.
[4,143 lines hidden]

test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll

This file was added.

; REQUIRES: asserts
; RUN: opt -mtriple=s390x-unknown-linux -mcpu=z13 -loop-vectorize \
; RUN:   -force-vector-width=4 -debug-only=loop-vectorize \
; RUN:   -disable-output -enable-interleaved-mem-accesses=false < %s 2>&1 | \
; RUN:   FileCheck %s
;
; Check that a scalarized load/store does not get a cost for inserts/
; extracts, since z13 supports element load/store.

define void @fun(i32* %data, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %tmp0 = getelementptr inbounds i32, i32* %data, i64 %i
  %tmp1 = load i32, i32* %tmp0, align 4
  %tmp2 = add i32 %tmp1, 1
  store i32 %tmp2, i32* %tmp0, align 4
  %i.next = add nuw nsw i64 %i, 2
  %cond = icmp slt i64 %i.next, %n
  br i1 %cond, label %for.body, label %for.end

for.end:
  ret void

; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: %tmp1 = load i32, i32* %tmp0, align 4
; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %tmp2, i32* %tmp0, align 4

; CHECK: LV: Scalarizing: %tmp1 = load i32, i32* %tmp0, align 4
; CHECK: LV: Scalarizing: store i32 %tmp2, i32* %tmp0, align 4
}