Recently, after D20474, we observed regressions in our internal testing because of changes to induction variable calculations. This patch implements a TargetTransformInfo hook to query the backend for whether splatting a scalar is expensive; it returns true for ARM. This enables vectorizing integer induction phi nodes instead of keeping a scalar and splatting it on each loop iteration.
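For readers following along, the hook pattern being proposed looks roughly like the sketch below. The names and shape here are illustrative only, not the actual LLVM TargetTransformInfo interface touched by this patch:

```cpp
#include <cassert>

// Hypothetical sketch only: names are illustrative, not the real
// TargetTransformInfo API from this patch.
struct TTIBase {
  // Default assumption: splatting a scalar into a vector is cheap.
  virtual bool isSplatExpensive() const { return false; }
  virtual ~TTIBase() = default;
};

// An ARM backend would override the hook: per-iteration VDUPs are
// costly, so the vectorizer should prefer a vector induction variable.
struct ARMTTI : TTIBase {
  bool isSplatExpensive() const override { return true; }
};
```

The vectorizer would then consult the hook when choosing between a widened (vector) IV and a scalar IV that gets broadcast each iteration.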
+Michael, who is the author of the code.
Sam, could you add a testcase for the regression? I'm curious why such a regression is ARM-specific.
Adding @mssimpso, who improved the logic significantly after my original patch. :-)
And yes, as @wmi said, this needs a regression test. Also, could you explain why this is an ARM-specific issue? I want to make sure we're not talking about a case where we should never be using a scalar IV, regardless of platform.
Hi Michael,
The splatting of the indvar seems to be problematic for ARM because a VDUP instruction gets generated, which has a long latency.
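To make the cost concrete, here is a hand-written sketch (my own illustration, not vectorizer output) of the two loop shapes. In the splat-style loop, every vector iteration must broadcast the scalar IV into all lanes, which is the pattern that lowers to VDUP on ARM:

```cpp
#include <cassert>

// Splat style: the scalar IV is broadcast into every lane each
// vector iteration (the VDUP pattern on ARM).
void splat_style(int *out, int n) {
  for (int i = 0; i < n; i += 4) {
    int lanes[4] = {i, i, i, i};        // splat of the scalar IV
    for (int l = 0; l < 4; ++l)
      out[i + l] = (lanes[l] + l) * 3;  // vector-style use of the IV
  }
}

// With a vector IV (or a scalar IV kept out of vector code), no
// per-iteration broadcast is needed.
void scalar_style(int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = i * 3;
}
```

Both compute the same values; the difference is purely in how the IV reaches the vector lanes.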
Hi Everyone,
I'm planning to take a look at this case today. I'll update the thread in a few hours.
Matt.
Hi Sam,
I don't think TTI is the right way to fix this. The underlying issue in the test case is that we're vectorizing instructions marked uniform after vectorization. We should be scalarizing them instead.
This is what happens. We first collect the loop uniforms. The GEPs are marked uniform, as well as the arithmetic producing their indices. Because the IVs have no non-uniform users, they too are marked uniform, which all seems correct to me.
Then we vectorize the IVs. When vectorizing them, we know that we should only need scalar versions (Legal->isScalarAfterVectorization is true because the IVs are uniform). However, to keep WidenMap happy (see D23169) we actually vectorize them with the splat method as well. This is not ideal but also not the underlying issue. Because all users of the IVs should also be scalar, it's OK (but not good) that we vectorize them with the splat method since the vector versions would have no users and would be deleted later on in the clean-up phase.
Of course, the problem is that we are actually still vectorizing the uniform users of the IVs, creating uses of the vector IVs that we later can't remove. We shouldn't be doing this. I think the right fix is to have something like the following in each case in vectorizeBlockInLoop (except for PHIs and branches):
  if (Legal->isScalarAfterVectorization(&I)) {
    scalarizeInstruction(&I);
    continue;
  }
If we've already decided that an instruction should remain scalar, we shouldn't vectorize it. Does this make sense? We will probably want to wait to do this until after D23169 has been committed to avoid large increases in compile time. I tested the above approach (without D23169) on your test case and the ugly vector IVs were removed. Would you mind doing the same to see if your performance is restored?
Matt.
Can we please also have this loop explained at a higher level? The IR in the testcase is not really digestible.
It's pretty odd that we don't already do this.
I assume the reason we don't already get a lot of garbage is because we have LICM before LV?
I can reproduce this with something like:
  void foo(int *p, int *q, short k) {
    for (short i = 0; i < 256; ++i)
      p[i + k] = q[i + k - 1] * 5 + 1;
  }
If we only have, say, "i + k" as the IV user, instcombine can clean this up, but it seems like it can't handle more complex cases. Haven't dug into why.
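To sanity-check the reproducer's semantics independent of codegen, it can be wrapped in a small harness; the array sizes and the k value below are my own choices:

```cpp
#include <cassert>

// The reproducer from above, unchanged.
void foo(int *p, int *q, short k) {
  for (short i = 0; i < 256; ++i)
    p[i + k] = q[i + k - 1] * 5 + 1;
}

// Harness: with k = 1 the loop writes p[i + 1] = q[i] * 5 + 1
// for i in [0, 255]; sizes chosen so all accesses are in bounds.
int check() {
  int p[257] = {0};
  int q[256];
  for (int j = 0; j < 256; ++j)
    q[j] = j;
  foo(p, q, 1);
  return p[256];  // last write: q[255] * 5 + 1
}
```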
Great! I hope you don't mind that I took a stab at implementing this approach over in D23889. We should now be able to scalarize instructions that LVL marks scalar after vectorization without increasing code size pre-instcombine. To do this, we can check within the scalarization logic if an instruction is also uniform, and if so, generate only one scalar value per unroll iteration instead of VF values per iteration.
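To illustrate the one-copy-per-unroll-iteration idea (a hand-written sketch, not the vectorizer's actual output): with VF = 4, a uniform value such as a base address is computed once per vector iteration rather than once per lane, so scalarizing it does not grow the code by a factor of VF:

```cpp
#include <cassert>

// Hand-written sketch of a VF = 4 "vectorized" loop where the base
// address computation is uniform across lanes.
void sum_blocks(const int *a, int *out, int n) {
  for (int i = 0; i < n; i += 4) {  // one scalar IV step per vector iter
    const int *base = a + i;        // uniform: a single copy per iteration
    int s = 0;
    for (int l = 0; l < 4; ++l)     // the VF = 4 lanes
      s += base[l];
    out[i / 4] = s;
  }
}
```

A non-uniform value, by contrast, would need one scalar copy per lane (VF copies per unroll iteration), which is where the code-size concern comes from.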