This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Allow load/store vectorization.
Abandoned · Public

Authored by jlebar on May 24 2016, 4:09 PM.

Details

Reviewers
tra
hfinkel
Summary

This patch adds enough information to the NVPTX TTI so that the SLP
vectorizer will fire.

This gets us to parity with NVCC on the Eigen benchmark suite. (Without these
changes, we're 30+% slower on many benchmarks.)

Event Timeline

jlebar updated this revision to Diff 58348. May 24 2016, 4:09 PM
jlebar retitled this revision to [NVPTX] Allow load/store vectorization.
jlebar updated this object.
jlebar added reviewers: tra, hfinkel.
jlebar added a subscriber: jholewinski.
hfinkel added inline comments. May 24 2016, 4:21 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
130

This should say 'FIXME', because we should fix that behavior of the SLP vectorizer (I actually thought we had, but maybe we only did in the loop vectorizer).

146

Is the same true for ExtractElement?

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

Please comment on why 1 register.

tra added inline comments. May 24 2016, 4:21 PM
test/Transforms/SLPVectorizer/NVPTX/simple.ll
48

The IR above does not seem to have much to do with load/store vectorization. Perhaps it can be trimmed down.

51

Does this change vectorize loads as well? If so, it would be nice to add a test for that, too.

jlebar updated this revision to Diff 58364. May 24 2016, 5:50 PM
jlebar marked 3 inline comments as done.

Add test checking load vectorization, and make vectorization of very small
functions work.

jlebar added inline comments. May 24 2016, 5:51 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
130
146

Done and added a test, thank you.

test/Transforms/SLPVectorizer/NVPTX/simple.ll
48

Interestingly, toy cases are handled specially, so we need to test both. Added a fix to make toy cases vectorize, plus a test.

jlebar added a comment. Edited May 24 2016, 5:56 PM

We cannot currently vectorize loads across calls, unless those calls are vectorizable intrinsics.

This seems sort of broken to me even outside the context of NVPTX, because it means that if you do something like:

load 4 floats
do a bunch of vectorizable math
call a non-vectorizable function on each element of your bundle
store the 4 results

then we won't vectorize this (except maybe the store). It seems to me that we should calculate the cost of vectorizing this if we un-vectorize the calls. I dunno if that would be a useful optimization anywhere other than on nvptx, though.

The case above is important because none of our intrinsics are vectorizable, so this is basically any kernel that doesn't do exclusively +-/*. But I think I'm happy to look at that in a separate patch, since this is a substantial win as-is.

jlebar updated this revision to Diff 58366. May 24 2016, 6:04 PM

Fix comment.

hfinkel added inline comments. May 24 2016, 6:09 PM
lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

I'm still wondering why this is set to 1.

jlebar added a comment. Edited May 24 2016, 6:13 PM

(Accidental dup comment removed; I cannot figure out phabricator)

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
59

Sorry I missed this one.

The default TTI behavior is to say we have 1 general purpose register and 0 vector regs. Changing it to 1 and 1 seems like the minimal change.

The virtual/physical register configuration in the NVPTX backend is kind of weird. Although we never do register allocation, we still define 4 physical NVPTX registers for each type, except vector types, for which we define no registers at all. That's OK here, because this value is just an upper bound on the number of vector registers of any type.

tra edited edge metadata. May 25 2016, 9:54 AM

LGTM. I'll defer approval to Hal.

jlebar abandoned this revision. Jun 10 2016, 11:28 AM

Abandoning this in favor of D19501, which does a *much* better job of what we want.