This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Turn on Loop/SLP vectorization
ClosedPublic

Authored by bkramer on Apr 26 2018, 9:17 AM.

Details

Reviewers

jlebar
tra
javed.absar

Commits

rG733c7fc55d0d: [NVPTX] Turn on Loop/SLP vectorization
rL331035: [NVPTX] Turn on Loop/SLP vectorization

Summary

Since PTX has grown a <2 x half> datatype, vectorization has become more
important. The late LoadStoreVectorizer intentionally only does loads
and stores, but now arithmetic has to be vectorized for optimal
throughput too.

This is still very limited: SLP vectorization happily creates <2 x half>
if it's a legal type, but a lot of register moves are still needed to
feed the result into a vectorized store. Overall it's a small
performance win from reducing the number of arithmetic instructions.

I haven't really checked what the loop vectorizer does to PTX code; the
cost model there might need some more tweaks. I didn't see it causing
harm, though.

Diff Detail

Repository: rL LLVM

Event Timeline

bkramer created this revision. Apr 26 2018, 9:17 AM

Herald added a reviewer: javed.absar. Apr 26 2018, 9:17 AM

Herald added subscribers: kristof.beyls, jholewinski.

Harbormaster completed remote builds in B17452: Diff 144136. Apr 26 2018, 9:19 AM

tra accepted this revision. Apr 26 2018, 9:35 AM

This revision is now accepted and ready to land. Apr 26 2018, 9:35 AM

jlebar added inline comments. Apr 26 2018, 10:49 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	Does 1 have specific meaning? I don't see this in any of the comments, and that would be a pretty weird API... (Like, did you mean -1?)

bkramer added inline comments. Apr 26 2018, 10:51 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	1 is the default of the generic implementation; I just copied that and removed the check for vector. I'm not even sure whether anyone ever checks the value or just compares it against zero.

jlebar added inline comments. Apr 26 2018, 10:59 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	I see a few places using the actual value returned: LoopStrengthReduce, LoopVectorize...

bkramer added inline comments. Apr 26 2018, 11:26 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	Right, LoopStrengthReduce calls it with (false), which was returning '1' before and after :) I'm not sure about LoopVectorize, but conservatively limiting its vectorization factor is probably a good thing.
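
The exchange above can be illustrated with a minimal sketch. It is not LLVM code: GenericTTI, NVPTXLikeTTI, and vectorizerMayRun are hypothetical names, and the generic default is assumed to return 0 for the vector query (the "check for vector" that was removed). The point is only that a consumer which compares the result against zero is switched on by any nonzero value, while the scalar query is unaffected.

```cpp
// Simplified model of the discussion above; not actual LLVM code.
// All type and function names here are hypothetical.

// Assumed generic default: one scalar register, no vector registers.
struct GenericTTI {
  unsigned getNumberOfRegisters(bool Vector) const { return Vector ? 0 : 1; }
};

// The variant under review: the same value without the check for Vector.
struct NVPTXLikeTTI {
  unsigned getNumberOfRegisters(bool /*Vector*/) const { return 1; }
};

// A consumer that only compares the vector register count against zero.
template <typename TTI> bool vectorizerMayRun(const TTI &Info) {
  return Info.getNumberOfRegisters(/*Vector=*/true) != 0;
}
```

Under these assumptions the gate is closed for GenericTTI and open for NVPTXLikeTTI, and a scalar query such as getNumberOfRegisters(false) returns 1 in both cases.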

jlebar added inline comments. Apr 26 2018, 11:28 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	OK, I think I see what you're after: You want to make the minimal/safest change that gets vectorization on your testcases? That sounds good to me, but can we make the comment on "return 1" express this motivation?

Extend getNumberOfRegisters comment.

Harbormaster completed remote builds in B17460: Diff 144172. Apr 26 2018, 11:45 AM

jlebar accepted this revision. Apr 26 2018, 1:25 PM

hfinkel added a subscriber: hfinkel. Apr 26 2018, 1:59 PM

hfinkel added inline comments.

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	LoopVectorize is using this number to control the amount of interleaving/unrolling it does. The idea is that interleaving is beneficial until you create too much register pressure (i.e., until you start spilling). I don't think that setting this to 1 is a good idea. Perfectly reasonable changes to the loop vectorizer in the future could cause this to disable the vectorizer completely. You should set this, I suspect, to the maximum number of registers you can have per thread at full occupancy.
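
As a rough illustration of why the register count matters for interleaving, the sketch below bounds an interleave count by the registers left after loop-invariant values are accounted for. This is a hypothetical, simplified model: the formula, estimateInterleaveCount, and its parameter names are assumptions made for illustration and are not taken from LoopVectorize.

```cpp
#include <algorithm>

// Largest power of two that is <= X (returns 0 for X == 0).
unsigned floorPowerOf2(unsigned X) {
  if (X == 0)
    return 0;
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Hypothetical model: interleave as long as every copy of the loop body
// still fits in the registers that are not held by loop-invariant values.
unsigned estimateInterleaveCount(unsigned NumRegs, unsigned LoopInvariantRegs,
                                 unsigned MaxLocalUsers) {
  if (NumRegs <= LoopInvariantRegs)
    return 1; // Nothing left to spend on extra copies.
  unsigned Available = NumRegs - LoopInvariantRegs;
  unsigned IC = floorPowerOf2(Available / std::max(1u, MaxLocalUsers));
  return std::max(1u, IC); // Never go below "no interleaving".
}
```

In this model a target that reports a single register never interleaves, whereas a target that reports 32 registers with 2 loop-invariant values and 3 simultaneously live values per iteration would get an estimate of 8.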

bkramer added inline comments. Apr 27 2018, 1:54 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	This is intentional. I want to keep LoopVectorizer as conservative as possible for now. I have benchmarks where SLP makes a clear improvement, for LV it's much less clear and it has the potential of causing huge regressions, especially since NVPTX TTI is not yet tuned for it. Happy to reword the comment more in case that's still unclear and put in a note to revisit making the LV more aggressive. WDYT?

Closed by commit rL331035: [NVPTX] Turn on Loop/SLP vectorization (authored by d0k). Apr 27 2018, 6:39 AM

This revision was automatically updated to reflect the committed changes.

hfinkel added inline comments. Apr 27 2018, 6:50 AM

lib/Target/NVPTX/NVPTXTargetTransformInfo.h
53 ↗	(On Diff #144136)	Please give the TTI functions meaningful values. It's perfectly plausible that this can be used for something else in the future, and you're just asking for trouble when we do need to change this later because the delta is bound to be large. That having been said, we certainly do want the LV to be conservative in terms of increasing register pressure. Luckily, we have a separate TTI function for that: unsigned getMaxInterleaveFactor(unsigned VF) const. Make this function return 1 for all values of VF. I believe this is the default, but in this case I recommend overriding it explicitly, with a comment saying that we want to be conservative about increasing register pressure. In terms of the LV, this should have the same current effect. As I recall, the register file size on a Volta, for example, is 256 kB/SM and you can have 2048 threads/SM. That is 65536 32-bit registers per SM, which leaves 32 32-bit registers per thread at full occupancy. Thus, I'd recommend setting this number to 32.
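
The arithmetic and the two suggested overrides in the comment above can be sketched as follows. The hardware figures are the ones quoted there ("as I recall") and are not verified here; SuggestedNVPTXTTI is a hypothetical stand-in and not the code that was committed.

```cpp
#include <cstdint>

// Figures quoted in the review comment; treated as assumptions.
constexpr std::uint64_t RegisterFileBytesPerSM = 256 * 1024; // 256 kB per SM
constexpr std::uint64_t MaxThreadsPerSM = 2048;
constexpr std::uint64_t BytesPerRegister = 4; // one 32-bit register

constexpr std::uint64_t RegistersPerSM =
    RegisterFileBytesPerSM / BytesPerRegister;
constexpr std::uint64_t RegistersPerThread = RegistersPerSM / MaxThreadsPerSM;

static_assert(RegistersPerSM == 65536, "256 kB of 4-byte registers");
static_assert(RegistersPerThread == 32, "registers per thread at full occupancy");

// Hypothetical stand-in for the suggested overrides.
struct SuggestedNVPTXTTI {
  // Report a meaningful register count instead of a placeholder.
  unsigned getNumberOfRegisters(bool /*Vector*/) const {
    return static_cast<unsigned>(RegistersPerThread);
  }

  // Explicitly disable interleaving to stay conservative about
  // register pressure, independent of the register count above.
  unsigned getMaxInterleaveFactor(unsigned /*VF*/) const { return 1; }
};
```

The design point is the separation of concerns: the register count describes the machine, while the interleave factor carries the decision to keep the loop vectorizer conservative.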

Revision Contents

Path	Size
llvm/trunk/lib/Target/NVPTX/NVPTXTargetTransformInfo.h	12 lines
llvm/trunk/test/Transforms/SLPVectorizer/NVPTX/lit.local.cfg	2 lines
llvm/trunk/test/Transforms/SLPVectorizer/NVPTX/v2f16.ll	40 lines

Diff 144331

llvm/trunk/lib/Target/NVPTX/NVPTXTargetTransformInfo.h

(43 lines not shown)
 public:
   bool hasBranchDivergence() { return true; }

   bool isSourceOfDivergence(const Value *V);

   unsigned getFlatAddressSpace() const {
     return AddressSpace::ADDRESS_SPACE_GENERIC;
   }

+  // NVPTX has infinite registers of all kinds, but the actual machine doesn't.
+  // We conservatively return 1 here which is just enough to enable the
+  // vectorizers but disables heuristics based on the number of registers.
+  // FIXME: Return a more reasonable number, while keeping an eye on
+  // LoopVectorizer's unrolling heuristics.
+  unsigned getNumberOfRegisters(bool Vector) const { return 1; }
+
+  // Only <2 x half> should be vectorized, so always return 32 for the vector
+  // register size.
+  unsigned getRegisterBitWidth(bool Vector) const { return 32; }
+  unsigned getMinVectorRegisterBitWidth() const { return 32; }
+
   // Increase the inlining cost threshold by a factor of 5, reflecting that
   // calls are particularly expensive in NVPTX.
   unsigned getInliningThresholdMultiplier() { return 5; }

   int getArithmeticInstrCost(
       unsigned Opcode, Type *Ty,
       TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
       TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
(27 lines not shown)

llvm/trunk/test/Transforms/SLPVectorizer/NVPTX/lit.local.cfg

				if not 'NVPTX' in config.root.targets:
				    config.unsupported = True

llvm/trunk/test/Transforms/SLPVectorizer/NVPTX/v2f16.ll

				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=nvptx64-nvidia-cuda -mcpu=sm_70 | FileCheck %s
				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=nvptx64-nvidia-cuda -mcpu=sm_40 | FileCheck %s -check-prefix=NOVECTOR

				; CHECK-LABEL: @fusion
				; CHECK: load <2 x half>, <2 x half>*
				; CHECK: fmul fast <2 x half>
				; CHECK: fadd fast <2 x half>
				; CHECK: store <2 x half> %4, <2 x half>

				; NOVECTOR-LABEL: @fusion
				; NOVECTOR: load half
				; NOVECTOR: fmul fast half
				; NOVECTOR: fadd fast half
				; NOVECTOR: fmul fast half
				; NOVECTOR: fadd fast half
				; NOVECTOR: store half
				define void @fusion(i8* noalias nocapture align 256 dereferenceable(19267584) %arg, i8* noalias nocapture readonly align 256 dereferenceable(19267584) %arg1, i32 %arg2, i32 %arg3) local_unnamed_addr #0 {
				%tmp = shl nuw nsw i32 %arg2, 6
				%tmp4 = or i32 %tmp, %arg3
				%tmp5 = shl nuw nsw i32 %tmp4, 2
				%tmp6 = zext i32 %tmp5 to i64
				%tmp7 = or i64 %tmp6, 1
				%tmp10 = bitcast i8* %arg1 to half*
				%tmp11 = getelementptr inbounds half, half* %tmp10, i64 %tmp6
				%tmp12 = load half, half* %tmp11, align 8
				%tmp13 = fmul fast half %tmp12, 0xH5380
				%tmp14 = fadd fast half %tmp13, 0xH57F0
				%tmp15 = bitcast i8* %arg to half*
				%tmp16 = getelementptr inbounds half, half* %tmp15, i64 %tmp6
				store half %tmp14, half* %tmp16, align 8
				%tmp17 = getelementptr inbounds half, half* %tmp10, i64 %tmp7
				%tmp18 = load half, half* %tmp17, align 2
				%tmp19 = fmul fast half %tmp18, 0xH5380
				%tmp20 = fadd fast half %tmp19, 0xH57F0
				%tmp21 = getelementptr inbounds half, half* %tmp15, i64 %tmp7
				store half %tmp20, half* %tmp21, align 2
				ret void
				}

				attributes #0 = { nounwind }