This is an archive of the discontinued LLVM Phabricator instance.

[LV] Clamp MaxVF to power of 2
ClosedPublic

Authored by Ayal on May 24 2020, 9:16 AM.

Details

Reviewers

fhahn
gilr

Commits

rG840450549c91: [LV] Clamp MaxVF to power of 2.

Summary

If a loop has a constant trip count known to be a multiple of MaxVF (times the user UF), LV infers that no tail will be generated for any chosen VF. This relies on the chosen VFs being powers of 2 bounded by MaxVF, and assumes that MaxVF is itself a power of 2.
Make sure the latter holds, in particular when MaxVF is set by a memory dependence distance, which may not be a power of 2.
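
The reasoning in the summary can be checked with a small standalone sketch. It is not the LoopVectorize code: `powerOf2Floor`, `inferNoTail` and `somePow2VFLeavesTail` are hypothetical helpers written for this illustration (the first mimics what LLVM's `PowerOf2Floor` is used for in the patch), and interleaving (UF/IC) is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LLVM's PowerOf2Floor:
// largest power of 2 that is <= X, or 0 when X == 0.
uint32_t powerOf2Floor(uint32_t X) {
  if (X == 0)
    return 0;
  uint32_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// The inference described in the summary, without interleaving:
// "no tail for any chosen VF" if the trip count is a multiple of MaxVF.
bool inferNoTail(uint32_t TC, uint32_t MaxVF) {
  assert(MaxVF > 0 && "MaxVF must be positive");
  return TC > 0 && TC % MaxVF == 0;
}

// Ground truth: does some power-of-2 VF in [2, MaxVF] leave a remainder?
bool somePow2VFLeavesTail(uint32_t TC, uint32_t MaxVF) {
  for (uint64_t VF = 2; VF <= MaxVF; VF *= 2)
    if (TC % VF != 0)
      return true;
  return false;
}
```

With the values from the new test (trip count 15, dependence-limited MaxVF of 3), `inferNoTail(15, 3)` holds even though `somePow2VFLeavesTail(15, 3)` also holds, which is the wrong inference the patch guards against. After clamping, `powerOf2Floor(3)` is 2 and `inferNoTail(15, 2)` no longer holds.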

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ayal created this revision. May 24 2020, 9:16 AM

Herald added a project: Restricted Project. May 24 2020, 9:16 AM

Herald added subscribers: llvm-commits, rkruppe, hiraditya.

LGTM, thanks!

llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll
109	nit both loop and vectorize.enable metadata not needed?

This revision is now accepted and ready to land. May 24 2020, 10:25 AM

Harbormaster completed remote builds in B57752: Diff 265933. May 24 2020, 11:12 AM

Ayal marked an inline comment as done. May 24 2020, 3:34 PM

Ayal added inline comments.

llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll
109	they are needed in order to get the loop vectorized, eventually with VF=2 and a folded tail.

Closed by commit rG840450549c91: [LV] Clamp MaxVF to power of 2. (authored by Ayal). May 25 2020, 1:34 AM

This revision was automatically updated to reflect the committed changes.

Ayal mentioned this in D80870: [LV] Make sure smallest/widest type sizes are powers-of-2. May 30 2020, 1:56 PM

bjope added a subscriber: bjope. Jun 1 2020, 7:06 AM

bjope added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5056	This has caused some problems for us (downstream) and I'm not really sure how to deal with it. Maybe this never happens for in-tree targets, but our target will return 160 for PowerOf2Floor. And I haven't really seen anything that says that TTI.getRegisterBitWidth(true) must return a power-of-2 number. Another, perhaps abnormal, thing is that our frontend supports i40 types. So SmallestType/WidestType isn't guaranteed to be a power of 2 either. Even if we actually want to scalarize operations using i40, doing PowerOf2Floor makes WidestRegister=128. And then we won't get a power-of-2 result from this method, since 128/40 isn't a power of 2 (considering that SmallestType/WidestType may end up not being a power-of-2). And then we trigger the new assert in computeMaxVF. So in some sense this patch limits the types allowed in loops to be power-of-2-sized. And returning something that isn't a power-of-two from TTI.getRegisterBitWidth is now stupid. Maybe there are other ways to ensure MaxVF is a power-of-2, or were these implications for getSmallestAndWidestTypes and getRegisterBitWidth intentional?
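
The arithmetic in the comment above can be reproduced with a short sketch. It models only the two steps the comment mentions (flooring the register width, then dividing by the widest type); `powerOf2Floor`, `isPowerOf2` and `clampedMaxVectorSize` are hypothetical names for this illustration, and the real computeFeasibleMaxVF does more than this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LLVM's PowerOf2Floor:
// largest power of 2 that is <= X, or 0 when X == 0.
uint32_t powerOf2Floor(uint32_t X) {
  if (X == 0)
    return 0;
  uint32_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// True when X has exactly one bit set.
bool isPowerOf2(uint32_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Simplified model of the patched computation:
// floor the register width to a power of 2, then divide by the widest type.
uint32_t clampedMaxVectorSize(uint32_t RegisterBits, uint32_t WidestTypeBits) {
  assert(WidestTypeBits > 0 && "type width must be positive");
  return powerOf2Floor(RegisterBits) / WidestTypeBits;
}
```

Under these assumptions a 160-bit register is floored to 128 bits, and with a 40-bit widest type `clampedMaxVectorSize(160, 40)` is 3, which `isPowerOf2` rejects. With a 32-bit widest type the same register gives 4, so the clamp alone is sufficient only when the type width is itself a power of 2.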

fhahn added inline comments. Jun 1 2020, 7:44 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5056	There's currently a fix being discussed: D80870

bjope added inline comments. Jun 1 2020, 8:19 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5056	Ah, thanks! I had not seen that one.

Revision Contents

Path                                                      Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp           4 lines
llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll    108 lines

Diff 265977

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[4,997 lines hidden; enclosing context: if (!useMaskedInterleavedAccesses(TTI)) {]
     assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
            "No decisions should have been taken at this point");
     // Note: There is no need to invalidate any cost modeling decisions here, as
     // non where taken so far.
     InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();
   }

   unsigned MaxVF = UserVF ? UserVF : computeFeasibleMaxVF(TC);
+  assert((UserVF || isPowerOf2_32(MaxVF)) && "MaxVF must be a power of 2");
   unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
   if (TC > 0 && TC % MaxVFtimesIC == 0) {
     // Accept MaxVF if we do not have a tail.
     LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
     return MaxVF;
   }

   // If we don't know the precise trip count, or if the trip count that we
[32 lines hidden; enclosing context: LoopVectorizationCostModel::computeFeasibleMaxVF(unsigned ConstTripCount) {]
   // Get the maximum safe dependence distance in bits computed by LAA.
   // It is computed by MaxVF * sizeOf(type) * 8, where type is taken from
   // the memory accesses that is most restrictive (involved in the smallest
   // dependence distance).
   unsigned MaxSafeRegisterWidth = Legal->getMaxSafeRegisterWidth();

   WidestRegister = std::min(WidestRegister, MaxSafeRegisterWidth);

+  // Ensure MaxVF is a power of 2; the dependence distance bound may not be.
+  WidestRegister = PowerOf2Floor(WidestRegister);
bjope: This has caused some problems for us (downstream) and I'm not really sure how to deal with it.
fhahn: There's currently a fix being discussed: D80870
bjope: Ah, thanks! I had not seen that one.

   unsigned MaxVectorSize = WidestRegister / WidestType;

   LLVM_DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType
                     << " / " << WidestType << " bits.\n");
   LLVM_DEBUG(dbgs() << "LV: The Widest register safe to use is: "
                     << WidestRegister << " bits.\n");

   assert(MaxVectorSize <= 256 && "Did not expect to pack so many elements"
[3,033 lines hidden]

llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -loop-vectorize -vectorize-num-stores-pred=2 -prefer-predicate-over-epilog -S | FileCheck %s

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

				; Vectorization with dependence checks.

				; Check that a non-power-of-2 MaxVF, calculated based on maximum safe distance,
				; does not lead fold-tail to think that no tail will be generated for any chosen
				; (power of 2) VF.
				; Dependence distance here is 3 iterations.
				; Tiny trip count of 15 is divisible by 3, but any (even) VF will have a tail.

				;unsigned char a [15+3];
				;void maxvf3(){
				; for (int j = 0; j < 15; ++j) {
				; a[j] = 69;
				; a[j+3] = 7;
				; }
				;}

				@a = common local_unnamed_addr global [18 x i8] zeroinitializer, align 16

				define void @maxvf3() {
				; CHECK-LABEL: @maxvf3(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = icmp ule <2 x i32> [[VEC_IND]], <i32 14, i32 14>
				; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i1> [[TMP0]], i32 0
				; CHECK-NEXT: br i1 [[TMP1]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
				; CHECK: pred.store.if:
				; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[TMP2]]
				; CHECK-NEXT: store i8 69, i8* [[TMP3]], align 8
				; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
				; CHECK: pred.store.continue:
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP0]], i32 1
				; CHECK-NEXT: br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
				; CHECK: pred.store.if1:
				; CHECK-NEXT: [[TMP5:%.*]] = add i32 [[INDEX]], 1
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[TMP5]]
				; CHECK-NEXT: store i8 69, i8* [[TMP6]], align 8
				; CHECK-NEXT: br label [[PRED_STORE_CONTINUE2]]
				; CHECK: pred.store.continue2:
				; CHECK-NEXT: [[TMP7:%.*]] = add nuw nsw <2 x i32> <i32 3, i32 3>, [[VEC_IND]]
				; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i1> [[TMP0]], i32 0
				; CHECK-NEXT: br i1 [[TMP8]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
				; CHECK: pred.store.if3:
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x i32> [[TMP7]], i32 0
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[TMP9]]
				; CHECK-NEXT: store i8 7, i8* [[TMP10]], align 8
				; CHECK-NEXT: br label [[PRED_STORE_CONTINUE4]]
				; CHECK: pred.store.continue4:
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP0]], i32 1
				; CHECK-NEXT: br i1 [[TMP11]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6]]
				; CHECK: pred.store.if5:
				; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP7]], i32 1
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[TMP12]]
				; CHECK-NEXT: store i8 7, i8* [[TMP13]], align 8
				; CHECK-NEXT: br label [[PRED_STORE_CONTINUE6]]
				; CHECK: pred.store.continue6:
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 2
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], <i32 2, i32 2>
				; CHECK-NEXT: [[TMP14:%.*]] = icmp eq i32 [[INDEX_NEXT]], 16
				; CHECK-NEXT: br i1 [[TMP14]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[J:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[J_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[AJ:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[J]]
				; CHECK-NEXT: store i8 69, i8* [[AJ]], align 8
				; CHECK-NEXT: [[JP3:%.*]] = add nuw nsw i32 3, [[J]]
				; CHECK-NEXT: [[AJP3:%.*]] = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 [[JP3]]
				; CHECK-NEXT: store i8 7, i8* [[AJP3]], align 8
				; CHECK-NEXT: [[J_NEXT]] = add nuw nsw i32 [[J]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[J_NEXT]], 15
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop !2
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%j = phi i32 [ 0, %entry ], [ %j.next, %for.body ]
				%aj = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 %j
				store i8 69, i8* %aj, align 8
				%jp3 = add nuw nsw i32 3, %j
				%ajp3 = getelementptr inbounds [18 x i8], [18 x i8]* @a, i32 0, i32 %jp3
				store i8 7, i8* %ajp3, align 8
				%j.next = add nuw nsw i32 %j, 1
				%exitcond = icmp eq i32 %j.next, 15
				br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				ret void
				}

				!0 = distinct !{!0, !1}
				!1 = !{!"llvm.loop.vectorize.enable", i1 true}
fhahn: nit both loop and vectorize.enable metadata not needed?
Ayal: they are needed in order to get the loop vectorized, eventually with VF=2 and a folded tail.
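
The VF=2 folded-tail result that the test expects can be modelled with a small simulation of the `maxvf3` loop. This is a simplified sketch, not what the vectorizer emits: `runScalar` and `runFolded` are hypothetical helpers, lanes are executed in order, each vector iteration performs all of its `a[j] = 69` stores before its `a[j+3] = 7` stores (the order shown in the CHECK lines), and lanes past the trip count are masked off using the same `<= 14` bound as the test.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Buf = std::array<uint8_t, 18>;

// Scalar reference: the maxvf3() loop from the test's comment.
Buf runScalar() {
  Buf a{};
  for (int j = 0; j < 15; ++j) {
    a[j] = 69;
    a[j + 3] = 7;
  }
  return a;
}

// Simplified model of vectorizing by VF with a folded tail:
// per vector iteration, all lanes store 69 first, then all lanes store 7;
// lanes whose induction value exceeds 14 are masked off.
Buf runFolded(int VF) {
  assert(VF > 0 && "VF must be positive");
  Buf a{};
  for (int index = 0; index < 15; index += VF) {
    for (int lane = 0; lane < VF; ++lane)
      if (index + lane <= 14)
        a[index + lane] = 69;
    for (int lane = 0; lane < VF; ++lane)
      if (index + lane <= 14)
        a[index + lane + 3] = 7;
  }
  return a;
}
```

In this model `runFolded(2)` matches `runScalar()`, while `runFolded(4)` does not: with four lanes, the store of 7 to `a[3]` lands after the store of 69 to the same element within one vector iteration, which is why the dependence distance of 3 caps MaxVF and the clamp then selects 2.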