This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Increase the user cost of vector instructions by their legalization cost
Closed · Public

Authored by bnemanich on Oct 16 2017, 10:13 AM.

Details

Reviewers

hfinkel
echristo

Commits

rG488782efa3ea: The cost of splitting a large vector instruction is not being taken into…
rL316174: The cost of splitting a large vector instruction is not being taken into…

Summary

The getUserCost function was not taking into account the cost of splitting a large vector instruction, which led to some loops being over-unrolled. The cost of a vector instruction is now multiplied by the cost of its type legalization, which gives a more accurate cost.
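To illustrate why an underestimated body cost can lead to over-unrolling, here is a minimal sketch. It assumes a deliberately simplified unroller that picks the largest count whose total cost fits under a size threshold; `pickUnrollCount` and the numbers used are hypothetical and are not LLVM's actual heuristics.

```cpp
#include <cassert>

// Hypothetical simplification (not LLVM's real logic): choose the largest
// unroll count such that count * bodyCost stays within the threshold,
// and never return less than 1 (no unrolling).
unsigned pickUnrollCount(unsigned bodyCost, unsigned threshold) {
  assert(bodyCost > 0 && "loop body must have a nonzero cost");
  unsigned count = threshold / bodyCost;
  return count == 0 ? 1 : count;
}
```

Under these assumptions, a loop body costed at 10 with a threshold of 150 would be unrolled 15 times, while the same body costed at 40 (each instruction counted as 4 after splitting) would be unrolled only 3 times.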

Event Timeline

bnemanich created this revision. Oct 16 2017, 10:13 AM

Herald added subscribers: kbarton, nemanjai. Oct 16 2017, 10:13 AM

echristo added inline comments. Oct 16 2017, 1:08 PM

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
194–197	I have concerns about this being Power-only, and about which costs we're applying this to, etc. Thoughts?

bnemanich added inline comments. Oct 17 2017, 8:55 AM

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
194–197	It could be done for all platforms, but I was worried that it would hurt some tuning that I wasn't able to test. This should only affect cases where there are vector instructions in the IR that are larger than the machine can handle. In those cases, getUserCost will now give a more accurate description of the cost. For example, if there is a 16-wide add of i32s, getUserCost will now return a value of 4 instead of 1, since it will eventually be split into 4 instructions on a machine that only has 128-bit-wide vector instructions.
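The arithmetic in that example can be sketched as follows. This is a standalone approximation that assumes power-of-two vector sizes and models legalization as repeated halving until the vector fits in a register; `splitFactor` and `scaledCost` are hypothetical names and do not call into LLVM's getTypeLegalizationCost.

```cpp
#include <cassert>

// Number of register-sized pieces a vector is split into, assuming
// power-of-two sizes and legalization by repeated halving.
unsigned splitFactor(unsigned numElts, unsigned eltBits, unsigned regBits) {
  assert(numElts > 0 && eltBits > 0 && regBits > 0);
  unsigned totalBits = numElts * eltBits;
  unsigned factor = 1;
  while (totalBits > regBits) {
    totalBits /= 2; // split the vector in half
    factor *= 2;    // each split doubles the instruction count
  }
  return factor;
}

// Base cost scaled by the split factor, mirroring the shape of
// LT.first * BaseT::getUserCost(U, Operands) in the patch.
unsigned scaledCost(unsigned baseCost, unsigned numElts, unsigned eltBits,
                    unsigned regBits) {
  return splitFactor(numElts, eltBits, regBits) * baseCost;
}
```

For a <16 x i32> add on a target with 128-bit vector registers, `splitFactor(16, 32, 128)` is 4, so a base cost of 1 becomes 4. A <4 x i32> add already fits in one register and keeps its base cost.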

I guess the number of instructions is rather hard-coded in. I'd definitely like to see this across other targets, but it works for now.

Thanks!

-eric

This revision is now accepted and ready to land. Oct 18 2017, 12:30 PM

Closed by commit rL316174: The cost of splitting a large vector instruction is not being taken into… (authored by gyiu). Oct 19 2017, 11:16 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/PowerPC/PPCTargetTransformInfo.h	2 lines
lib/Target/PowerPC/PPCTargetTransformInfo.cpp	11 lines
test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors.ll	74 lines

Diff 119171

lib/Target/PowerPC/PPCTargetTransformInfo.h

 public:

   using BaseT::getIntImmCost;
   int getIntImmCost(const APInt &Imm, Type *Ty);

   int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
   int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                     Type *Ty);

+  unsigned getUserCost(const User *U, ArrayRef<const Value *> Operands);
+
   TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
   void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                TTI::UnrollingPreferences &UP);

   /// @}

   /// \name Vector TTI Implementations
   /// @{

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

   if (Idx == ImmIdx && Imm.getBitWidth() <= 64) {

     if (ShiftedFree && (Imm.getZExtValue() & 0xFFFF) == 0)
       return TTI::TCC_Free;
   }

   return PPCTTIImpl::getIntImmCost(Imm, Ty);
 }

+unsigned PPCTTIImpl::getUserCost(const User *U,
+                                 ArrayRef<const Value *> Operands) {
+  if (U->getType()->isVectorTy()) {
+    // Instructions that need to be split should cost more.
+    std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, U->getType());
+    return LT.first * BaseT::getUserCost(U, Operands);
+  }
+
+  return BaseT::getUserCost(U, Operands);
+}
+
 void PPCTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                          TTI::UnrollingPreferences &UP) {
   if (ST->getDarwinDirective() == PPC::DIR_A2) {
     // The A2 is in-order with a deep pipeline, and concatenation unrolling
     // helps expose latency-hiding opportunities to the instruction scheduler.
     UP.Partial = UP.Runtime = true;

     // We unroll a lot on the A2 (hundreds of instructions), and the benefits

test/Transforms/LoopUnroll/PowerPC/p8-unrolling-legalize-vectors.ll

This file was added.

				; RUN: opt < %s -S -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr8 -loop-unroll | FileCheck %s
				; RUN: opt < %s -S -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr9 -loop-unroll | FileCheck %s

				target datalayout = "e-m:e-i64:64-n32:64"
				target triple = "powerpc64le-unknown-linux-gnu"

				; Function Attrs: norecurse nounwind
				define i8* @f(i8* returned %s, i32 zeroext %x, i32 signext %k) local_unnamed_addr #0 {
				entry:
				%cmp10 = icmp sgt i32 %k, 0
				br i1 %cmp10, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry
				%wide.trip.count = zext i32 %k to i64
				%min.iters.check = icmp ult i32 %k, 16
				br i1 %min.iters.check, label %for.body.preheader, label %vector.ph

				vector.ph: ; preds = %for.body.lr.ph
				%n.vec = and i64 %wide.trip.count, 4294967280
				%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %x, i32 0
				%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%vec.ind12 = phi <16 x i32> [ <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>, %vector.ph ], [ %vec.ind.next13, %vector.body ]
				%0 = shl <16 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, %vec.ind12
				%1 = and <16 x i32> %0, %broadcast.splat
				%2 = icmp eq <16 x i32> %1, zeroinitializer
				%3 = select <16 x i1> %2, <16 x i8> <i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48>, <16 x i8> <i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49, i8 49>
				%4 = getelementptr inbounds i8, i8* %s, i64 %index
				%5 = bitcast i8* %4 to <16 x i8>*
				store <16 x i8> %3, <16 x i8>* %5, align 1
				%index.next = add i64 %index, 16
				%vec.ind.next13 = add <16 x i32> %vec.ind12, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
				%6 = icmp eq i64 %index.next, %n.vec
				br i1 %6, label %middle.block, label %vector.body

				middle.block: ; preds = %vector.body
				%cmp.n = icmp eq i64 %n.vec, %wide.trip.count
				br i1 %cmp.n, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %middle.block, %for.body.lr.ph
				%indvars.iv.ph = phi i64 [ 0, %for.body.lr.ph ], [ %n.vec, %middle.block ]
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader ]
				%7 = trunc i64 %indvars.iv to i32
				%shl = shl i32 1, %7
				%and = and i32 %shl, %x
				%tobool = icmp eq i32 %and, 0
				%conv = select i1 %tobool, i8 48, i8 49
				%arrayidx = getelementptr inbounds i8, i8* %s, i64 %indvars.iv
				store i8 %conv, i8* %arrayidx, align 1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %middle.block, %entry
				%idxprom1 = sext i32 %k to i64
				%arrayidx2 = getelementptr inbounds i8, i8* %s, i64 %idxprom1
				store i8 0, i8* %arrayidx2, align 1
				ret i8* %s
				}


				; CHECK-LABEL: vector.body
				; CHECK: shl
				; CHECK-NEXT: and
				; CHECK: shl
				; CHECK-NEXT: and
				; CHECK: label %vector.body