This is an archive of the discontinued LLVM Phabricator instance.

Differential D40855

[PowerPC] LSR tunings for PowerPC
ClosedPublic

Authored by stefanp on Dec 5 2017, 2:23 PM.

Details

Reviewers

hfinkel
power-llvm-team
bnemanich

Commits

rGf8438e8e5918: [PowerPC] LSR tunings for PowerPC
rL326906: [PowerPC] LSR tunings for PowerPC

Summary

The purpose of this patch is to make LSR generate better code for Power. It does so by overriding isLSRCostLess to compare instruction counts, as X86 does. Performance testing showed a 0.10% improvement in the geometric mean of SPEC 2006 tests on Power systems.

Event Timeline

bnemanich created this revision. Dec 5 2017, 2:23 PM

Herald added subscribers: kbarton, nemanjai. Dec 5 2017, 2:23 PM

echristo added inline comments. Dec 5 2017, 2:30 PM

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
262–263	Probably best to explain why this is useful and you can remove the "PowerPC specific" bit.
test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
4	Good to have a comment here of exactly what you're looking for as part of the test to make updating it easier. Right now I'd have no idea.
70	Can limit the attributes of all the functions as well?

hfinkel added inline comments. Dec 5 2017, 2:36 PM

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
262–263	Also, why this formula? If C1.Insns < C2.Insns, we're deferring to the base implementation, but not if C1.Insns > C2.Insns. That seems different from what X86 does, where it defers whenever the instruction counts aren't equal:

  bool X86TTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                                 TargetTransformInfo::LSRCost &C2) {
    // X86 specific here are "instruction number 1st priority".
    return std::tie(C1.Insns, C1.NumRegs, C1.AddRecCost,
                    C1.NumIVMuls, C1.NumBaseAdds,
                    C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
           std::tie(C2.Insns, C2.NumRegs, C2.AddRecCost,
                    C2.NumIVMuls, C2.NumBaseAdds,
                    C2.ScaleCost, C2.ImmCost, C2.SetupCost);
  }

jtony added a subscriber: jtony. Dec 6 2017, 8:14 AM

jtony added inline comments.

test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
57	We could remove this as well.
71	I agree with Eric, most of these `attribute #` could be cleaned up (removed). We want to only keep the necessary info in the test to make the test case as short as possible.

I added some more comments about the cost calculation and updated the test to simplify it and explain what it was looking for.

bnemanich marked 6 inline comments as done. Dec 12 2017, 9:59 AM

hfinkel added inline comments. Dec 12 2017, 9:14 PM

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
266	This is a proper partial ordering: isLSRCostLess(C1, C2) && isLSRCostLess(C2, C1) is always false. However, I'm not sure the meaning is obvious... For example, imagine that BaseT::isLSRCostLess(C1, C2) is true and BaseT::isLSRCostLess(C2, C1) is false. In this case, isLSRCostLess is either equivalent to BaseT::isLSRCostLess(C1, C2), if (C1.Insns <= C2.Insns), or isLSRCostLess implies equality (if C1.Insns > C2.Insns), because in this latter case, isLSRCostLess(C1, C2) is false and isLSRCostLess(C2, C1) is also false. In other words, you've taken all {C1, C2} for which C1 < C2 (by BaseT::isLSRCostLess) and C1.Insns > C2.Insns, and made them equal in LSR cost. You say that this is a more conservative selection, but it just seems more arbitrary to me (you've enlarged the class of equal things). If this is what you intended, it should have a specific explanation.
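
The effect described in this comment can be reproduced with a small stand-alone sketch. The `Cost` struct, `baseLess`, and `gatedLess` below are hypothetical stand-ins, and the gated formula is only an assumed reconstruction from the wording of the comment, not the code that was under review:

```cpp
#include <cassert>

// Hypothetical, reduced cost record (not the real TargetTransformInfo::LSRCost).
struct Cost {
  unsigned Insns;
  unsigned NumRegs;
};

// Stand-in for BaseT::isLSRCostLess: here it compares register counts only.
bool baseLess(const Cost &A, const Cost &B) { return A.NumRegs < B.NumRegs; }

// Assumed shape of the formula under discussion: defer to the base ordering
// only when A does not need more instructions than B.
bool gatedLess(const Cost &A, const Cost &B) {
  return A.Insns <= B.Insns && baseLess(A, B);
}

void checkEnlargedEqualityClass() {
  Cost C1{5, 2}; // more instructions, fewer registers
  Cost C2{4, 3}; // fewer instructions, more registers
  assert(baseLess(C1, C2));   // the base ordering prefers C1
  assert(!gatedLess(C1, C2)); // rejected because C1.Insns > C2.Insns
  assert(!gatedLess(C2, C1)); // rejected because the base ordering says no
}
```

Under these assumptions neither cost is less than the other, so the pair lands in the same equivalence class even though the base ordering distinguishes them.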

junbuml added a subscriber: junbuml. Dec 13 2017, 6:50 AM

I'm going to start looking at this patch now.

@hfinkel
Hi Hal,
Yes that is correct. We are increasing the size of the class of equal elements.
The idea is that LSR will only replace one formula with another if it finds something with a lower cost. By returning false more often LSR will do fewer replacements and that is what we mean by more conservative. Maybe that's not really clear in the comment and I can update the comment to make it more clear.

I think you are looking for a theoretical reason as to why we picked this formula over others. However, the reason we settled on this formula as opposed to what is on x86 for example was based on the performance testing that we did. We tried a number of options and this one came out the best. We used this formula as an opportunity to tune LSR.
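
The intuition in this reply, that a candidate replaces the current choice only when it compares strictly lower, can be sketched as a greedy selection loop. This is a simplified, hypothetical model and not the actual LSR search; `Cost`, `pickBest`, and the comparators are invented for illustration. A later comment in this thread reports that the real algorithm did not favor unchanged code in this way.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, reduced cost record (not the real TargetTransformInfo::LSRCost).
struct Cost {
  unsigned Insns;
  unsigned NumRegs;
};

using LessFn = std::function<bool(const Cost &, const Cost &)>;

// Greedy model: index 0 is the unchanged formula; a later candidate replaces
// the current choice only if Less(candidate, current) is true.
// Precondition: Candidates is non-empty.
std::size_t pickBest(const std::vector<Cost> &Candidates, const LessFn &Less) {
  std::size_t Best = 0;
  for (std::size_t I = 1; I < Candidates.size(); ++I)
    if (Less(Candidates[I], Candidates[Best]))
      Best = I;
  return Best;
}

void checkPickBest() {
  std::vector<Cost> Candidates{{5, 2}, {4, 3}};
  LessFn ByRegs = [](const Cost &A, const Cost &B) {
    return A.NumRegs < B.NumRegs;
  };
  LessFn ByInsns = [](const Cost &A, const Cost &B) {
    return A.Insns < B.Insns;
  };
  assert(pickBest(Candidates, ByRegs) == 0);  // 3 < 2 is false: keep index 0
  assert(pickBest(Candidates, ByInsns) == 1); // 4 < 5 is true: switch to 1
}
```

In this model a comparator that returns false more often leaves `Best` at index 0 more often, which is the sense of "more conservative" used in the reply.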

Modified the tuning formula to take into consideration the idea that we shouldn’t be enlarging the class of equal items.
After looking into the LSR algorithm we discovered that if we enlarge the space of equal items we do not favor unchanged code as we had initially desired.
Therefore, this change does almost exactly what the default isLSRCostLess function does (from the base class) except that number of instructions has been added as the most important category to compare.
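
A minimal stand-alone sketch of that revised ordering follows. `LSRCost` here is a local stand-in with the field names used in the patch, and `baseLess` assumes the default implementation compares the remaining fields lexicographically in that order; both are illustrative, not the LLVM sources.

```cpp
#include <cassert>
#include <tuple>

// Local stand-in for TargetTransformInfo::LSRCost (field names from the patch).
struct LSRCost {
  unsigned Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ScaleCost,
      ImmCost, SetupCost;
};

// Assumed default-style ordering: every field except Insns, lexicographically.
bool baseLess(const LSRCost &A, const LSRCost &B) {
  return std::tie(A.NumRegs, A.AddRecCost, A.NumIVMuls, A.NumBaseAdds,
                  A.ScaleCost, A.ImmCost, A.SetupCost) <
         std::tie(B.NumRegs, B.AddRecCost, B.NumIVMuls, B.NumBaseAdds,
                  B.ScaleCost, B.ImmCost, B.SetupCost);
}

// Revised ordering: instruction count decides first; ties fall back to the
// base ordering. Equivalent to putting Insns at the front of the tuple.
bool insnsFirstLess(const LSRCost &A, const LSRCost &B) {
  if (A.Insns != B.Insns)
    return A.Insns < B.Insns;
  return baseLess(A, B);
}

void checkOrdering() {
  LSRCost FewInsns{4, 3, 0, 0, 0, 0, 0, 0};
  LSRCost FewRegs{5, 2, 0, 0, 0, 0, 0, 0};
  LSRCost SameInsns{4, 2, 0, 0, 0, 0, 0, 0};
  assert(baseLess(FewRegs, FewInsns));         // base looks at registers first
  assert(insnsFirstLess(FewInsns, FewRegs));   // fewer instructions now wins
  assert(insnsFirstLess(SameInsns, FewInsns)); // Insns tie: base fields decide
}
```

Unlike a gated formula, this ordering never treats two costs as equal unless all compared fields match.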

LGTM

This revision is now accepted and ready to land. Mar 7 2018, 1:53 AM

Closed by commit rL326906: [PowerPC] LSR tunings for PowerPC (authored by stefanp). Mar 7 2018, 8:56 AM

This revision was automatically updated to reflect the committed changes.

One inline grammar-o.

That said, I'm seeing some failures internally here after this - I'm trying to get a testcase for you though. Almost assuredly a latent bug.

llvm/trunk/test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
3 ↗	(On Diff #137405)	less -> fewer :)

@echristo
Hi Eric,
Would you like me to pull this change out and fix the bug before I put it back in?

It'd be great if you wouldn't mind. Still working on getting you a test case.

I've reverted the change in rL327143.

Revision Contents

Path	Size
lib/Target/PowerPC/PPCTargetTransformInfo.h	3 lines
lib/Target/PowerPC/PPCTargetTransformInfo.cpp	13 lines
test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll	57 lines

Diff 132449

lib/Target/PowerPC/PPCTargetTransformInfo.h

  int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                    Type *Ty);

  unsigned getUserCost(const User *U, ArrayRef<const Value *> Operands);

  TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
  void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                               TTI::UnrollingPreferences &UP);

+ bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
+                    TargetTransformInfo::LSRCost &C2);

  /// @}

  /// \name Vector TTI Implementations
  /// @{
  bool useColdCCForColdCall(Function &F);
  bool enableAggressiveInterleaving(bool LoopHasReductions);
  const TTI::MemCmpExpansionOptions *enableMemCmpExpansion(
      bool IsZeroCmp) const;

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

  PPCTTIImpl::enableMemCmpExpansion(bool IsZeroCmp) const {
    }();
    return &Options;
  }

  bool PPCTTIImpl::enableInterleavedAccessVectorization() {
    return true;
  }

+ bool PPCTTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,
+                                TargetTransformInfo::LSRCost &C2) {
+   // This is mainly the default cost calculation. The only difference
+   // is that now the number of instructions is the most important
+   // metric.
+   return std::tie(C1.Insns, C1.NumRegs, C1.AddRecCost,
+                   C1.NumIVMuls, C1.NumBaseAdds,
+                   C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
+          std::tie(C2.Insns, C2.NumRegs, C2.AddRecCost,
+                   C2.NumIVMuls, C2.NumBaseAdds,
+                   C2.ScaleCost, C2.ImmCost, C2.SetupCost);
+ }

  unsigned PPCTTIImpl::getNumberOfRegisters(bool Vector) {
    if (Vector && !ST->hasAltivec() && !ST->hasQPX())
      return 0;
    return ST->hasVSX() ? 64 : 32;
  }

  unsigned PPCTTIImpl::getRegisterBitWidth(bool Vector) const {
    if (Vector) {

test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll

This file was added.

				; RUN: llc < %s -O2 -mtriple=powerpc64le-unknown-linux-gnu | FileCheck %s

				; LLC checks that LSR prefers less instructions to less induction variables
				; Without the PPC specific LSR cost model, extra addition instructions
				; will occur within the loop before the call to _ZN6myTypeC1Ev.

				target datalayout = "e-m:e-i64:64-n32:64"
				target triple = "powerpc64le-unknown-linux-gnu"

				%struct.myType2 = type <{ i32, i8, %struct.myType, [2 x i8] }>
				%struct.myType = type { i8 }

				define nonnull %struct.myType2* @_Z6myIniti(i64 signext %n) local_unnamed_addr #0 personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
				; CHECK-LABEL: _Z6myIniti:
				; CHECK: [[LABEL1:.LBB[0-9A-Z_]+]]:
				; CHECK: mr {{[0-9]+}}, [[REG1:[0-9]+]]
				; CHECK-NEXT: bl _ZN6myTypeC1Ev
				; CHECK: addi [[REG2:[0-9]+]], [[REG2]], -8
				; CHECK-NEXT: addi [[REG1]], [[REG1]], 8
				; CHECK-NEXT: cmpldi [[REG2]], 0
				; CHECK-NEXT: bne 0, [[LABEL1]]

				entry:
				%call = tail call i8* @_Znam(i64 %n) #5
				%cast = bitcast i8* %call to %struct.myType2*
				%arrayctor.end = getelementptr inbounds %struct.myType2, %struct.myType2* %cast, i64 %n
				br label %arrayctor.loop

				arrayctor.loop: ; preds = %invoke.cont, %new.ctorloop
				%arrayctor.cur = phi %struct.myType2* [ %cast, %entry ], [ %arrayctor.next, %invoke.cont ]
				%x.i = getelementptr inbounds %struct.myType2, %struct.myType2* %arrayctor.cur, i64 0, i32 2
				invoke void @_ZN6myTypeC1Ev(%struct.myType* nonnull %x.i)
				to label %invoke.cont unwind label %lpad

				invoke.cont: ; preds = %arrayctor.loop
				%arrayctor.next = getelementptr inbounds %struct.myType2, %struct.myType2* %arrayctor.cur, i64 1
				%arrayctor.done = icmp eq %struct.myType2* %arrayctor.next, %arrayctor.end
				br i1 %arrayctor.done, label %arrayctor.cont, label %arrayctor.loop

				arrayctor.cont: ; preds = %invoke.cont, %entry
				ret %struct.myType2* %cast

				lpad: ; preds = %arrayctor.loop
				%landing = landingpad { i8*, i32 }
				cleanup
				tail call void @_ZdaPv(i8* nonnull %call) #6
				resume { i8*, i32 } %landing
				}

				declare noalias nonnull i8* @_Znam(i64) local_unnamed_addr #2

				declare i32 @__gxx_personality_v0(...)

				declare void @_ZdaPv(i8*) local_unnamed_addr #3

				declare void @_ZN6myTypeC1Ev(%struct.myType*) unnamed_addr #4
