This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Assign cost of scaling used in addressing mode for ARM cores
ClosedPublic

Authored by javed.absar on Sep 23 2016, 3:22 AM.

Details

Reviewers

rengolin
t.p.northover
jmolloy

Commits

rG85874a936033: [ARM]: Assign cost of scaling used in addressing mode for ARM cores
rL284127: [ARM]: Assign cost of scaling used in addressing mode for ARM cores

Summary

This patch assigns a cost to the scaling used in addressing modes. On many ARM cores, a negated register offset takes longer than a non-negated register offset in a register-offset addressing mode. For instance:

(1) LDR R0, [R1, R2, LSL #2]
(2) LDR R0, [R1, -R2, LSL #2]

Above, (1) takes fewer cycles than (2).

By assigning an appropriate scaling-factor cost, we enable LLVM to make the right trade-offs in the optimization and code-selection phases.
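
As a rough illustration of the idea in this summary (not the patch itself), the cost rule can be modelled as a small pure function. The names `AddrModeSketch` and `scalingCostSketch`, and the `isLegal` and `negatedOffsetIsSlow` parameters, are hypothetical and exist only for this sketch. The real hook is `getScalingFactorCost`, which appears in the diff at the end of this page.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for an addressing mode of the form
// [base + Scale * index]. A negative Scale models a negated register
// offset such as [R1, -R2, LSL #2].
struct AddrModeSketch {
  bool HasBaseReg;
  std::int64_t Scale;
};

// Returns a negative value for an unsupported mode; otherwise an extra
// cost of 1 for a negated offset on a core where that is slower, else 0.
// `isLegal` and `negatedOffsetIsSlow` are assumptions supplied by the caller.
int scalingCostSketch(const AddrModeSketch &AM, bool isLegal,
                      bool negatedOffsetIsSlow) {
  if (!isLegal)
    return -1;
  if (negatedOffsetIsSlow && AM.Scale < 0)
    return 1;
  return 0;
}
```

Under these assumptions, a legal mode with `Scale = -4` costs 1 on a core where negated offsets are slow and 0 elsewhere, so an optimizer comparing two otherwise equal forms would prefer the non-negated one.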

The patch improves performance as follows:

Cortex-A53: spec.twolf: 2.4%, ShootoutC++_matrix: 28.4%, Stanford/Puzzle: 12.4%, IndirectAddressing-dbl: 5.49%
Cortex-A57: spec2006.hmmer: 1.5%, spec2006.lbm: 1.1%

The patch also improves performance on other third-party benchmarks.

Diff Detail

Repository: rL LLVM

Event Timeline

javed.absar updated this revision to Diff 72247. Sep 23 2016, 3:22 AM

javed.absar retitled this revision from to Assign cost of scaling used in addressing mode for ARM cores.

javed.absar updated this object.

javed.absar added reviewers: rengolin, t.p.northover.

javed.absar added a subscriber: llvm-commits.

Herald added subscribers: samparker, rengolin, aemerson. Sep 23 2016, 3:22 AM

jmolloy added a subscriber: jmolloy. Sep 23 2016, 3:25 AM

jmolloy added inline comments.

lib/Target/ARM/ARMISelLowering.cpp
11434 ↗	(On Diff #72247)	Where did "2" come from here? Why not "1"? (a rationale would be useful) This seems very broad-brush. I'd like to see it restricted down to a number of cores where this actually costs cycles. I have no idea if Apple's cores have this property, for example. And do our large cores, like Cortex-A72?

Hi James:
Thanks for the review and feedback.
I have made changes based on your comments:

Have reduced the scope to Cortex-A53 and Cortex-A57 where I can confirm the performance gains currently based on benchmark runs (results shared previously).
Have adjusted the cost from '2' to '1' as this is sufficient to differentiate.

Thanks

Hi Javed,

The improvements are indeed impressive, but the implementation is in the wrong place. We're moving away from using CPU flags and CPU specific code gen decisions. We need to move this into a proper cost model.

cheers,
--renato

lib/Target/ARM/ARMSubtarget.h
415 ↗	(On Diff #72473)	We don't want these at all. Please, make sure this is a feature flag, or a cost model / scheduler decision.

This revision now requires changes to proceed. Sep 26 2016, 9:18 AM

Also, you mention all the improvements, but not regressions. Can you share the full picture of the benchmark results?

Hi Renato:
Here is the full list of performance regressions and improvements.

Cortex-A53

Performance Regressions
MultiSource/Applications/SIBsim4/SIBsim4: 2.00%
MultiSource/Applications/sqlite3/sqlite3: 1.55%

Performance Improvements
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix: 28.42%
SingleSource/Benchmarks/Stanford/Puzzle: 12.40%
MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl: 5.49%
SingleSource/Benchmarks/Misc/ReedSolomon: 4.32%
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt: 3.39%
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt: 2.05%
SingleSource/Benchmarks/Shootout/matrix: 1.95%
MultiSource/Benchmarks/mafft/pairlocalalign: 1.63%
MultiSource/Applications/obsequi/Obsequi: 1.41%
MultiSource/Benchmarks/7zip/7zip-benchmark: 1.09%

External/SPEC/CINT2000/300.twolf/300.twolf: 2.38%

Cortex-A57

Performance Regressions
MultiSource/Applications/hexxagon/hexxagon: 5.59%
MultiSource/Benchmarks/Ptrdist/anagram/anagram: 5.46%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists: 3.15%
External/SPEC/CINT2000/253.perlbmk/253.perlbmk: 1.64%

Performance Improvements
MultiSource/Benchmarks/mafft/pairlocalalign: 12.46%
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix: 5.50%
SingleSource/Benchmarks/Shootout/matrix: 4.88%
SingleSource/Benchmarks/Stanford/Puzzle: 4.36%
MultiSource/Benchmarks/Fhourstones/fhourstones: 1.90%
MultiSource/Applications/SIBsim4/SIBsim4: 1.56%
SingleSource/Benchmarks/Misc/ReedSolomon: 1.27%

spec.cpu2006.ref.456_hmmer: 1.55%
spec.cpu2006.ref.470_lbm: 1.07%

Best Regards
Javed

Hi Renato:

I have now changed the implementation as you requested (sub-target feature flag). Also I have shared the full list of improvements and regressions of the benchmarks. Please let me know if this is all ok now.

Best Regards
Javed

Hi Renato:

I have changed the implementation as you suggested (sub-target feature flag). Also have shared the full list of improvements and regressions for the benchmarks. Please let me know if you are satisfied and I can commit the patch.
Thanks and Best Regards
Javed

rengolin added inline comments. Oct 11 2016, 8:06 AM

lib/Target/ARM/ARMISelLowering.cpp
11604 ↗	(On Diff #73480)	This is not really a cost model... but I don't have a better idea. This is a very specific feature. @jmolloy, are you happy with this?
11608 ↗	(On Diff #73480)	I guess -1 here means illegal, though without seeing the code that is using it, I can't assume.
lib/Target/ARM/ARMISelLowering.h
299 ↗	(On Diff #73480)	Where is this being used?

javed.absar added inline comments. Oct 11 2016, 8:18 AM

lib/Target/ARM/ARMISelLowering.cpp
11608 ↗	(On Diff #73480)	You are right. The same is mentioned in Analysis/TargetTransformInfo.h wherein it is stated "If the AM is not supported, it returns a negative value".
lib/Target/ARM/ARMISelLowering.h
299 ↗	(On Diff #73480)	It is used in LoopStrengthReduce, where decisions to transform a loop are taken based on a cost model that takes the scaling-factor cost into consideration.
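
To make the connection to LoopStrengthReduce more concrete, here is a minimal sketch of how a pass might consume such a cost. It is not the LSR implementation: `Candidate`, `pickCheapest`, and the `baseCost` field are hypothetical names. The only behaviour taken from the discussion above is that a negative scaling cost marks an unsupported addressing mode.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate rewrite of a loop's address computation.
struct Candidate {
  int baseCost;     // assumed cost of everything except the scaling
  int scalingCost;  // negative means the addressing mode is unsupported
};

// Returns the index of the cheapest supported candidate, or -1 if every
// candidate uses an unsupported addressing mode. Ties keep the earliest.
int pickCheapest(const std::vector<Candidate> &cands) {
  int best = -1;
  int bestCost = 0;
  for (std::size_t i = 0; i < cands.size(); ++i) {
    if (cands[i].scalingCost < 0)
      continue;  // skip unsupported addressing modes
    const int total = cands[i].baseCost + cands[i].scalingCost;
    if (best < 0 || total < bestCost) {
      best = static_cast<int>(i);
      bestCost = total;
    }
  }
  return best;
}
```

With two candidates that differ only in scaling cost (1 for a negated offset, 0 otherwise), the sketch selects the non-negated form, which is the behaviour the new test in this patch checks for on Cortex-A53 and Cortex-A57.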

Whenever @jmolloy is happy, I'm happy. Thanks for the changes.

lib/Target/ARM/ARMISelLowering.cpp
11608 ↗	(On Diff #73480)	Right, ok.
lib/Target/ARM/ARMISelLowering.h
299 ↗	(On Diff #73480)	Oh, right, I missed the "override".

jmolloy added inline comments. Oct 12 2016, 5:43 AM

lib/Target/ARM/ARMISelLowering.cpp
11604 ↗	(On Diff #73480)	I know you asked for it to be a feature flag, but... I'm not particularly keen on it as a feature flag. Are we moving towards feature flags for cost model decisions now? :/ Wouldn't something like: switch (Subtarget->getCPUName()) { default: return 0; case Cortex-R52: return AM.Scale < 0 ? 1 : 0; } be alright? It's more cost-model-like...

rengolin added inline comments. Oct 12 2016, 5:59 AM

lib/Target/ARM/ARMISelLowering.cpp
11604 ↗	(On Diff #73480)	Good lord, no! We're moving away from core checks. :)

jmolloy accepted this revision. Oct 12 2016, 6:15 AM

jmolloy added a reviewer: jmolloy.

jmolloy added inline comments.

lib/Target/ARM/ARMISelLowering.cpp
11604 ↗	(On Diff #73480)	Huh. Well it LGTM then. The cost model (1 or 0) doesn't seem particularly bad.
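
The trade-off discussed in this thread (a feature flag versus a check on the CPU name) can be sketched as follows. This is an illustration under stated assumptions, not LLVM code: `SubtargetSketch`, `makeSubtarget`, and `negatedOffsetCost` are hypothetical. The CPU-to-flag table only mirrors what the committed ARM.td change does for cortex-a53, cortex-a57, and cortex-a72.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical per-subtarget tuning flags.
struct SubtargetSketch {
  bool HasFPAO;  // positive address offsets are computed faster
};

// The CPU name is consulted in exactly one place: the table that maps a
// CPU to its feature flags. Unknown CPUs get no tuning flags.
SubtargetSketch makeSubtarget(const std::string &cpu) {
  static const std::unordered_map<std::string, SubtargetSketch> table = {
      {"cortex-a53", SubtargetSketch{true}},
      {"cortex-a57", SubtargetSketch{true}},
      {"cortex-a72", SubtargetSketch{false}},
  };
  const auto it = table.find(cpu);
  return it == table.end() ? SubtargetSketch{false} : it->second;
}

// Cost decisions look only at the flag, never at the CPU name.
int negatedOffsetCost(const SubtargetSketch &ST, long long scale) {
  return (ST.HasFPAO && scale < 0) ? 1 : 0;
}
```

In this arrangement, enabling the behaviour for another core means adding one table entry, and the cost function itself stays free of per-core checks.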

Hi James/Renato:

Please could one of you mark this as 'accepted' if you are ok with it?

Thanks

In D24857#568181, @javed.absar wrote:
Hi James/Renato:
Please could one of you mark this as 'accepted' if you are ok with it?
Thanks

I did already.

Closed by commit rL284127: [ARM]: Assign cost of scaling used in addressing mode for ARM cores (authored by javed.absar). Oct 13 2016, 8:06 AM

This revision was automatically updated to reflect the committed changes.

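Revision Contents

Path    Size
llvm/trunk/lib/Target/ARM/ARM.td    8 lines
llvm/trunk/lib/Target/ARM/ARMISelLowering.h    8 lines
llvm/trunk/lib/Target/ARM/ARMISelLowering.cpp    11 lines
llvm/trunk/lib/Target/ARM/ARMSubtarget.h    4 lines
llvm/trunk/test/CodeGen/ARM/lsr-scale-addr-mode.ll    6 lines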

Diff 74518

llvm/trunk/lib/Target/ARM/ARM.td

@@ 93 lines not shown @@
 def FeatureCrypto : SubtargetFeature<"crypto", "HasCrypto", "true",
     "Enable support for Cryptography extensions",
     [FeatureNEON]>;
 def FeatureCRC : SubtargetFeature<"crc", "HasCRC", "true",
     "Enable support for CRC instructions">;
 // Not to be confused with FeatureHasRetAddrStack (return address stack)
 def FeatureRAS : SubtargetFeature<"ras", "HasRAS", "true",
     "Enable Reliability, Availability and Serviceability extensions">;
+def FeatureFPAO : SubtargetFeature<"fpao", "HasFPAO", "true",
+    "Enable fast computation of positive address offsets">;


 // Cyclone has preferred instructions for zeroing VFP registers, which can
 // execute in 0 cycles.
 def FeatureZCZeroing : SubtargetFeature<"zcz", "HasZeroCycleZeroing", "true",
     "Has zero-cycle zeroing instructions">;

 // Whether or not it may be profitable to unpredicate certain instructions
@@ 658 lines not shown @@ def : ProcNoItin<"cortex-a35", [ARMv8a, ProcA35,
     FeatureHWDivARM,
     FeatureCrypto,
     FeatureCRC]>;

 def : ProcNoItin<"cortex-a53", [ARMv8a, ProcA53,
     FeatureHWDiv,
     FeatureHWDivARM,
     FeatureCrypto,
-    FeatureCRC]>;
+    FeatureCRC,
+    FeatureFPAO]>;

 def : ProcNoItin<"cortex-a57", [ARMv8a, ProcA57,
     FeatureHWDiv,
     FeatureHWDivARM,
     FeatureCrypto,
-    FeatureCRC]>;
+    FeatureCRC,
+    FeatureFPAO]>;

 def : ProcNoItin<"cortex-a72", [ARMv8a, ProcA72,
     FeatureHWDiv,
     FeatureHWDivARM,
     FeatureCrypto,
     FeatureCRC]>;

 def : ProcNoItin<"cortex-a73", [ARMv8a, ProcA73,
@@ 72 lines not shown @@

llvm/trunk/lib/Target/ARM/ARMISelLowering.h

@@ 285 lines not shown @@ public:

     bool allowTruncateForTailCall(Type *Ty1, Type *Ty2) const override;


     /// isLegalAddressingMode - Return true if the addressing mode represented
     /// by AM is legal for this target, for a load/store of the specified type.
     bool isLegalAddressingMode(const DataLayout &DL, const AddrMode &AM,
                                Type *Ty, unsigned AS) const override;

+    /// getScalingFactorCost - Return the cost of the scaling used in
+    /// addressing mode represented by AM.
+    /// If the AM is supported, the return value must be >= 0.
+    /// If the AM is not supported, the return value must be negative.
+    int getScalingFactorCost(const DataLayout &DL, const AddrMode &AM, Type *Ty,
+                             unsigned AS) const override;
+
     bool isLegalT2ScaledAddressingMode(const AddrMode &AM, EVT VT) const;

     /// isLegalICmpImmediate - Return true if the specified immediate is legal
     /// icmp immediate, that is the target has icmp instructions which can
     /// compare a register against the immediate without having to materialize
     /// the immediate into a register.
     bool isLegalICmpImmediate(int64_t Imm) const override;

@@ 402 lines not shown @@

llvm/trunk/lib/Target/ARM/ARMISelLowering.cpp

@@ 11,606 lines not shown @@ bool ARMTargetLowering::allowTruncateForTailCall(Type *Ty1, Type *Ty2) const {

   assert(Ty1->getPrimitiveSizeInBits() <= 64 && "i128 is probably not a noop");

   // Assuming the caller doesn't have a zeroext or signext return parameter,
   // truncation all the way down to i1 is valid.
   return true;
 }

+int ARMTargetLowering::getScalingFactorCost(const DataLayout &DL,
+                                            const AddrMode &AM, Type *Ty,
+                                            unsigned AS) const {
+  if (isLegalAddressingMode(DL, AM, Ty, AS)) {
+    if (Subtarget->hasFPAO())
+      return AM.Scale < 0 ? 1 : 0; // positive offsets execute faster
+    return 0;
+  }
+  return -1;
+}
+

 static bool isLegalT1AddressImmediate(int64_t V, EVT VT) {
   if (V < 0)
     return false;

   unsigned Scale = 1;
   switch (VT.getSimpleVT().SimpleTy) {
   default: return false;
@@ 1,669 lines not shown @@

llvm/trunk/lib/Target/ARM/ARMSubtarget.h

@@ 229 lines not shown @@ protected:

   /// HasRAS - if true, the processor supports RAS extensions
   bool HasRAS = false;

   /// If true, the instructions "vmov.i32 d0, #0" and "vmov.i32 q0, #0" are
   /// particularly effective at zeroing a VFP register.
   bool HasZeroCycleZeroing = false;

+  /// HasFPAO - if true, processor does positive address offset computation faster
+  bool HasFPAO = false;
+
   /// If true, if conversion may decide to leave some instructions unpredicated.
   bool IsProfitableToUnpredicate = false;

   /// If true, VMOV will be favored over VGETLNi32.
   bool HasSlowVGETLNi32 = false;

   /// If true, VMOV will be favored over VDUP.
   bool HasSlowVDUP32 = false;
@@ 202 lines not shown @@ public:
   bool useFPVMLx() const { return !SlowFPVMLx; }
   bool hasVMLxForwarding() const { return HasVMLxForwarding; }
   bool isFPBrccSlow() const { return SlowFPBrcc; }
   bool isFPOnlySP() const { return FPOnlySP; }
   bool hasPerfMon() const { return HasPerfMon; }
   bool hasTrustZone() const { return HasTrustZone; }
   bool has8MSecExt() const { return Has8MSecExt; }
   bool hasZeroCycleZeroing() const { return HasZeroCycleZeroing; }
+  bool hasFPAO() const { return HasFPAO; }
   bool isProfitableToUnpredicate() const { return IsProfitableToUnpredicate; }
   bool hasSlowVGETLNi32() const { return HasSlowVGETLNi32; }
   bool hasSlowVDUP32() const { return HasSlowVDUP32; }
   bool preferVMOVSR() const { return PreferVMOVSR; }
   bool preferISHSTBarriers() const { return PreferISHST; }
   bool expandMLx() const { return ExpandMLx; }
   bool hasVMLxHazards() const { return HasVMLxHazards; }
   bool hasSlowOddRegister() const { return SlowOddRegister; }
@@ 176 lines not shown @@

llvm/trunk/test/CodeGen/ARM/lsr-scale-addr-mode.ll

 ; RUN: llc -mtriple=arm-eabi %s -o - | FileCheck %s
 ; Should use scaled addressing mode.

+; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a53 %s -o - | FileCheck %s -check-prefix CHECK-NONEGOFF-A53
+; RUN: llc -mtriple=arm-eabi -mcpu=cortex-a57 %s -o - | FileCheck %s -check-prefix CHECK-NONEGOFF-A57
+; Should not generate negated register offset
+
 define void @sintzero(i32* %a) nounwind {
 entry:
   store i32 0, i32* %a
   br label %cond_next

 cond_next: ; preds = %cond_next, %entry
   %indvar = phi i32 [ 0, %entry ], [ %tmp25, %cond_next ] ; <i32> [#uses=1]
   %tmp25 = add i32 %indvar, 1 ; <i32> [#uses=3]
   %tmp36 = getelementptr i32, i32* %a, i32 %tmp25 ; <i32*> [#uses=1]
   store i32 0, i32* %tmp36
   icmp eq i32 %tmp25, -1 ; <i1>:0 [#uses=1]
   br i1 %0, label %return, label %cond_next

 return: ; preds = %cond_next
   ret void
 }

 ; CHECK: lsl{{.*}}#2]
+; CHECK-NONEGOFF-A53: [{{r[0-9]+}}, {{r[0-9]+}}, lsl{{.*}}#2]
+; CHECK-NONEGOFF-A57: [{{r[0-9]+}}, {{r[0-9]+}}, lsl{{.*}}#2]