This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable merging of adjacent zero stores for all subtargets.
ClosedPublic

Authored by mcrosier on Nov 8 2016, 6:25 AM.

Details

Reviewers

rengolin
t.p.northover
pgode
jmolloy
MatzeB
evandro

Commits

rG10c7aaaee981: [AArch64] Enable merging of adjacent zero stores for all subtargets.
rL286592: [AArch64] Enable merging of adjacent zero stores for all subtargets.

Summary

This optimization merges adjacent zero stores into a wider store.

e.g.,

strh wzr, [x0]
strh wzr, [x0, #2]
; becomes
str wzr, [x0]

e.g.,

str wzr, [x0]
str wzr, [x0, #4]
; becomes
str xzr, [x0]

Previously, this was only enabled for Kryo and Cortex-A57. I'd like to enable it for all subtargets.

Chad
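
As a rough illustration of the transformation described in the summary, here is a minimal, self-contained C++ sketch. It is a toy model and not the code in AArch64LoadStoreOptimizer.cpp: the `ZeroStore` struct and the `mergeAdjacentZeroStores` and `checkSummaryExamples` names are hypothetical, it assumes all stores use the same base register and appear in ascending offset order, and it ignores everything else the real pass has to check (intervening instructions, endianness, and so on).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one zero store: a byte offset from a common base
// register and an access width in bytes (2 for strh, 4 for str w, 8 for str x).
struct ZeroStore {
  int64_t Offset;
  unsigned Width;
};

// Fold each pair of neighbouring zero stores of equal width that cover
// contiguous, ascending bytes into one store of twice the width. Only 2- and
// 4-byte stores are widened, matching the two examples in the summary.
std::vector<ZeroStore>
mergeAdjacentZeroStores(const std::vector<ZeroStore> &In) {
  std::vector<ZeroStore> Out;
  for (std::size_t I = 0; I < In.size(); ++I) {
    const bool Narrow = In[I].Width == 2 || In[I].Width == 4;
    const bool CanMerge =
        Narrow && I + 1 < In.size() && In[I + 1].Width == In[I].Width &&
        In[I + 1].Offset == In[I].Offset + static_cast<int64_t>(In[I].Width);
    if (CanMerge) {
      Out.push_back({In[I].Offset, In[I].Width * 2});
      ++I; // The next store was folded into the wider one.
    } else {
      Out.push_back(In[I]);
    }
  }
  return Out;
}

// Replays the two cases from the summary against the toy model.
void checkSummaryExamples() {
  // strh wzr, [x0]; strh wzr, [x0, #2]  becomes  str wzr, [x0]
  const std::vector<ZeroStore> Halfwords =
      mergeAdjacentZeroStores({{0, 2}, {2, 2}});
  assert(Halfwords.size() == 1);
  assert(Halfwords[0].Offset == 0 && Halfwords[0].Width == 4);

  // str wzr, [x0]; str wzr, [x0, #4]  becomes  str xzr, [x0]
  const std::vector<ZeroStore> Words =
      mergeAdjacentZeroStores({{0, 4}, {4, 4}});
  assert(Words.size() == 1);
  assert(Words[0].Offset == 0 && Words[0].Width == 8);
}
```

Calling `checkSummaryExamples()` exercises both summary cases. Stores that are not contiguous, such as halfword stores at offsets 0 and 4, pass through the sketch unchanged.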

Diff Detail

Repository: rL LLVM

Event Timeline

mcrosier updated this revision to Diff 77185. Nov 8 2016, 6:25 AM

mcrosier retitled this revision to [AArch64] Enable merging of adjacent zero stores for all subtargets.

mcrosier updated this object.

mcrosier added reviewers: jmolloy, t.p.northover, rengolin, pgode, MatzeB, evandro.

mcrosier added subscribers: llvm-commits, gberry.

Herald added a subscriber: aemerson. Nov 8 2016, 6:25 AM

LGTM, nitpick below.

BTW: I tried to find some discussion of why we have this load/store combining feature in the AArch64 target (don't we already have that in GVN and other places in the middleend?) but couldn't find any reviews or discussions that would motivate it.

test/CodeGen/AArch64/arm64-narrow-st-merge.ll
1–8 ↗	(On Diff #77185)	I don't see the benefits of testing this for every CPU in the AArch64 target. Maybe 1 single RUN: is enough here?

This revision is now accepted and ready to land. Nov 10 2016, 2:18 PM

In D26396#592260, @MatzeB wrote:

LGTM, nitpick below.

BTW: I tried to find some discussion of why we have this load/store combining feature in the AArch64 target (don't we already have that in GVN and other places in the middleend?) but couldn't find any reviews or discussions that would motivate it.

Thanks for the review, Matthias. I'll follow up on your question with @junbuml, as he was the original author of this optimization. I thought we did a similar form of combining at isel lowering, but I may be mistaken.

test/CodeGen/AArch64/arm64-narrow-st-merge.ll
1–8 ↗	(On Diff #77185)	Sure.

Closed by commit rL286592: [AArch64] Enable merging of adjacent zero stores for all subtargets. (authored by mcrosier). Nov 11 2016, 6:19 AM

This revision was automatically updated to reflect the committed changes.

mcrosier marked an inline comment as done.

gberry added inline comments. Nov 11 2016, 10:29 AM

test/CodeGen/AArch64/arm64-narrow-st-merge.ll
9 ↗	(On Diff #77185)	Perhaps we should add a run that sets FeatureStrictAlign and check that this merging doesn't happen?

Committed r286617 to show strict align disables the narrow zero store optimization. Thanks, Geoff.
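
To make the gating explicit, the following is a small hedged sketch of the enabling condition before and after this revision, based on the change to runOnMachineFunction in the diff. The `SubtargetFlags` struct and the function names are hypothetical stand-ins; only the two flag names come from AArch64Subtarget.h.

```cpp
#include <cassert>

// Hypothetical stand-in for the two AArch64Subtarget flags involved.
struct SubtargetFlags {
  bool MergeNarrowZeroStores; // Per-CPU feature that this revision removes.
  bool StrictAlign;           // Set by the strict-align subtarget feature.
};

// Before: only CPUs carrying the feature (Kryo, Cortex-A57) merged zero stores.
bool narrowZeroStOptEnabledBefore(const SubtargetFlags &S) {
  return S.MergeNarrowZeroStores && !S.StrictAlign;
}

// After: every subtarget merges zero stores unless strict alignment is required.
bool narrowZeroStOptEnabledAfter(const SubtargetFlags &S) {
  return !S.StrictAlign;
}

// Walks through a generic CPU and a strict-align configuration.
void checkGating() {
  const SubtargetFlags Generic{false, false};
  assert(!narrowZeroStOptEnabledBefore(Generic));
  assert(narrowZeroStOptEnabledAfter(Generic));

  const SubtargetFlags Strict{true, true};
  assert(!narrowZeroStOptEnabledBefore(Strict));
  assert(!narrowZeroStOptEnabledAfter(Strict));
}
```

In this model, strict alignment disables the optimization both before and after the change, which is the behavior the extra test run is meant to pin down.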

Revision Contents

Path                                                           Size
llvm/trunk/lib/Target/AArch64/AArch64.td                       7 lines
llvm/trunk/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp    3 lines
llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h               2 lines
llvm/trunk/test/CodeGen/AArch64/arm64-narrow-st-merge.ll       4 lines

Diff 77612

llvm/trunk/lib/Target/AArch64/AArch64.td

@@ 55 lines hidden @@ def FeatureStrictAlign : SubtargetFeature<"strict-align",
     "StrictAlign", "true",
     "Disallow all unaligned memory "
     "access">;

 def FeatureReserveX18 : SubtargetFeature<"reserve-x18", "ReserveX18", "true",
     "Reserve X18, making it unavailable "
     "as a GPR">;

-def FeatureMergeNarrowZeroSt : SubtargetFeature<"merge-narrow-zero-st",
-    "MergeNarrowZeroStores", "true",
-    "Merge narrow zero store "
-    "instructions">;
-
 def FeatureUseAA : SubtargetFeature<"use-aa", "UseAA", "true",
     "Use alias analysis during codegen">;

 def FeatureBalanceFPOps : SubtargetFeature<"balance-fp-ops", "BalanceFPOps",
     "true",
     "balance mix of odd and even D-registers for fp multiply(-accumulate) ops">;

 def FeaturePredictableSelectIsExpensive : SubtargetFeature<
@@ 100 lines hidden @@

 def ProcA57 : SubtargetFeature<"a57", "ARMProcFamily", "CortexA57",
     "Cortex-A57 ARM processors", [
     FeatureBalanceFPOps,
     FeatureCRC,
     FeatureCrypto,
     FeatureCustomCheapAsMoveHandling,
     FeatureFPARMv8,
-    FeatureMergeNarrowZeroSt,
     FeatureNEON,
     FeaturePerfMon,
     FeaturePostRAScheduler,
     FeaturePredictableSelectIsExpensive
     ]>;

 def ProcA72 : SubtargetFeature<"a72", "ARMProcFamily", "CortexA72",
     "Cortex-A72 ARM processors", [
@@ 54 lines hidden @@ def ProcExynosM2 : SubtargetFeature<"exynosm2", "ARMProcFamily", "ExynosM1",
     FeatureZCZeroing]>;

 def ProcKryo : SubtargetFeature<"kryo", "ARMProcFamily", "Kryo",
     "Qualcomm Kryo processors", [
     FeatureCRC,
     FeatureCrypto,
     FeatureCustomCheapAsMoveHandling,
     FeatureFPARMv8,
-    FeatureMergeNarrowZeroSt,
     FeatureNEON,
     FeaturePerfMon,
     FeaturePostRAScheduler,
     FeaturePredictableSelectIsExpensive,
     FeatureZCZeroing
     ]>;

 def ProcVulcan : SubtargetFeature<"vulcan", "ARMProcFamily", "Vulcan",
@@ 75 lines hidden @@

llvm/trunk/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp

@@ 1,693 lines hidden @@ bool AArch64LoadStoreOpt::runOnMachineFunction(MachineFunction &Fn) {

   // Resize the modified and used register bitfield trackers. We do this once
   // per function and then clear the bitfield each time we optimize a load or
   // store.
   ModifiedRegs.resize(TRI->getNumRegs());
   UsedRegs.resize(TRI->getNumRegs());

   bool Modified = false;
-  bool enableNarrowZeroStOpt =
-      Subtarget->mergeNarrowStores() && !Subtarget->requiresStrictAlign();
+  bool enableNarrowZeroStOpt = !Subtarget->requiresStrictAlign();
   for (auto &MBB : Fn)
     Modified |= optimizeBlock(MBB, enableNarrowZeroStOpt);

   return Modified;
 }

 // FIXME: Do we need/want a pre-alloc pass like ARM has to try to keep
 // loads and stores near one another?
@@ 11 lines hidden @@

llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h

@@ 65 lines hidden @@ protected:
   // HasZeroCycleRegMove - Has zero-cycle register mov instructions.
   bool HasZeroCycleRegMove = false;

   // HasZeroCycleZeroing - Has zero-cycle zeroing instructions.
   bool HasZeroCycleZeroing = false;

   // StrictAlign - Disallow unaligned memory accesses.
   bool StrictAlign = false;
-  bool MergeNarrowZeroStores = false;
   bool UseAA = false;
   bool PredictableSelectIsExpensive = false;
   bool BalanceFPOps = false;
   bool CustomAsCheapAsMove = false;
   bool UsePostRAScheduler = false;
   bool Misaligned128StoreIsSlow = false;
   bool AvoidQuadLdStPairs = false;
   bool UseAlternateSExtLoadCVTF32Pattern = false;
@@ 91 lines hidden @@ public:
   bool requiresStrictAlign() const { return StrictAlign; }

   bool isX18Reserved() const { return ReserveX18; }
   bool hasFPARMv8() const { return HasFPARMv8; }
   bool hasNEON() const { return HasNEON; }
   bool hasCrypto() const { return HasCrypto; }
   bool hasCRC() const { return HasCRC; }
   bool hasRAS() const { return HasRAS; }
-  bool mergeNarrowStores() const { return MergeNarrowZeroStores; }
   bool balanceFPOps() const { return BalanceFPOps; }
   bool predictableSelectIsExpensive() const {
     return PredictableSelectIsExpensive;
   }
   bool hasCustomCheapAsMoveHandling() const { return CustomAsCheapAsMove; }
   bool isMisaligned128StoreSlow() const { return Misaligned128StoreIsSlow; }
   bool avoidQuadLdStPairs() const { return AvoidQuadLdStPairs; }
   bool useAlternateSExtLoadCVTF32Pattern() const {
@@ 72 lines hidden @@

llvm/trunk/test/CodeGen/AArch64/arm64-narrow-st-merge.ll

-; RUN: llc < %s -mtriple aarch64--none-eabi -mcpu=cortex-a57 -verify-machineinstrs | FileCheck %s
-; RUN: llc < %s -mtriple aarch64_be--none-eabi -mcpu=cortex-a57 -verify-machineinstrs | FileCheck %s
-; RUN: llc < %s -mtriple aarch64--none-eabi -mcpu=kryo -verify-machineinstrs | FileCheck %s
+; RUN: llc < %s -mtriple aarch64--none-eabi -verify-machineinstrs | FileCheck %s

 ; CHECK-LABEL: Strh_zero
 ; CHECK: str wzr
 define void @Strh_zero(i16* nocapture %P, i32 %n) {
 entry:
   %idxprom = sext i32 %n to i64
   %arrayidx = getelementptr inbounds i16, i16* %P, i64 %idxprom
   store i16 0, i16* %arrayidx
@@ 170 lines hidden @@