This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/
-
trunk/
-
lib/Target/AArch64/
-
Target/
-
AArch64/
-
AArch64ISelLowering.cpp
-
AArch64TargetTransformInfo.h
-
AArch64TargetTransformInfo.cpp
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
-
bcmp-inline-small.ll

Differential D64805

[AArch64] Expand bcmp() for small comparisons
ClosedPublic

Authored by evandro on Jul 16 2019, 9:20 AM.

Download Raw Diff

Details

Reviewers

rengolin
efriedma
SjoerdMeijer
t.p.northover
kristof.beyls
brzycki

Commits

rG13c43456a9a8: Merging r367898: --------------------------------------------------------------…
rL368017: Merging r367898:
rGa005c1ac4f3b: [AArch64] Expand bcmp() for small block lengths
rL367898: [AArch64] Expand bcmp() for small block lengths

Summary

Patch D56593 by @courbet causes bcmp() to be emitted in some cases. Unfortunately, TTI::MemCmpExpansionOptions() is not overridden by AArch64, which results in calls to bcmp() even for small constant values. In a proprietary benchmark we see a performance drop of about 12% on PNG compression before this patch, though it passes all tests.

This patch mirrors how X86 initializes TTI::MemCmpExpansionOptions() to then expand calls to bcmp() when appropriate. No tuning of the parameters was performed, but, at this point, it's enough to recover the performance drop above.

This problem also exists on ARM. Once a consensus is reached for AArch64, we can work to fix ARM as well.

Authors: Evandro Menezes (@evandro) <e.menezes@samsung.com>, Brian Rzycki (@brzycki) <b.rzycki@samsung.com>

Diff Detail

Repository: rL LLVM

Event Timeline

brzycki created this revision.Jul 16 2019, 9:20 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 16 2019, 9:20 AM

Herald added subscribers: hiraditya, javed.absar. · View Herald Transcript

SjoerdMeijer added inline comments.Jul 17 2019, 1:04 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	why 2?
626 ↗	(On Diff #210116)	How about vector loads and alignment? Can we generate them for this, and do we then also need 16 byte loads to loadsizes?
llvm/test/CodeGen/AArch64/bcmp-inline-small.ll
2 ↗	(On Diff #210116)	I think tests are missing for: code-size edge cases for NumLoadsPerBlock and MaxNumLoads

Hello @SjoerdMeijer , thank you for the review.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	@evandro selected these based on the x8_64 values set in `llvm/lib/Target/X86/X86TargetTransformInfo.cpp`: Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize); Options.NumLoadsPerBlock = 2; // ... if (ST->is64Bit()) { Options.LoadSizes.push_back(8); } Options.LoadSizes.push_back(4); Options.LoadSizes.push_back(2); Options.LoadSizes.push_back(1); I do not know why x86_64 selected these values.
626 ↗	(On Diff #210116)	Interesting question, I'll defer to @evandro when he can comment on the ticket. There is a bit more logic in x86_64 he stripped out when building this patch: if (IsZeroCmp) { // Only enable vector loads for equality comparison. Right now the vector // version is not as fast for three way compare (see #33329). // TODO: enable AVX512 when the DAG is ready. // if (ST->hasAVX512()) Options.LoadSizes.push_back(64); const unsigned PreferredWidth = ST->getPreferVectorWidth(); if (PreferredWidth >= 256 && ST->hasAVX2()) Options.LoadSizes.push_back(32); if (PreferredWidth >= 128 && ST->hasSSE2()) Options.LoadSizes.push_back(16); // All GPR and vector loads can be unaligned. SIMD compare requires integer // vectors (SSE2/AVX2). Options.AllowOverlappingLoads = true; }
llvm/test/CodeGen/AArch64/bcmp-inline-small.ll
2 ↗	(On Diff #210116)	Understood. I'll try to build cases to accommodate these checks.

FYI there are some benchmarks in https://reviews.llvm.org/D64082 that are not submitted but could be used as basis for making decisions for this patch.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	This does trades loads for jumps: with this setting, each block will be twice as big (more work per block) but there will be half as many blocks. This was found to be more efficient in general on x86, and might not be true for arm. This will have to be benchmarked.
626 ↗	(On Diff #210116)	For x86, vector loads are enabled only for equality comparison because there is no efficient vector-sized bswap. I don't know much about arm, but: if vector loads and equality comparisons are efficient there is no reason not to use them if IsZeroCmp if there is an efficient vector-sized bswap vector sizes can be used in both cases

evandro added inline comments.Jul 23 2019, 6:43 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	As @courbet said, the best figure may be different, but this seemed like a sensible starting point. Since A64 has more registers, methinks that a higher value might be beneficial, but it depends on how many live registers there are.
626 ↗	(On Diff #210116)	Vector loads are efficient, performance wise, on A64, but typically not so much power wise. That's why I'm not adding them just yet. If so, then vector byte swaps may be used.

evandro commandeered this revision.Jul 24 2019, 5:06 PM

evandro edited reviewers, added: brzycki; removed: evandro.

evandro updated this revision to Diff 211642.Jul 24 2019, 5:09 PM

evandro retitled this revision from [AArch64] Don't call bcmp for small byte comparisons to [AArch64] Expand bcmp() for small comparisons.

evandro edited the summary of this revision. (Show Details)

evandro marked 10 inline comments as done.Jul 24 2019, 5:12 PM

Ping! 🔔

Still think regressions tests are missing, e.g. for code-size (see a previous comments).

SjoerdMeijer added inline comments.Aug 5 2019, 8:38 AM

llvm/test/CodeGen/AArch64/bcmp-inline-small.ll
29 ↗	(On Diff #211642)	ah, sorry, that test case is here

SjoerdMeijer added inline comments.Aug 5 2019, 8:46 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	These options look a lot more sensible now as TLI is used, but has this actually been benchmarked?
626 ↗	(On Diff #210116)	Adding this comment as a `// FIXME` or `// TODO` would be good I guess.

evandro updated this revision to Diff 213375.Aug 5 2019, 9:07 AM

evandro marked 2 inline comments as done.

evandro edited the summary of this revision. (Show Details)

evandro added inline comments.Aug 5 2019, 10:13 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
625 ↗	(On Diff #210116)	As I said above, not extensively. But enough to close the gap that motivated this patch. As it stands, even without carefully mined values, at least the egregious regression from not inlining `bcmp()` is addressed.

Okay, cheers, looks reasonable to me.

This revision is now accepted and ready to land.Aug 5 2019, 10:40 AM

Closed by commit rL367898: [AArch64] Expand bcmp() for small block lengths (authored by evandro). · Explain WhyAug 5 2019, 11:10 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

4 lines

AArch64TargetTransformInfo.h

3 lines

AArch64TargetTransformInfo.cpp

13 lines

test/

CodeGen/

AArch64/

bcmp-inline-small.ll

44 lines

Diff 213410

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 623 Lines • ▼ Show 20 Lines	AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,

MaxGluedStoresPerMemcpy = 4;		MaxGluedStoresPerMemcpy = 4;
MaxStoresPerMemcpyOptSize = 4;		MaxStoresPerMemcpyOptSize = 4;
MaxStoresPerMemcpy = Subtarget->requiresStrictAlign()		MaxStoresPerMemcpy = Subtarget->requiresStrictAlign()
? MaxStoresPerMemcpyOptSize : 16;		? MaxStoresPerMemcpyOptSize : 16;

MaxStoresPerMemmoveOptSize = MaxStoresPerMemmove = 4;		MaxStoresPerMemmoveOptSize = MaxStoresPerMemmove = 4;

		MaxLoadsPerMemcmpOptSize = 4;
		MaxLoadsPerMemcmp = Subtarget->requiresStrictAlign()
		? MaxLoadsPerMemcmpOptSize : 8;

setStackPointerRegisterToSaveRestore(AArch64::SP);		setStackPointerRegisterToSaveRestore(AArch64::SP);

setSchedulingPreference(Sched::Hybrid);		setSchedulingPreference(Sched::Hybrid);

EnableExtLdPromotion = true;		EnableExtLdPromotion = true;

// Set required alignment.		// Set required alignment.
setMinFunctionAlignment(2);		setMinFunctionAlignment(2);
▲ Show 20 Lines • Show All 11,567 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 124 Lines • ▼ Show 20 Lines	int getArithmeticInstrCost(
TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,		TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
ArrayRef<const Value > Args = ArrayRef<const Value >());		ArrayRef<const Value > Args = ArrayRef<const Value >());

int getAddressComputationCost(Type Ty, ScalarEvolution SE, const SCEV *Ptr);		int getAddressComputationCost(Type Ty, ScalarEvolution SE, const SCEV *Ptr);

int getCmpSelInstrCost(unsigned Opcode, Type ValTy, Type CondTy,		int getCmpSelInstrCost(unsigned Opcode, Type ValTy, Type CondTy,
const Instruction *I = nullptr);		const Instruction *I = nullptr);

		TTI::MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
		bool IsZeroCmp) const;

int getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		int getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace, const Instruction *I = nullptr);		unsigned AddressSpace, const Instruction *I = nullptr);

int getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);		int getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);

void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 612 Lines • ▼ Show 20 Lines	if (SelCondTy.isSimple() && SelValTy.isSimple()) {
SelCondTy.getSimpleVT(),		SelCondTy.getSimpleVT(),
SelValTy.getSimpleVT()))		SelValTy.getSimpleVT()))
return Entry->Cost;		return Entry->Cost;
}		}
}		}
return BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, I);		return BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, I);
}		}

		AArch64TTIImpl::TTI::MemCmpExpansionOptions
		AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
		TTI::MemCmpExpansionOptions Options;
		Options.AllowOverlappingLoads = !ST->requiresStrictAlign();
		Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
		Options.NumLoadsPerBlock = Options.MaxNumLoads;
		// TODO: Though vector loads usually perform well on AArch64, in some targets
		// they may wake up the FP unit, which raises the power consumption. Perhaps
		// they could be used with no holds barred (-O3).
		Options.LoadSizes = {8, 4, 2, 1};
		return Options;
		}

int AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,		int AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,
unsigned Alignment, unsigned AddressSpace,		unsigned Alignment, unsigned AddressSpace,
const Instruction *I) {		const Instruction *I) {
auto LT = TLI->getTypeLegalizationCost(DL, Ty);		auto LT = TLI->getTypeLegalizationCost(DL, Ty);

if (ST->isMisaligned128StoreSlow() && Opcode == Instruction::Store &&		if (ST->isMisaligned128StoreSlow() && Opcode == Instruction::Store &&
LT.second.is128BitVector() && Alignment < 16) {		LT.second.is128BitVector() && Alignment < 16) {
// Unaligned stores are extremely inefficient. We don't split all		// Unaligned stores are extremely inefficient. We don't split all
▲ Show 20 Lines • Show All 374 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/AArch64/bcmp-inline-small.ll

				; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu \| FileCheck %s --check-prefixes=CHECK,CHECKN
				; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu -mattr=strict-align \| FileCheck %s --check-prefixes=CHECK,CHECKS

				declare i32 @bcmp(i8, i8, i64) nounwind readonly
				declare i32 @memcmp(i8, i8, i64) nounwind readonly

				define i1 @bcmp_b2(i8* %s1, i8* %s2) {
				entry:
				%bcmp = call i32 @bcmp(i8* %s1, i8* %s2, i64 15)
				%ret = icmp eq i32 %bcmp, 0
				ret i1 %ret

				; CHECK-LABEL: bcmp_b2:
				; CHECK-NOT: bl bcmp
				; CHECKN: ldr x
				; CHECKN-NEXT: ldr x
				; CHECKN-NEXT: ldur x
				; CHECKN-NEXT: ldur x
				; CHECKS: ldr x
				; CHECKS-NEXT: ldr x
				; CHECKS-NEXT: ldr w
				; CHECKS-NEXT: ldr w
				; CHECKS-NEXT: ldrh w
				; CHECKS-NEXT: ldrh w
				; CHECKS-NEXT: ldrb w
				; CHECKS-NEXT: ldrb w
				}

				define i1 @bcmp_bs(i8* %s1, i8* %s2) optsize {
				entry:
				%memcmp = call i32 @memcmp(i8* %s1, i8* %s2, i64 31)
				%ret = icmp eq i32 %memcmp, 0
				ret i1 %ret

				; CHECK-LABEL: bcmp_bs:
				; CHECKN-NOT: bl memcmp
				; CHECKN: ldp x
				; CHECKN-NEXT: ldp x
				; CHECKN-NEXT: ldr x
				; CHECKN-NEXT: ldr x
				; CHECKN-NEXT: ldur x
				; CHECKN-NEXT: ldur x
				; CHECKS: bl memcmp
				}