
Details

Reviewers

SjoerdMeijer
dmgreen
fhahn

Commits

rGcd880442ae66: [CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize…

Summary

Different targets may handle branch performance differently, so this patch allows
targets to specify the TailDuplicateSize threshold. The threshold sets the maximum size of a
tail block that may still be duplicated into its predecessors to generate straight-line code instead of a branch.
This patch also specifies the override values for the AArch64 subtarget.
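
To make the transformation concrete, here is a minimal source-level sketch of what tail duplication does. The real pass works on machine basic blocks, not C++ source; `consume`, `sharedTail`, and `duplicatedTail` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the call that ends the function.
int consume(int Tag, int Value) { return Tag * 100 + Value; }

// Before tail duplication: both paths branch to one shared tail block.
int sharedTail(const int *Ptr, int Global) {
  int Offset = 0;
  if (Ptr)
    Offset = *Ptr; // "if.then" path
  // Shared tail ("if.end"), reached from both predecessors.
  return consume(10, Global + Offset + 16);
}

// After tail duplication: the small tail is copied into each predecessor,
// so every path is straight-line code that ends in its own copy of the call.
int duplicatedTail(const int *Ptr, int Global) {
  if (Ptr) {
    int Offset = *Ptr;
    return consume(10, Global + Offset + 16); // first copy of the tail
  }
  return consume(10, Global + 16); // second copy of the tail
}

// Both forms compute the same result; only the control flow differs.
void checkEquivalence() {
  int V = 5;
  assert(sharedTail(&V, 1) == duplicatedTail(&V, 1));
  assert(sharedTail(nullptr, 1) == duplicatedTail(nullptr, 1));
}
```

The threshold discussed in this review bounds how many instructions the tail may contain before the copy is considered too expensive in code size.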

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

NickGuy created this revision. · Jan 28 2021, 9:56 AM

Herald added a subscriber: hiraditya. · Jan 28 2021, 9:56 AM

NickGuy requested review of this revision. · Jan 28 2021, 9:56 AM

NickGuy added a child revision: D95632: [AArch64] Specify Tail Duplication Size Override.

Harbormaster completed remote builds in B87052: Diff 319899. · Jan 28 2021, 10:47 AM

I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.

Also adding some more folks who might care about AArch64 tuning: @fhahn, @samparker , @t.p.northover

We have seen a case where generating straight-line code for tail blocks is really good for performance. From memory, this was a relatively small function that was called many times, so the change makes a big difference.
But since this is changing a generic AArch64 codegen option/value, I think @NickGuy you need to show that this is good/bad/neutral for some more code other than our one motivation example.

I think a target hook for TailDuplicateSize makes sense though; it is easy to imagine that different targets would have different preferences for this.

I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.

Done

In D95632#2533284, @dmgreen wrote:

This sounds OK to me, but 6 is quite a big jump up from 2. Is it possible to make the normal threshold 2 or 3, and the aggressive threshold 6?
Oh. It looks like there is already an aggressive limit. TailDuplicator is either used by itself or in MachineBlockPlacement. MachineBlockPlacement already sets the limit to 4 when optimizing aggressively. Would it work for your use case to alter the value there based on the target instead?

I don't see a problem with only performing this at -O3, and moving the override from TailDuplicator to MachineBlockPlacement doesn't seem to cause any problems (at least, as far as I could tell).

you need to show that this is good/bad/neutral for some more code other than our one motivation example.

Here are the results from the benchmarking performed with this patch. It improves a number of cases; however, because the tests are so small, the results are rather susceptible to system noise.

LLVM Test suite, AArch64 exec_time results (cherry-picked for the most interesting/significant tests), lower is better

Benchmark	Duplication threshold = 2	Duplication threshold = 6	Difference %
test-suite :: SingleSource/Benchmarks/Misc/evalloop.test	1.212	0.6376	-47.39%
test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test	4.38	6.408	46.30%
test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test	6.408	4.356	-32.02%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary2.test	0.02	0.0136	-32.00%
test-suite :: SingleSource/Benchmarks/McGill/misr.test	0.252	0.2968	17.78%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test	0.016	0.0184	15.00%
test-suite :: MultiSource/Benchmarks/Ptrdist/ks/ks.test	1.7544	1.5	-14.50%
test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test	1.536	1.3496	-12.14%
test-suite :: MultiSource/Applications/kimwitu++/kc.test	0.048	0.0432	-10.00%
test-suite :: SingleSource/Benchmarks/Stanford/Towers.test	0.0184	0.0168	-8.70%
test-suite :: MultiSource/Benchmarks/Olden/mst/mst.test	0.1168	0.1072	-8.22%
test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test	5.7104	6.1456	7.62%
test-suite :: MultiSource/Applications/hexxagon/hexxagon.test	12.4872	11.6384	-6.80%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test	0.0256	0.0272	6.25%
test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test	5.7496	6.1	6.09%
test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test	1.2448	1.32	6.04%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test	6.048	5.7024	-5.71%
test-suite :: MultiSource/Benchmarks/Olden/bisort/bisort.test	0.6808	0.648	-4.82%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test	0.1984	0.1896	-4.44%
test-suite :: MultiSource/Benchmarks/McCat/03-testtrie/testtrie.test	0.0192	0.02	4.17%
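
For reading the table, the "Difference %" column appears to be the relative change of exec_time from threshold 2 to threshold 6. That formula is an assumption, not something stated in the thread, but it reproduces the listed rows. A minimal sketch, with `percentDifference` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative change of exec_time in percent. A negative value means the run
// with the new threshold was faster (lower exec_time is better).
// Assumed formula: (candidate - baseline) / baseline * 100.
double percentDifference(double Baseline, double Candidate) {
  return (Candidate - Baseline) / Baseline * 100.0;
}

// Cross-check against two rows of the table, to two decimal places.
void checkTableRows() {
  // evalloop: 1.212 -> 0.6376, listed as -47.39%
  assert(std::fabs(percentDifference(1.212, 0.6376) - (-47.39)) < 0.01);
  // Searching-dbl: 4.38 -> 6.408, listed as 46.30%
  assert(std::fabs(percentDifference(4.38, 6.408) - 46.30) < 0.01);
}
```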

Spec results are a bit more stable, and show a similar pattern (a small delta in both directions, but overall an improvement).
Beyond the benchmarks shown in this table, no other meaningful differences were identified.

Spec2017 results, higher is better

Benchmark	Difference %
505.mcf_r	0.675%
523.xalancbmk_r	-0.24%
557.xz_r	0.417%

Harbormaster completed remote builds in B87533: Diff 320791. · Feb 2 2021, 8:47 AM

Thanks for those results.

I think the summary of those results is: it doesn't make things worse, which is also difficult to imagine from this change, and it improves a few cases.
I am ignoring the first 3 rows of the llvm test suite results, that just looks like noise to me. The TSVC ones look like individual test cases from that vectoriser suite, are very small, and are probably just noise. It would be good to confirm that though, just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.

SPEC looks good: even though it was not our motivating example, it does a little bit of good (or at least doesn't make things worse or show surprises).

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1943	Bikeshedding names: don't think we need the `Override` part in the name.
llvm/test/CodeGen/AArch64/aarch64-tail-dup-size.ll
5	Can we add a RUN line with `tail-dup-size=4`, which should give the same result as the O2 one.

SjoerdMeijer added a reviewer: fhahn. · Feb 3 2021, 6:45 AM

just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.

As far as I could tell, there are no meaningful differences in that test.

Herald added subscribers: danielkiss, kristof.beyls. · Feb 3 2021, 7:32 AM

Harbormaster completed remote builds in B87706: Diff 321099. · Feb 3 2021, 8:24 AM

Like we were talking about offline, there's a chance this isn't as good as the first version, which altered the taildup threshold directly: doing this earlier might give more scheduling freedom and other optimization opportunities to the folded tails. It's now only done in machine block placement, late in the pipeline. If this version is working that's great, but check it's still OK.

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1943	+1 to removing override. It's probably worth calling out that this is the threshold used in machineblockplacement now, not the one used in the other taildup passes. It seems to be called TailDupPlacementThreshold there.

If this version is working that's great, but check it's still OK.

I haven't seen any issues so far, and in some cases I've seen a slight improvement. I've included the new SPEC results; this time there are more significant changes across the suite.

Benchmark	Difference % (override in TailDuplicator)	Difference % (override in MachineBlockPlacement)
500.perlbench_r	-0.048%	-0.155%
502.gcc_r	0.036%	0.234%
505.mcf_r	0.675%	1.349%
523.xalancbmk_r	-0.240%	-0.051%
520.omnetpp_r	0.048%	0.164%
525.x264_r	0.062%	0.185%
557.xz_r	0.417%	0.296%

Harbormaster completed remote builds in B88088: Diff 321760. · Feb 5 2021, 9:31 AM

OK. Thanks for checking. In that case LGTM.

This revision is now accepted and ready to land. · Feb 8 2021, 12:44 AM

This revision was landed with ongoing or failed builds. · Feb 8 2021, 5:28 AM

Closed by commit rGcd880442ae66: [CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize… (authored by NickGuy).

This revision was automatically updated to reflect the committed changes.

NickGuy added a commit: rGcd880442ae66: [CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize….

Diff 322086

llvm/include/llvm/CodeGen/TargetInstrInfo.h

 public:
   /// Return MIR formatter to format/parse MIR operands. Target can override
   /// this virtual function and return target specific MIR formatter.
   virtual const MIRFormatter *getMIRFormatter() const {
     if (!Formatter.get())
       Formatter = std::make_unique<MIRFormatter>();
     return Formatter.get();
   }

+  /// Returns the target-specific default value for tail duplication.
+  /// This value will be used if the tail-dup-placement-threshold argument is
+  /// not provided.
+  virtual unsigned getTailDuplicateSize(CodeGenOpt::Level OptLevel) const {
+    return OptLevel >= CodeGenOpt::Aggressive ? 4 : 2;
+  }
+
 private:
   mutable std::unique_ptr<MIRFormatter> Formatter;
   unsigned CallFrameSetupOpcode, CallFrameDestroyOpcode;
   unsigned CatchRetOpcode;
   unsigned ReturnOpcode;
 };

 /// Provide DenseMapInfo for TargetInstrInfo::RegSubRegPair.

llvm/lib/CodeGen/MachineBlockPlacement.cpp

   if (PassConfig->getOptLevel() >= CodeGenOpt::Aggressive) {
     // At O3 we should be more willing to copy blocks for tail duplication. This
     // increases size pressure, so we only do it at O3
     // Do this unless only the regular threshold is explicitly set.
     if (TailDupPlacementThreshold.getNumOccurrences() == 0 ||
         TailDupPlacementAggressiveThreshold.getNumOccurrences() != 0)
       TailDupSize = TailDupPlacementAggressiveThreshold;
   }

+  // If there's no threshold provided through options, query the target
+  // information for a threshold instead.
+  if (TailDupPlacementThreshold.getNumOccurrences() == 0 &&
+      (PassConfig->getOptLevel() < CodeGenOpt::Aggressive ||
+       TailDupPlacementAggressiveThreshold.getNumOccurrences() == 0))
+    TailDupSize = TII->getTailDuplicateSize(PassConfig->getOptLevel());
+
   if (allowTailDupPlacement()) {
     MPDT = &getAnalysis<MachinePostDominatorTree>();
     bool OptForSize = MF.getFunction().hasOptSize() ||
                       llvm::shouldOptimizeForSize(&MF, PSI, &MBFI->getMBFI());
     if (OptForSize)
       TailDupSize = 1;
     bool PreRegAlloc = false;
     TailDup.initMF(MF, PreRegAlloc, MBPI, MBFI.get(), PSI,
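
The precedence between the two command-line options, the target hook, and the size-optimization override can be hard to read off the hunk. The following standalone sketch models it outside LLVM. Several things here are assumptions: the default option values (2 and 4, taken from the discussion rather than from the diff), the initial assignment from the regular threshold (not shown in the hunk), and all names (`selectTailDupSize`, `genericHook`, `aarch64Hook`), which are hypothetical. `std::nullopt` stands for "option not given", i.e. `getNumOccurrences() == 0`.

```cpp
#include <cassert>
#include <optional>

enum class OptLevel { None, Less, Default, Aggressive };

// Assumed option defaults; they are not visible in the hunk.
constexpr unsigned DefaultThreshold = 2;
constexpr unsigned DefaultAggressiveThreshold = 4;

// Stand-ins for the generic and the AArch64 getTailDuplicateSize hooks.
unsigned genericHook(OptLevel Level) {
  return Level >= OptLevel::Aggressive ? 4 : 2;
}
unsigned aarch64Hook(OptLevel Level) {
  return Level >= OptLevel::Aggressive ? 6 : 2;
}

unsigned selectTailDupSize(OptLevel Level, std::optional<unsigned> Threshold,
                           std::optional<unsigned> AggressiveThreshold,
                           unsigned (*TargetHook)(OptLevel), bool OptForSize) {
  // Assumption: the pass starts from the regular option value.
  unsigned Size = Threshold.value_or(DefaultThreshold);
  const bool Aggressive = Level >= OptLevel::Aggressive;
  // At O3, use the aggressive option unless only the regular one was set.
  if (Aggressive && (!Threshold || AggressiveThreshold))
    Size = AggressiveThreshold.value_or(DefaultAggressiveThreshold);
  // Added by this patch: with no applicable option given, ask the target.
  if (!Threshold && (!Aggressive || !AggressiveThreshold))
    Size = TargetHook(Level);
  // Size optimization wins over everything else.
  if (OptForSize)
    Size = 1;
  return Size;
}

void checkSelection() {
  using std::nullopt;
  // No options: the target decides (2 at O2, 6 at O3 on AArch64).
  assert(selectTailDupSize(OptLevel::Default, nullopt, nullopt, aarch64Hook, false) == 2);
  assert(selectTailDupSize(OptLevel::Aggressive, nullopt, nullopt, aarch64Hook, false) == 6);
  assert(selectTailDupSize(OptLevel::Aggressive, nullopt, nullopt, genericHook, false) == 4);
  // An explicit regular threshold bypasses the hook, as in the test's RUN lines.
  assert(selectTailDupSize(OptLevel::Default, 6u, nullopt, aarch64Hook, false) == 6);
  // An explicit aggressive threshold bypasses the hook at O3.
  assert(selectTailDupSize(OptLevel::Aggressive, nullopt, 3u, aarch64Hook, false) == 3);
  // Optimizing for size always yields 1.
  assert(selectTailDupSize(OptLevel::Aggressive, nullopt, nullopt, aarch64Hook, true) == 1);
}
```

Under these assumptions, any explicitly given option keeps its old meaning and the hook only replaces the built-in defaults.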

llvm/lib/Target/AArch64/AArch64InstrInfo.h

 public:
   static bool isSEHInstruction(const MachineInstr &MI);

   Optional<RegImmPair> isAddImmediate(const MachineInstr &MI,
                                       Register Reg) const override;

   Optional<ParamLoadedValue> describeLoadedValue(const MachineInstr &MI,
                                                  Register Reg) const override;

+  unsigned int getTailDuplicateSize(CodeGenOpt::Level OptLevel) const override;
+
   static void decomposeStackOffsetForFrameOffsets(const StackOffset &Offset,
                                                   int64_t &NumBytes,
                                                   int64_t &NumPredicateVectors,
                                                   int64_t &NumDataVectors);
   static void decomposeStackOffsetForDwarfOffsets(const StackOffset &Offset,
                                                   int64_t &ByteSized,
                                                   int64_t &VGSized);
 #define GET_INSTRINFO_HELPER_DECLS

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

 bool AArch64InstrInfo::isPTestLikeOpcode(unsigned Opc) const {
   return get(Opc).TSFlags & AArch64::InstrFlagIsPTestLike;
 }

 bool AArch64InstrInfo::isWhileOpcode(unsigned Opc) const {
   return get(Opc).TSFlags & AArch64::InstrFlagIsWhile;
 }

+unsigned int
+AArch64InstrInfo::getTailDuplicateSize(CodeGenOpt::Level OptLevel) const {
+  return OptLevel >= CodeGenOpt::Aggressive ? 6 : 2;
+}
+
 unsigned llvm::getBLRCallOpcode(const MachineFunction &MF) {
   if (MF.getSubtarget<AArch64Subtarget>().hardenSlsBlr())
     return AArch64::BLRNoIP;
   else
     return AArch64::BLR;
 }

 #define GET_INSTRINFO_HELPERS
 #define GET_INSTRMAP_INFO
 #include "AArch64GenInstrInfo.inc"
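
The hook follows the usual virtual-override pattern: the generic class supplies a default and a target replaces it. A self-contained sketch of that dispatch is below. `TargetInstrInfoSketch`, `AArch64InstrInfoSketch`, `queryThreshold`, and the local `OptLevel` enum are hypothetical stand-ins and not the LLVM classes; only the returned values are taken from the diff.

```cpp
#include <cassert>

enum class OptLevel { None, Less, Default, Aggressive };

// Stand-in for the generic target interface; only the new hook is modelled.
class TargetInstrInfoSketch {
public:
  virtual ~TargetInstrInfoSketch() = default;
  // Generic default: 4 when optimizing aggressively, otherwise 2.
  virtual unsigned getTailDuplicateSize(OptLevel Level) const {
    return Level >= OptLevel::Aggressive ? 4 : 2;
  }
};

// Stand-in for the AArch64 implementation, which raises the aggressive value.
class AArch64InstrInfoSketch : public TargetInstrInfoSketch {
public:
  unsigned getTailDuplicateSize(OptLevel Level) const override {
    return Level >= OptLevel::Aggressive ? 6 : 2;
  }
};

// A pass sees only the base interface; the target's answer arrives through
// virtual dispatch.
unsigned queryThreshold(const TargetInstrInfoSketch &TII, OptLevel Level) {
  return TII.getTailDuplicateSize(Level);
}

void checkHookDispatch() {
  TargetInstrInfoSketch Generic;
  AArch64InstrInfoSketch AArch64;
  assert(queryThreshold(Generic, OptLevel::Aggressive) == 4);
  assert(queryThreshold(AArch64, OptLevel::Aggressive) == 6);
  assert(queryThreshold(AArch64, OptLevel::Default) == 2);
}
```

Targets that do not override the hook keep the generic values, so the change should be behavior-neutral for them.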

llvm/test/CodeGen/AArch64/aarch64-tail-dup-size.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-none-linux -O2 < %s | FileCheck %s --check-prefix=CHECK-O2
				; RUN: llc -mtriple=aarch64-none-linux -O3 < %s | FileCheck %s --check-prefix=CHECK-O3

				; RUN: llc -mtriple=aarch64-none-linux -tail-dup-size=4 < %s | FileCheck %s --check-prefix=CHECK-O2
				; RUN: llc -mtriple=aarch64-none-linux -tail-dup-placement-threshold=4 < %s | FileCheck %s --check-prefix=CHECK-O2
				; RUN: llc -mtriple=aarch64-none-linux -tail-dup-placement-threshold=6 < %s | FileCheck %s --check-prefix=CHECK-O3

				%a = type { %a*, i32, %b }
				%b = type { %c }
				%c = type { i32, i32, [31 x i8] }

				@global_ptr = dso_local local_unnamed_addr global %a* null, align 8
				@global_int = dso_local local_unnamed_addr global i32 0, align 4

				define dso_local void @testcase(%a** nocapture %arg){
				; CHECK-O2-LABEL: testcase:
				; CHECK-O2: // %bb.0: // %entry
				; CHECK-O2-NEXT: adrp x8, global_ptr
				; CHECK-O2-NEXT: ldr x9, [x8, :lo12:global_ptr]
				; CHECK-O2-NEXT: cbz x9, .LBB0_2
				; CHECK-O2-NEXT: // %bb.1: // %if.then
				; CHECK-O2-NEXT: ldr x9, [x9]
				; CHECK-O2-NEXT: str x9, [x0]
				; CHECK-O2-NEXT: ldr x8, [x8, :lo12:global_ptr]
				; CHECK-O2-NEXT: b .LBB0_3
				; CHECK-O2-NEXT: .LBB0_2:
				; CHECK-O2-NEXT: mov x8, xzr
				; CHECK-O2-NEXT: .LBB0_3: // %if.end
				; CHECK-O2-NEXT: adrp x9, global_int
				; CHECK-O2-NEXT: ldr w1, [x9, :lo12:global_int]
				; CHECK-O2-NEXT: add x2, x8, #16 // =16
				; CHECK-O2-NEXT: mov w0, #10
				; CHECK-O2-NEXT: b externalfunc
				;
				; CHECK-O3-LABEL: testcase:
				; CHECK-O3: // %bb.0: // %entry
				; CHECK-O3-NEXT: adrp x8, global_ptr
				; CHECK-O3-NEXT: ldr x9, [x8, :lo12:global_ptr]
				; CHECK-O3-NEXT: cbz x9, .LBB0_2
				; CHECK-O3-NEXT: // %bb.1: // %if.then
				; CHECK-O3-NEXT: ldr x9, [x9]
				; CHECK-O3-NEXT: str x9, [x0]
				; CHECK-O3-NEXT: ldr x8, [x8, :lo12:global_ptr]
				; CHECK-O3-NEXT: adrp x9, global_int
				; CHECK-O3-NEXT: ldr w1, [x9, :lo12:global_int]
				; CHECK-O3-NEXT: add x2, x8, #16 // =16
				; CHECK-O3-NEXT: mov w0, #10
				; CHECK-O3-NEXT: b externalfunc
				; CHECK-O3-NEXT: .LBB0_2:
				; CHECK-O3-NEXT: mov x8, xzr
				; CHECK-O3-NEXT: adrp x9, global_int
				; CHECK-O3-NEXT: ldr w1, [x9, :lo12:global_int]
				; CHECK-O3-NEXT: add x2, x8, #16 // =16
				; CHECK-O3-NEXT: mov w0, #10
				; CHECK-O3-NEXT: b externalfunc
				entry:
				%0 = load %a*, %a** @global_ptr, align 8
				%cmp.not = icmp eq %a* %0, null
				br i1 %cmp.not, label %if.end, label %if.then

				if.then: ; preds = %entry
				%1 = getelementptr inbounds %a, %a* %0, i64 0, i32 0
				%2 = load %a*, %a** %1, align 8
				store %a* %2, %a** %arg, align 8
				%.pre = load %a*, %a** @global_ptr, align 8
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%3 = phi %a* [ %.pre, %if.then ], [ null, %entry ]
				%4 = load i32, i32* @global_int, align 4
				%5 = getelementptr inbounds %a, %a* %3, i64 0, i32 2, i32 0, i32 1
				tail call void @externalfunc(i32 10, i32 %4, i32* nonnull %5)
				ret void
				}

				declare dso_local void @externalfunc(i32, i32, i32*)

This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize default threshold
ClosedPublic

