
Details

Reviewers

SjoerdMeijer
dmgreen
jroelofs

Commits

rG2b6e0c90f981: [AArch64] Enable runtime unrolling for in-order sched models

Summary

This change also affects the "default" AArch64 compilation options: if -mcpu
is omitted, an in-order sched model is used.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

NickGuy created this revision. · Mar 4 2021, 6:40 AM

Herald added subscribers: danielkiss, zzheng, hiraditya, kristof.beyls. · Mar 4 2021, 6:40 AM

NickGuy requested review of this revision. · Mar 4 2021, 6:40 AM

SjoerdMeijer added inline comments. · Mar 4 2021, 6:50 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
981	Looks sensible to me that we do this (first) for smaller in-order cores. From a quick look, quite a few other targets that implement this hook have: // Scan the loop: don't unroll loops with calls as this could prevent // inlining. Do we need that too? Have you benchmarked this, and can we try this too?

Runtime unrolling is always going to be a bit of a heuristic, unfortunately. We don't usually know at compile time what the trip count will be, so don't know if we will end up making things worse.
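For readers less familiar with the transformation under discussion, here is a minimal hand-written sketch of what runtime unrolling with an epilogue amounts to. It is plain C++ rather than anything the unroller emits; the function names and the unroll factor of 4 are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference loop: the trip count (a.size()) is only known at run time.
long sumRolled(const std::vector<int> &a) {
  long s = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    s += a[i];
  return s;
}

// Hand-written equivalent of runtime unrolling by 4 (factor is an assumption).
// The leftover iterations run in a separate epilogue loop.
long sumRuntimeUnrolled(const std::vector<int> &a) {
  const std::size_t n = a.size();
  const std::size_t extra = n & 3;       // n mod 4: iterations left for the epilogue
  const std::size_t mainEnd = n - extra; // a multiple of 4
  long s = 0;
  std::size_t i = 0;
  for (; i < mainEnd; i += 4) {          // unrolled body: one branch per 4 elements
    s += a[i];
    s += a[i + 1];
    s += a[i + 2];
    s += a[i + 3];
  }
  for (; i < n; ++i)                     // epilogue: at most 3 iterations
    s += a[i];
  return s;
}
```

When the run-time trip count is small, the extra mask, compare and branch are pure overhead and the unrolled body may never execute, which is why the decision is a heuristic.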

If the loop has already been vectorized (and possibly interleaved) it will already have been somewhat unrolled though. We may want to be more careful with cases like that and not runtime unroll the loop that has already been vectorized. We do the same thing for MVE, where that is more important due to the low register count and tail predicated loops. But the same reasoning may well apply here too, where unrolling it further just makes the loop body too large to be useful.

Harbormaster completed remote builds in B92063: Diff 328159. · Mar 4 2021, 6:06 PM

I've done some more benchmarks with the provided suggestions:

In D97947#2603450, @SjoerdMeijer wrote:

Do we need that too? Have you benchmarked this, and can we try this too?

In the benchmark I ran with this change, I saw a slight regression (~0.01%), which is negligible given the gains from unrolling.

In D97947#2604347, @dmgreen wrote:

If the loop has already been vectorized (and possibly interleaved) it will already have been somewhat unrolled though. We may want to be more careful with cases like that and not runtime unroll the loop that has already been vectorized. We do the same thing for MVE, where that is more important due to the low register count and tail predicated loops. But the same reasoning may well apply here too, where unrolling it further just makes the loop body too large to be useful.

With this change added onto the previous, I saw an improvement of ~0.4% over the unrestricted runtime unrolling.

Harbormaster completed remote builds in B92637: Diff 328978. · Mar 8 2021, 6:34 AM

This will still unroll the remainders of vectorized loops, which will be quite common but generally unhelpful to unroll. Can we try to prevent unrolling there too?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1007	Should we be enabling partial unrolling too? If not, why not?

In D97947#2612301, @dmgreen wrote:

This will still unroll the remainders of vectorized loops, which will be quite common but generally unhelpful to unroll. Can we try to prevent unrolling there too?

It seems to be difficult to identify the remainder loop; checking the llvm.loop.isvectorized attribute (like in ARMTargetTransformInfo) caused all gains to be negated, and caused a slight regression when compared to without this change. I've instead included a check for the llvm.loop.unroll.disable attribute, which was seen on the remainder loop IR. This check causes no difference to the benchmark numbers. (Though thinking about it now, that might be due to it being handled elsewhere, making this check redundant.)

In D97947#2612301, @dmgreen wrote:

Should we be enabling partial unrolling too? If not, why not?

Enabling partial unrolling made no difference in the benchmark results, so I haven't enabled it here.

Harbormaster completed remote builds in B95507: Diff 333009. · Mar 24 2021, 4:18 PM

It seems to be difficult to identify the remainder loop; checking the llvm.loop.isvectorized attribute (like in ARMTargetTransformInfo) caused all gains to be negated, and caused a slight regression when compared to without this change. I've instead included a check for the llvm.loop.unroll.disable attribute, which was seen on the remainder loop IR. This check causes no difference to the benchmark numbers. (Though thinking about it now, that might be due to it being handled elsewhere, making this check redundant.)

Yeah it doesn't seem worthwhile to check for something that the loop unroller will already be checking for. The vectorizer will unroll (/"interleave") at the same time as vectorizing, and we can already produce loops that are very wide compared to the input, sometimes to the point that they are never executed and we are only running the remainders. Unrolling on top of that will not be beneficial in a lot of cases, but that will depend heavily on the trip count of the loops.
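To make the "very wide loops that are never executed" point concrete, the following sketch splits a run-time trip count between the vectorized-and-interleaved body and the scalar remainder. It is a simplified model, not the vectorizer's actual logic: the names are hypothetical, and it ignores cases where the vectorizer requires a scalar epilogue iteration or uses tail folding.

```cpp
#include <cassert>
#include <cstdint>

// How many scalar iterations end up in the wide body vs. the remainder loop.
struct IterationSplit {
  std::uint64_t inVectorBody;
  std::uint64_t inRemainder;
};

// vf = vectorization factor, ic = interleave count (both assumed inputs).
// One trip of the wide body covers vf * ic scalar iterations.
IterationSplit splitTripCount(std::uint64_t tripCount, unsigned vf, unsigned ic) {
  const std::uint64_t step = static_cast<std::uint64_t>(vf) * ic;
  if (step == 0)
    return {0, tripCount}; // degenerate input: everything stays scalar
  const std::uint64_t rem = tripCount % step;
  return {tripCount - rem, rem};
}
```

With vf = 4 and ic = 4, a trip count of 10 never enters the wide body and all 10 iterations run in the remainder, so any further unrolling of the wide body is wasted code size for that input.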

The very quick benchmarks I ran didn't show this to be great because of all that extra unrolling, mostly of remainder loops by the look of it. I'm sure you have been running more benchmarks, and perhaps the ones here are not very representative of general A64 code.

Enabling partial unrolling made no difference in the benchmark results, so I haven't enabled it here.

Partial unrolling is when the trip count is known but we can unroll a mod of it. I would be very surprised if it didn't come up in any of your testing.
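As a rough illustration of that definition, the sketch below picks an unroll count for a loop whose trip count is known at compile time: the largest divisor of the trip count whose unrolled body still fits a size budget. This is a deliberately simplified model with hypothetical names and parameters; the real loop unroller's cost model is more involved.

```cpp
#include <cassert>

// Pick the largest count > 1 that divides the known trip count and keeps
// bodySize * count within threshold. Returns 1 when no such count exists,
// meaning "leave the loop rolled". All three parameters are illustrative.
unsigned pickPartialUnrollCount(unsigned tripCount, unsigned bodySize,
                                unsigned threshold) {
  if (tripCount == 0 || bodySize == 0)
    return 1;
  unsigned maxCount = threshold / bodySize; // size budget
  if (maxCount > tripCount)
    maxCount = tripCount;
  for (unsigned c = maxCount; c > 1; --c)
    if (tripCount % c == 0) // divides evenly, so no remainder loop is needed
      return c;
  return 1;
}
```

Because the count divides the trip count exactly, no run-time check or epilogue is required, which is the main difference from runtime unrolling.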

The very quick benchmarks I ran didn't show this to be great because of all that extra unrolling, mostly of remainder loops by the look of it. I'm sure you have been running more benchmarks, and perhaps the ones here are not very representative of general A64 code.

I've removed the check for the loop attribute altogether, as it seemed to do more harm than good in the majority of benchmarks I ran, and I've added some further tuning to get some extra performance. The options specified in this patch were the best all-round of those I tested, giving up to a 10% improvement in some benchmark suites.

Running the llvm test suite with this change gave anywhere between a 0.35% and 2.1% improvement, depending on the specific hardware it was tested on. Interestingly, the 2.1% gain was on an out-of-order core, indicating that these changes could be beneficial there too. However, I don't have any other numbers to hand to back that claim up, so I'll keep this patch scoped to in-order cores only.

fhahn added a reviewer: jroelofs. · Apr 19 2021, 5:15 AM

Okay, that's a good result and this all looks sensible: you've run different benchmarks on different targets and noticed the different performance uplifts. Looks like that is the best we can do.

Just one remaining question from my side. Can you remind me about the impact of this? I.e., if -mcpu is omitted, we default to generic which is classified, or is using, an in-order schedmodel description on Android? Was that it? It would be great to at least mention this in the description / commit message, or as a comment too. You can also consider leaving a TODO above the !isOutOfOrder check that this might be beneficial for bigger cores too.

Harbormaster completed remote builds in B99453: Diff 338482. · Apr 19 2021, 5:55 AM

Can you remind me about the impact of this? I.e., if -mcpu is omitted, we default to generic which is classified, or is using, an in-order schedmodel description on Android?

Yep, that's right. Though I don't think it was specific to Android, but AArch64 in general.

In D97947#2698654, @NickGuy wrote:

Can you remind me about the impact of this? I.e., if -mcpu is omitted, we default to generic which is classified, or is using, an in-order schedmodel description on Android?

Yep, that's right. Though I don't think it was specific to Android, but AArch64 in general.

So does this mean the new behavior will be the default if no CPU is specified? I'm not sure if we are ready for that yet, unless we are confident that the current heuristics work well for out-of-order cores too. (Last time I benchmarked this for out-of-order cores there were a few notable regressions, but it's been a few months since then)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
988	This comment seems out of date?
llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-required-for-vectorization.ll
107 ↗	(On Diff #338524)	why do we need to disable unrolling here?

Harbormaster completed remote builds in B99483: Diff 338524. · Apr 19 2021, 8:43 AM

In D97947#2698661, @fhahn wrote:

So does this mean the new behavior will be the default if no CPU is specified? I'm not sure if we are ready for that yet, unless we are confident that the current heuristics work well for out-of-order cores too. (Last time I benchmarked this for out-of-order cores there were a few notable regressions, but it's been a few months since then)

Do you remember where those regressions were, and how important these cases are? I can look into them with these changes to see if they help or hinder.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
988	Doesn't seem like it is to me. If a function call is found, it bails without enabling runtime unrolling. This check is also performed by many other implementations of this hook.
llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-required-for-vectorization.ll
107 ↗	(On Diff #338524)	We don't need to disable it, but this test was ballooning in size when unrolling (as expected), while not testing unrolling itself. Without the knowledge of what it's testing for specifically, I didn't want to change it.

fhahn added inline comments. · Apr 19 2021, 9:19 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
988	If a function call is found, That's correct, but you are also bailing out on other cases, like instructions with vector types, right?
llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-required-for-vectorization.ll
107 ↗	(On Diff #338524)	I am not sure that's expected. It doesn't look like the loop in the function has a runtime trip count, unless I am missing something?

In D97947#2698661, @fhahn wrote:

In D97947#2698654, @NickGuy wrote:

Can you remind me about the impact of this? I.e., if -mcpu is omitted, we default to generic which is classified, or is using, an in-order schedmodel description on Android?

Yep, that's right. Though I don't think it was specific to Android, but AArch64 in general.

So does this mean the new behavior will be the default if no CPU is specified? I'm not sure if we are ready for that yet, unless we are confident that the current heuristics work well for out-of-order cores too. (Last time I benchmarked this for out-of-order cores there were a few notable regressions, but it's been a few months since then)

We probably have two options here:

Florian is right that we may need some more data on how well the bigger cores like this, although it seems you already have some.
Alternatively, or as a first step, we can introduce a target feature for this and set it for our smaller in-order cores, for which you have just determined this is a good thing.

I was under the impression that without a -mcpu it defaulted to the cortex-a53 schedule. It looks like it's no-schedule though, which still counts as an in-order core as it has no MicroOpBufferSize. Can we check whether ST->getSchedModel().ProcID != 0? A ProcID of 0 will be the "NoSchedModel".

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1006	"Force" doesn't sound like the right wording to me. It is enabling runtime/partial unrolling, which the loop unroller may use or not.
1009	Out of order cores will already enable runtime unrolling based on the MicroOpBufferSize, it will just have a reduced threshold. I don't think the TODO is useful to add.
llvm/test/Transforms/PhaseOrdering/AArch64/hoisting-required-for-vectorization.ll
107 ↗	(On Diff #338524)	It can still be partially unrolled.

In D97947#2700960, @dmgreen wrote:

I was under the impression that without a -mcpu it defaulted to cortex-a53 schedule. It looks like it's no-schedule though, which still counts as an in-order core as it has no MicroOpBufferSize. Can we check if ST->getSchedModel().ProcID != 0, which will be the "NoSchedModel".

Agreed with this. I.e., this is my current impression too. But we do need to get to the bottom of this to be sure. I seem to remember that this is the case for Android. So, could this be set in the driver?

In D97947#2700960, @dmgreen wrote:

I was under the impression that without a -mcpu it defaulted to cortex-a53 schedule. It looks like it's no-schedule though, which still counts as an in-order core as it has no MicroOpBufferSize. Can we check if ST->getSchedModel().ProcID != 0, which will be the "NoSchedModel".

Instead of checking the ProcID (-mcpu=carmel returns a ProcID of 0), I've added a check against the processor family (through ST->getProcFamily())
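The difference between the two gates can be sketched with a toy model. Everything here is a stand-in: ToySubtarget, ToyProcFamily and the "Others" family for an unspecified CPU are hypothetical names, out-of-order is modelled as a micro-op buffer size above 1, and the buffer size given to the ProcID-0 core is an assumption. The thread confirms only that -mcpu=carmel reports a ProcID of 0 and that the patch switched to a processor-family check.

```cpp
#include <cassert>

// Toy stand-ins for the subtarget queries discussed in the thread.
enum class ToyProcFamily { Others, NamedCore };

struct ToySubtarget {
  ToyProcFamily family;       // assumed: Others when no specific CPU is given
  unsigned procID;            // 0 models "NoSchedModel"
  unsigned microOpBufferSize; // 0 models a core with no reorder buffer
};

// Assumption: a core counts as out-of-order when it can buffer micro-ops.
bool toyIsOutOfOrder(const ToySubtarget &st) {
  return st.microOpBufferSize > 1;
}

// Gate suggested first: skip cores without a scheduling model.
bool gateOnProcID(const ToySubtarget &st) {
  return st.procID != 0 && !toyIsOutOfOrder(st);
}

// Gate described in the reply: skip only the unspecified CPU family.
bool gateOnFamily(const ToySubtarget &st) {
  return st.family != ToyProcFamily::Others && !toyIsOutOfOrder(st);
}
```

In this model both gates leave the no--mcpu default untouched, but only the family-based gate still enables runtime unrolling for a named core that happens to report ProcID 0.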

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
988	Right, I misunderstood. I've updated it to mention that we bail out in vector loops.
1006	Reworded both the commit message and this comment
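The bail-out conditions debated in the line 988 comments can be summarised with a small stand-alone model of the scan. ToyInst and its fields are hypothetical simplifications of the IR queries in the patch (vector result type, call or invoke, known callee, lowered to a real call); this is a sketch of the control flow, not the LLVM code itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flattened view of one instruction in the loop body.
struct ToyInst {
  bool producesVector; // result has a vector type
  bool isCall;         // call or invoke
  bool hasKnownCallee; // direct call with a resolvable target
  bool loweredToCall;  // target becomes a real call (not an intrinsic expansion)
};

// Returns true if the scan reaches the end, i.e. runtime unrolling may be enabled.
bool scanAllowsRuntimeUnroll(const std::vector<ToyInst> &body) {
  for (const ToyInst &I : body) {
    if (I.producesVector)
      return false; // treat as an already vectorised loop
    if (I.isCall) {
      if (I.hasKnownCallee && !I.loweredToCall)
        continue;   // cheap intrinsic-like call: keep scanning
      return false; // real or indirect call: unrolling could prevent inlining
    }
  }
  return true;
}
```

Note that an indirect call (no known callee) also bails out, and that a single vector-typed instruction is enough to stop the scan, which is the case the original comment did not mention.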

SjoerdMeijer added inline comments. · Apr 21 2021, 7:54 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1011	Great. Can we add some tests for this?

Harbormaster completed remote builds in B99979: Diff 339229. · Apr 21 2021, 7:57 AM

NickGuy updated this revision to Diff 339974. · Apr 23 2021, 3:59 AM

NickGuy marked an inline comment as done.

Herald added a subscriber: tmatheson. · Apr 23 2021, 3:59 AM

Nice one, looks good to me. Please wait a few days and commit this early next week to provide an opportunity to still comment on this.

This revision is now accepted and ready to land. · Apr 23 2021, 5:12 AM

Harbormaster completed remote builds in B100531: Diff 339974. · Apr 23 2021, 5:18 AM

Yeah, thanks. You didn't report them in very much detail here, but this seemed to be a general improvement on a wide range of benchmarks. Not universal but a decent improvement. LGTM.

Closed by commit rG2b6e0c90f981: [AArch64] Enable runtime unrolling for in-order sched models (authored by NickGuy). · Apr 27 2021, 5:22 AM

This revision was automatically updated to reflect the committed changes.

NickGuy added a commit: rG2b6e0c90f981: [AArch64] Enable runtime unrolling for in-order sched models.

Diff 333009

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[12 lines not shown]
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/CodeGen/CostTable.h"
 #include "llvm/CodeGen/TargetLowering.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAArch64.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/Support/Debug.h"
+#include "llvm/Transforms/Utils/LoopUtils.h"
 #include <algorithm>
 using namespace llvm;
 using namespace llvm::PatternMatch;

 #define DEBUG_TYPE "aarch64tti"

 static cl::opt<bool> EnableFalkorHWPFUnrollFix("enable-falkor-hwpf-unroll-fix",
                                                cl::init(true), cl::Hidden);
[943 lines not shown; hunk context: void AArch64TTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,]
   BaseT::getUnrollingPreferences(L, SE, UP);

   // For inner loop, it is more likely to be a hot one, and the runtime check
   // can be promoted out from LICM pass, so the overhead is less, let's try
   // a larger threshold to unroll more loops.
   if (L->getLoopDepth() > 1)
     UP.PartialThreshold *= 2;

   // Disable partial & runtime unrolling on -Os.
   UP.PartialOptSizeThreshold = 0;

   if (ST->getProcFamily() == AArch64Subtarget::Falkor &&
       EnableFalkorHWPFUnrollFix)
     getFalkorUnrollingPreferences(L, SE, UP);

+  if (getBooleanLoopAttribute(L, "llvm.loop.unroll.disable"))
+    return;
+
+  // Scan the loop: don't unroll loops with calls as this could prevent
+  // inlining.
+  for (auto *BB : L->getBlocks()) {
+    for (auto &I : *BB) {
+      // Don't unroll vectorised loop.
+      if (I.getType()->isVectorTy())
+        return;
+
+      if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
+        if (const Function *F = cast<CallBase>(I).getCalledFunction()) {
+          if (!isLoweredToCall(F))
+            continue;
+        }
+        return;
+      }
+    }
+  }
+
+  // Force runtime unrolling for in-order models
+  UP.Runtime |= !ST->getSchedModel().isOutOfOrder();
 }

 void AArch64TTIImpl::getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                                            TTI::PeelingPreferences &PP) {
   BaseT::getPeelingPreferences(L, SE, PP);
 }

 Value *AArch64TTIImpl::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
                                                          Type *ExpectedType) {
[299 lines not shown]

llvm/test/Transforms/LoopUnroll/AArch64/runtime-loop.ll

 ; RUN: opt < %s -S -loop-unroll -mtriple aarch64 -mcpu=cortex-a57 -unroll-runtime-epilog=true | FileCheck %s -check-prefix=EPILOG
 ; RUN: opt < %s -S -loop-unroll -mtriple aarch64 -mcpu=cortex-a57 -unroll-runtime-epilog=false | FileCheck %s -check-prefix=PROLOG
+; RUN: opt < %s -S -loop-unroll -mtriple aarch64 -mcpu=cortex-r82 -unroll-runtime-epilog=true | FileCheck %s -check-prefix=EPILOG
+; RUN: opt < %s -S -loop-unroll -mtriple aarch64 -mcpu=cortex-r82 -unroll-runtime-epilog=false | FileCheck %s -check-prefix=PROLOG

 ; Tests for unrolling loops with run-time trip counts

 ; EPILOG: %xtraiter = and i32 %n
 ; EPILOG: for.body:
 ; EPILOG: %lcmp.mod = icmp ne i32 %xtraiter, 0
 ; EPILOG: br i1 %lcmp.mod, label %for.body.epil.preheader, label %for.end.loopexit
 ; EPILOG: for.body.epil:
[29 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable runtime unrolling for in-order scheduling models
ClosedPublic
