This change gives a 0.25% speedup on execution time, a 0.82% improvement
in benchmark scores and a 0.20% increase in binary size on a Cortex-A53.
These numbers are the geomean results on a wide range of benchmarks from
the test-suite and a range of proprietary suites.
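For context, the change boils down to picking a preferred function-entry alignment per CPU. Below is a minimal standalone sketch of that idea, not the actual AArch64Subtarget.cpp code; the function name and the per-core values reflect the discussion in this thread and are illustrative only:

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for the per-CPU tuning in AArch64Subtarget.cpp
// (names and structure here are hypothetical, not LLVM's).
// Returns the preferred function-entry alignment in bytes.
unsigned prefFunctionAlignment(const std::string &CPU) {
  if (CPU == "cortex-a53")
    return 8;   // value settled on later in this thread
  if (CPU == "cortex-a57")
    return 16;  // quadword fetch window, per the A57 Optimization Guide
  return 4;     // default: natural AArch64 instruction alignment
}
```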
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
I assume you're talking about the llvm test-suite benchmarks. If not, you may want to add a link to your benchmarks :) That said, the size increase seems non-negligible. Have you considered disabling this when optimizing for size?
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
Yes, I meant the llvm test-suite benchmarks :) I'll look into setting PrefFunctionAlignment only when not optimizing for size. That seems like a sensible thing to do and may be worth doing for the other Cortex-A cores too.
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
Thank you. That's greatly appreciated.
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
After having a look, I found that PrefFunctionAlignment is not set when optimizing for size [1], so we do not have to handle that case in AArch64Subtarget.cpp.
[1] https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/MachineFunction.cpp#L132
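The gating at the linked MachineFunction.cpp line can be paraphrased as follows. This is a self-contained sketch, not the actual LLVM source (the real code reads the function's optsize attribute and the target's preferred alignment); it just shows why no extra check is needed in the subtarget:

```cpp
#include <cassert>

// Paraphrase (hypothetical names) of the optsize gating: the target's
// preferred function alignment is applied only when the function is
// NOT optimizing for size, so the subtarget itself needs no check.
unsigned effectiveAlignment(unsigned PrefFunctionAlignment, bool OptSize) {
  const unsigned MinAlign = 4; // AArch64 instructions are 4 bytes wide
  if (OptSize)
    return MinAlign; // preferred alignment ignored under optsize
  return PrefFunctionAlignment > MinAlign ? PrefFunctionAlignment : MinAlign;
}
```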
We found similar results on spec2k6 for aarch64 that we attributed to function alignment. Have you tried that? I need to dig up the one culprit...
So, it seems it was sphinx, but that was loop alignment, 4 bytes on A53, 8 bytes on A57, to do with the fetch alignment. Maybe this is a related issue. Why 16, though?
Thanks Renato. Yes aligning the function start at 16 byte boundaries is for maximum fetch performance. To quote from the A57 Optimization Guide:
Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can retrieve four (or a full quadword’s worth of) instructions, maximizing fetch bandwidth following the taken branch.
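The fetch-bandwidth argument in the quote can be made concrete: with a 16-byte (quadword) fetch window and fixed 4-byte AArch64 instructions, the number of useful instructions in the first fetch after a taken branch depends on the entry point's offset within the window. A small sketch (the 16-byte window comes from the quote above; the function name is made up):

```cpp
#include <cassert>
#include <cstdint>

// Number of 4-byte instructions available from the first 16-byte
// (quadword) fetch, given the function's entry address.
unsigned instsInFirstFetch(std::uint64_t EntryAddr) {
  const std::uint64_t FetchWindow = 16; // quadword fetch, per the A57 guide
  const std::uint64_t InstSize = 4;     // fixed-width AArch64 encoding
  std::uint64_t Offset = EntryAddr % FetchWindow;
  return static_cast<unsigned>((FetchWindow - Offset) / InstSize);
}
```

A 16-byte-aligned entry gets the full four instructions; an entry 12 bytes into the window gets only one, wasting most of the fetch.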
For Cortex-A53, 8-byte alignment may be enough; I'll run the same set of benchmarks with 8-byte alignment.
Using 8-byte alignment gives a 0.25% speedup on execution time (was 0.23% with 16 bytes), a 0.82% improvement in benchmark scores (was 0.93% with 16 bytes), and a 0.20% increase in binary size (was 0.55%). So for the score-related benchmarks, 8-byte alignment is noticeably worse, but the impact on size is much smaller. Should we use 8-byte alignment to keep the binary size down?
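A back-of-envelope way to see why 8-byte alignment costs so much less in size: with 4-byte instructions, the previous function's end is equally likely to land at any 4-byte offset within the alignment boundary, so the expected padding per function is (align - 4) / 2 bytes: 6 bytes at 16-byte alignment versus 2 at 8-byte. A sketch; the uniform-offset assumption is mine, not measured:

```cpp
#include <cassert>

// Expected padding bytes inserted before a function entry when
// aligning to `Align` bytes, assuming the previous function ends at a
// uniformly random 4-byte offset (illustrative assumption only).
double expectedPadding(unsigned Align) {
  const unsigned InstSize = 4;
  // Possible paddings are 0, 4, ..., Align - 4, each equally likely.
  return (Align - InstSize) / 2.0;
}
```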
I wouldn't rely too much on LLVM's "benchmarking" suite. They're good to spot regressions, but not very representative of all things. The reduction in code size is higher than in performance, so I think that's a win.
@davide, comments on the new code size changes?
cheers,
--renato
PS: A quick EEMBC run would also be interesting, given that we're talking about code size on A53.
Yeah, I agree LLVM’s benchmarking suite isn’t a good test in itself, which is why I also tried the proprietary benchmarks. Unfortunately I can’t share details about which proprietary benchmarks were or weren’t included.
I think we should go with 8 byte alignment for Cortex-A53, as the small improvement of 16 byte alignment is outweighed by the big increase in size. @davide what do you think?