This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use 8 bytes as preferred function alignment on Cortex-A53.
ClosedPublic

Authored by fhahn on Jul 18 2017, 9:01 AM.

Details

Summary

This change gives a 0.25% speedup on execution time, a 0.82% improvement
in benchmark scores and a 0.20% increase in binary size on a Cortex-A53.
These numbers are the geomean results on a wide range of benchmarks from
the test-suite and a range of proprietary suites.
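
For reference, the change amounts to setting the subtarget's preferred function alignment for Cortex-A53. A minimal sketch of the relevant hunk, assuming the usual shape of AArch64Subtarget::initializeProperties() at the time (PrefFunctionAlignment holds a log2 value, so 3 means 2^3 = 8 bytes and 4 would mean 16 bytes):

  // Sketch only; other processor families and surrounding code elided.
  // PrefFunctionAlignment is a log2 value: 3 -> 8-byte, 4 -> 16-byte alignment.
  switch (ARMProcFamily) {
  case CortexA53:
    PrefFunctionAlignment = 3;
    break;
  // ...
  default:
    break;
  }

The subtarget value is then exposed to codegen through the target lowering's preferred-function-alignment hook, so functions compiled for a Cortex-A53 subtarget pick it up unless they are optimizing for size (see the discussion below).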

Diff Detail

Event Timeline

fhahn created this revision.Jul 18 2017, 9:01 AM
davide added a subscriber: davide.Jul 18 2017, 9:19 AM
davide added inline comments.
lib/Target/AArch64/AArch64Subtarget.cpp
133–136

I assume you're talking about the llvm test-suite benchmarks.
If not, you may want to add a link to your benchmarks :)

That said, the size increase seems non-negligible. Have you considered disabling this when optimizing for size?

grimar added a subscriber: grimar.Jul 18 2017, 9:23 AM
fhahn added inline comments.Jul 18 2017, 9:45 AM
lib/Target/AArch64/AArch64Subtarget.cpp
133–136

Yes, I meant the llvm test-suite benchmarks :)

I'll look into setting PrefFunctionAlignment only when not optimizing for size. That seems like a sensible thing to do, and may be worth doing for the other Cortex-A cores too.

fhahn edited the summary of this revision. (Show Details)Jul 18 2017, 9:45 AM
davide added inline comments.Jul 18 2017, 9:46 AM
lib/Target/AArch64/AArch64Subtarget.cpp
133–136

Thank you. That's greatly appreciated.

fhahn added inline comments.Jul 19 2017, 3:54 AM
lib/Target/AArch64/AArch64Subtarget.cpp
133–136

That said, the size increase seems non-negligible. Have you considered disabling this when optimizing for size?

After having a look, I found that PrefFunctionAlignment is not set when optimizing for size [1], so we do not have to handle that case in AArch64Subtarget.cpp.

[1] https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/MachineFunction.cpp#L132
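
For context, the logic at the referenced line looked roughly like this at the time (paraphrased; exact variable names may differ from the linked revision): the preferred alignment is only applied to functions that are not marked with the optimize-for-size attribute, so targets that set PrefFunctionAlignment do not have to special-case optsize themselves.

  // Paraphrased sketch of MachineFunction's alignment selection.
  Alignment = STI->getTargetLowering()->getMinFunctionAlignment();

  // The preferred alignment is only applied when the function is not
  // optimizing for size.
  if (!F->hasFnAttribute(Attribute::OptimizeForSize))
    Alignment = std::max(Alignment,
                         STI->getTargetLowering()->getPrefFunctionAlignment());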

fhahn marked an inline comment as done.Jul 19 2017, 3:54 AM
fhahn added inline comments.Jul 19 2017, 6:15 AM
lib/Target/AArch64/AArch64Subtarget.cpp
133–136

I've created D35620 which updates preferred-function-alignment.ll to make sure we do not regress for optsize functions.

fhahn marked an inline comment as done.Jul 21 2017, 2:51 PM

We found similar results on spec2k6 for aarch64 that we attributed to function alignment. Have you tried that? I need to dig up the one culprit...

So, it seems it was sphinx, but that was loop alignment, 4 bytes on A53, 8 bytes on A57, to do with the fetch alignment. Maybe this is a related issue. Why 16, though?

fhahn added a comment.Jul 23 2017, 8:30 AM

Thanks Renato. Yes, aligning the function start at 16-byte boundaries is for maximum fetch performance. To quote from the A57 Optimization Guide:

Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can retrieve four (or a full quadword’s worth of) instructions, maximizing fetch bandwidth following the taken branch.

For Cortex-A53, 8-byte alignment may be enough; I'll run the same set of benchmarks with 8-byte alignment.

fhahn updated this revision to Diff 107876.Jul 24 2017, 3:37 AM
fhahn retitled this revision from [AArch64] Use 16 bytes as preferred function alignment on Cortex-A53. to [AArch64] Use 8 bytes as preferred function alignment on Cortex-A53..
fhahn edited the summary of this revision. (Show Details)

Using 8-byte alignment gives a 0.25% speedup on execution time (was 0.23% with 16 bytes), a 0.82% improvement in benchmark scores (was 0.93% with 16 bytes) and a 0.20% increase in binary size (was 0.55%). So for the score-related benchmarks, 8-byte alignment is noticeably worse, but the impact on size is much smaller. Should we use 8-byte alignment to keep the binary size down?

rengolin edited edge metadata.Jul 24 2017, 5:30 AM

Using 8-byte alignment gives a 0.25% speedup on execution time (was 0.23% with 16 bytes), a 0.82% improvement in benchmark scores (was 0.93% with 16 bytes) and a 0.20% increase in binary size (was 0.55%). So for the score-related benchmarks, 8-byte alignment is noticeably worse, but the impact on size is much smaller. Should we use 8-byte alignment to keep the binary size down?

I wouldn't rely too much on LLVM's "benchmarking" suite. It's good for spotting regressions, but not very representative of everything. The reduction in code-size growth is larger than the loss in performance, so I think that's a win.

@davide, comments on the new code size changes?

cheers,
--renato

PS: A quick EEMBC run would also be interesting, given that we're talking about code size on A53.

fhahn added a comment.Jul 24 2017, 6:14 AM

Yeah, I agree LLVM’s benchmarking suite isn’t a good test in itself, which is why I also tried the proprietary benchmarks. Unfortunately I can’t share details about which proprietary benchmarks were or weren’t included.

mcrosier resigned from this revision.Jul 26 2017, 6:10 AM
fhahn added a comment.Jul 27 2017, 8:49 AM

I think we should go with 8-byte alignment for Cortex-A53, as the small improvement from 16-byte alignment is outweighed by the larger increase in size. @davide, what do you think?

I'm fine with this.

rengolin accepted this revision.Jul 28 2017, 3:23 PM

LGTM too. Thanks!

This revision is now accepted and ready to land.Jul 28 2017, 3:23 PM
fhahn closed this revision.Jul 29 2017, 1:05 PM