This change gives a 0.25% speedup on execution time, a 0.82% improvement
in benchmark scores and a 0.20% increase in binary size on a Cortex-A53.
These numbers are the geomean results on a wide range of benchmarks from
the test-suite and a range of proprietary suites.
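For context, the change boils down to picking a preferred function-entry alignment per CPU. Below is a minimal standalone sketch of that idea, not the actual AArch64Subtarget.cpp code; the function name and the per-core values reflect the discussion in this thread and are illustrative only:

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for the per-CPU tuning in AArch64Subtarget.cpp
// (names and structure here are hypothetical, not LLVM's).
// Returns the preferred function-entry alignment in bytes.
unsigned prefFunctionAlignment(const std::string &CPU) {
  if (CPU == "cortex-a53")
    return 8;   // value settled on later in this thread
  if (CPU == "cortex-a57")
    return 16;  // quadword fetch window, per the A57 Optimization Guide
  return 4;     // default: natural AArch64 instruction alignment
}
```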
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
I assume you're talking about the llvm test-suite benchmarks. If not, you may want to add a link to your benchmarks :) That said, the size increase seems non-negligible. Have you considered disabling this when optimizing for size?
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
Yes, I meant the llvm test-suite benchmarks :) I'll look into setting PrefFunctionAlignment only when not optimizing for size. That seems like a sensible thing to do and may be worth doing for the other Cortex-A cores too.
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
Thank you. That's greatly appreciated.
Inline comment on lib/Target/AArch64/AArch64Subtarget.cpp, lines 133–136:
After having a look, I found that PrefFunctionAlignment is not set when optimizing for size [1], so we do not have to handle that case in AArch64Subtarget.cpp.
[1] https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/MachineFunction.cpp#L132
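The gating at the linked MachineFunction.cpp line can be paraphrased as follows. This is a self-contained sketch, not the actual LLVM source (the real code reads the function's optsize attribute and the target's preferred alignment); it just shows why no extra check is needed in the subtarget:

```cpp
#include <cassert>

// Paraphrase (hypothetical names) of the optsize gating: the target's
// preferred function alignment is applied only when the function is
// NOT optimizing for size, so the subtarget itself needs no check.
unsigned effectiveAlignment(unsigned PrefFunctionAlignment, bool OptSize) {
  const unsigned MinAlign = 4; // AArch64 instructions are 4 bytes wide
  if (OptSize)
    return MinAlign; // preferred alignment ignored under optsize
  return PrefFunctionAlignment > MinAlign ? PrefFunctionAlignment : MinAlign;
}
```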
We found similar results on spec2k6 for aarch64 that we attributed to function alignment. Have you tried that? I need to dig up the one culprit...
So, it seems it was sphinx, but that was loop alignment, 4 bytes on A53, 8 bytes on A57, to do with the fetch alignment. Maybe this is a related issue. Why 16, though?
Thanks Renato. Yes aligning the function start at 16 byte boundaries is for maximum fetch performance. To quote from the A57 Optimization Guide:
Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can retrieve four (or a full quadword’s worth of) instructions, maximizing fetch bandwidth following the taken branch.
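The fetch-bandwidth argument in the quote can be made concrete: with a 16-byte (quadword) fetch window and fixed 4-byte AArch64 instructions, the number of useful instructions in the first fetch after a taken branch depends on the entry point's offset within the window. A small sketch (the 16-byte window comes from the quote above; the function name is made up):

```cpp
#include <cassert>
#include <cstdint>

// Number of 4-byte instructions available from the first 16-byte
// (quadword) fetch, given the function's entry address.
unsigned instsInFirstFetch(std::uint64_t EntryAddr) {
  const std::uint64_t FetchWindow = 16; // quadword fetch, per the A57 guide
  const std::uint64_t InstSize = 4;     // fixed-width AArch64 encoding
  std::uint64_t Offset = EntryAddr % FetchWindow;
  return static_cast<unsigned>((FetchWindow - Offset) / InstSize);
}
```

A 16-byte-aligned entry gets the full four instructions; an entry 12 bytes into the window gets only one, wasting most of the fetch.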
For Cortex-A53, 8-byte alignment may be enough; I'll run the same set of benchmarks with 8-byte alignment.
Using 8-byte alignment gives a 0.25% speedup on execution time (was 0.23% with 16 bytes), a 0.82% improvement in benchmark scores (was 0.93% with 16 bytes), and a 0.20% increase in binary size (was 0.55%). So for the score-related benchmarks, 8-byte alignment is noticeably worse, but the impact on size is much smaller. Should we use 8-byte alignment to keep the binary size down?
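A back-of-envelope way to see why 8-byte alignment costs so much less in size: with 4-byte instructions, the previous function's end is equally likely to land at any 4-byte offset within the alignment boundary, so the expected padding per function is (align - 4) / 2 bytes: 6 bytes at 16-byte alignment versus 2 at 8-byte. A sketch; the uniform-offset assumption is mine, not measured:

```cpp
#include <cassert>

// Expected padding bytes inserted before a function entry when
// aligning to `Align` bytes, assuming the previous function ends at a
// uniformly random 4-byte offset (illustrative assumption only).
double expectedPadding(unsigned Align) {
  const unsigned InstSize = 4;
  // Possible paddings are 0, 4, ..., Align - 4, each equally likely.
  return (Align - InstSize) / 2.0;
}
```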
I wouldn't rely too much on LLVM's "benchmarking" suite. They're good to spot regressions, but not very representative of all things. The reduction in code size is higher than in performance, so I think that's a win.
@davide, comments on the new code size changes?
cheers,
--renato
PS: A quick EEMBC run would also be interesting, given that we're talking about code size on A53.
Yeah, I agree LLVM’s benchmarking suite isn’t a good test in itself, which is why I also tried the proprietary benchmarks. Unfortunately I can’t share details about which proprietary benchmarks were or weren’t included.
I think we should go with 8 byte alignment for Cortex-A53, as the small improvement of 16 byte alignment is outweighed by the big increase in size. @davide what do you think?