This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Cortex-A55 scheduler model
ClosedPublic

Authored by SjoerdMeijer on May 15 2018, 8:29 AM.

Details

Summary

This adds the Cortex-A55 scheduler model.

The optimisation guide describing Cortex-A55 micro-architecture in more detail
can be downloaded here:

https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf

This model showed decent performance uplifts for a number of benchmarks on our
models, but we are interested in feedback if the same trend is observed on hardware;
Samsung was interested in providing feedback on this.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.May 15 2018, 8:29 AM
SjoerdMeijer added inline comments.May 15 2018, 8:33 AM
lib/Target/AArch64/AArch64SchedA55.td
32

Forgot to mention that I will look into this soon.

LGTM but then I wrote it originally so better to wait for comments from others.

lib/Target/AArch64/AArch64SchedA55.td
43

nitpick = "64-bi"

I will update benchmark results until next week.
benchmark list : dhrystone, spec2000, spec2006, one commercial benchmark.

Sjoerd, is it ok?

Yes, excellent, thank you!

When I tried to test this patch, it showed some improvements on several benchmarks like Spec2000/2006.
But other benchmarks like dhrystone, commercial benchmark's sub workloads show performance degradation.

We need to improve this scheduler model & I would be happy to test updated patch.

Thanks.

When I tried to test this patch, it showed some improvements on several benchmarks like Spec2000/2006.
But other benchmarks like dhrystone, commercial benchmark's sub workloads show performance degradation.

We need to improve this scheduler model & I would be happy to test updated patch.

Thanks.

Good to hear of the improvements in SPEC numbers.

Could you please provide some more information, if possible, on the possible source of regressions (i.e. where cycles were lost). Are the commercial benchmarks - memory intensive? floating-point intensive? control dominant.

Hi Junmo, many thanks for sharing this. Looks like we have some more work and further tuning to do!

Hi Javed.

Could you please provide some more information, if possible, on the possible source of regressions (i.e. where cycles were lost). Are the commercial benchmarks - memory intensive? floating-point intensive? control dominant.

About commercial benchmark(geekbench) result, I will run it more than 100 times & figure out which one has problem. I suspect the issue is with new branch predictor in CA55. But I am not sure now.
I think you cat get the answer about category of commercial benchmarks. :)

Jackson added a subscriber: Jackson.Sep 7 2018, 5:13 AM
chill added a subscriber: chill.Oct 4 2018, 10:30 AM
evgeny777 added inline comments.
lib/Target/AArch64/AArch64SchedA55.td
160

@SjoerdMeijer Latency/hazard for FDivDP seem to not match those in optimization guide for Cortex-A55 (should be 22/19). Why's that? There are mismatches in other places also (for instance WriteLD should (?) be 3 cycles, not 4)

SjoerdMeijer added inline comments.Sep 10 2020, 10:48 AM
lib/Target/AArch64/AArch64SchedA55.td
160

Just for a bit of context, I put this scheduler up for review to share it so people could pick it up. We knew it needed some more tuning (but we didn't have the bandwidth), which was confirmed with some benchmarks, see some earlier messages.

I haven't checked, but I assume you're right about those latencies. I am guessing they were oversights. For the FDivDP case, which its long latency, I am not sure it would make a practical difference though, but anyway. And for the WriteLD, it may be the case that this was giving better performance (schedmodels are not always an exact science), but it may have been a typo too.

If you plan to improve this, I would be happy to support this with reviews etc. and see if we can get this committed.

@SjoerdMeijer, I could confirm that D51160 with this scheduler model shows the improvement and it had been successfully applied to our in-house compiler.
Also, there was no issue about performance & regressions with this patch. So, I think that you can commit this with minor fixes @evgeny777 mentioned before.

evgeny777 added a comment.EditedSep 20 2020, 11:30 PM

@flyingforyou There are numerous places where latencies are different from those in arm_cortex_a55_software_optimization_guide_v2.pdf. Values in the guide seems to be correct, at least they match my measurements on real piece of hardware.
Also there are some forwarding paths not listed by model

@evgeny777 : how about we commit this to have a baseline that we iterate on?

@evgeny777 : how about we commit this to have a baseline that we iterate on?

I feel like that could cause problems in the short term, if the information in it is incorrect and it degraded performance. We could perhaps commit it without enabling it to begin with.

I would like to see the tests being improved too. Something like the equivalent of llvm/test/tools/llvm-mca/ARM/m4-int.s would be very useful for checking the values are all present and what we expect.

We could perhaps commit it without enabling it to begin with.

Excellent, let's start with that then. Once we have something committed, we can iterate on it, that's the best way forward now. I will do the commit under this ticket, and will open a review for any other follow up work (e.g. the extra tests).

We could perhaps commit it without enabling it to begin with.

I think enabling this commit is better than using cortex-a53 scheduler model. But I can agree with this idea.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 21 2020, 2:55 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 21 2020, 2:55 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
SjoerdMeijer added a comment.EditedSep 21 2020, 2:59 AM

I have committed a first version that we can now iterate on; it is not enabled/used yet.

@evgeny777 : sounds like you've got your environment all setup. Would it be easy for you to quickly test the changes that you suggested earlier? At least the latency changes?
As I am interested in enabling this by default soon, that is my proposal: let's get the small things that are wrong out of the way, then enable it by default because this it is already better than the current A53 model as @flyingforyou mentioned.

FYI: I have created https://reviews.llvm.org/D88017 to enable the model.
I will now look at the optimisation guide, correct the obvious mistakes, and create a diff for that, unless @evgeny777 you get there first, let me know.

@SjoerdMeijer

sounds like you've got your environment all setup. Would it be easy for you to quickly test the changes that you suggested earlier?

Sorry, I'm involved in slightly different activity atm. Will do whenever I can.

@SjoerdMeijer

sounds like you've got your environment all setup. Would it be easy for you to quickly test the changes that you suggested earlier?

Sorry, I'm involved in slightly different activity atm. Will do whenever I can.

thanks, see also my comment in D88017, in which I have addressed some comments.