This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Neoverse V2 scheduling model
ClosedPublic

Authored by rjj on Jun 1 2023, 8:35 AM.

Details

Summary

This adds a scheduling model for the Neoverse V2. All information is taken from
the Neoverse V2 Software Optimisation Guide:

https://developer.arm.com/documentation/PJDOC-466751330-593177/r0p2

The model was tested on hardware and performance results are overall neutral
for SPEC2017 INT and FP compared to the Neoverse N2 scheduling model currently
used as default.

Having this model enables a more accurate description of the CPU with llvm-mca.

Depends on D152161

Event Timeline

rjj created this revision. Jun 1 2023, 8:35 AM
rjj requested review of this revision. Jun 1 2023, 8:35 AM
Herald added a project: Restricted Project. Jun 1 2023, 8:35 AM

Thanks for working on this, it looks like a good patch. I looked through some of the details and had a few questions.

It might make sense to use this for some other CPUs like the Cortex-X3, but they have slightly different pipelines, sitting between the N2 and V2 in the number of units. For now we can keep them as-is.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
10865 ↗(On Diff #527428)

Can you do this in a separate patch, in case it causes problems?

llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
203

Should this use the load unit for 3 ResourceCycles, as opposed to being pipelined?

897

Can you explain where the differences between h/q and the other sizes come from?

1027–1028

12 and 20 are worst-case times. Would a value more in the middle of the range be better?

1036

It is usually done with read advances.

llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-neon-instructions.s
400

Add more ldr tests perhaps.

Thanks for working on this.

I'm not sure how much the 'F' vs 'I' pipeline restrictions matter in practice, but I've spotted a few cases where the model will use 'I' instead of the more restrictive 'F'.

llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
986

The flag setting variants use the 'F' pipelines rather than 'I'. The others do use 'I' though, so perhaps a predicate would work here.

994

Same issue with single-cycle flag setting variants using 'F' pipelines.

1016

These also use the 'F' pipelines in the NOLSL case.

mgabka added a subscriber: mgabka. Jun 2 2023, 4:33 AM
rjj updated this revision to Diff 528437. Jun 5 2023, 7:31 AM
rjj marked 5 inline comments as done.
rjj edited the summary of this revision.
  • Fix description of V2Write_5cyc_1I_1L (now V2Write_5cyc_1I_3L)
  • Correct usage of F pipeline in some instructions
  • Add more FP ldr tests
rjj marked 2 inline comments as done and an inline comment as not done. Jun 5 2023, 7:35 AM
rjj added inline comments.
llvm/lib/Target/AArch64/AArch64InstrFormats.td
10865 ↗(On Diff #527428)

Yep of course, done (D152161).

llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
203

You are right, changed to SchedWriteRes<[V2UnitI, V2UnitL, V2UnitL, V2UnitL]>.

897

It's from the software optimisation guide, https://developer.arm.com/documentation/PJDOC-466751330-593177/r0p2 p. 24.

986

Thanks, I've updated the model to use the 'F' pipelines in the cases you pointed out. Though I have a question: according to the SOG the throughput of these instructions is 3 instead of 4, even though there are 4 pipelines available. Do you have any idea why, or how we could accurately model this?

1027–1028

Sure, so maybe 8 and 12 respectively? Do you have a better suggestion? What about the throughput, 1/8 and 1/12?

1036

Thanks, I'll have a look. If you have any pointers to examples where read advances were used to model forwarding of instructions like madd and such, that would be greatly appreciated!

llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-neon-instructions.s
400

I added a few more for H-form LDRs, but if you're referring to the FP loads they should be here already (you can grep for ldr\s[hwxq]).

It is probably good to double check performance numbers after these (small) modifications.

dmgreen added inline comments. Jun 7 2023, 3:47 AM
llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
1027–1028

It is the let ResourceCycles = [12] that will define the throughput. If the instruction uses the V2UnitM0 pipeline for multiple cycles, then other operations that use the same pipeline (for example other divs) will not be able to issue.
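
To illustrate the point about ResourceCycles, here is a toy Python sketch. It is not how llvm-mca or the machine scheduler is implemented; the function names and the single-unit assumption are made up for this example. It only shows that an op holding a unit for 12 cycles blocks the next op on that unit for 12 cycles, whatever its latency is.

```python
def issue_cycles(num_ops, resource_cycles, num_units=1):
    """Return the cycle at which each of num_ops independent ops issues.

    Toy model (an assumption for illustration): every op occupies one
    unit for `resource_cycles` cycles, and an op can only issue once a
    unit has been released.
    """
    free_at = [0] * num_units  # cycle at which each unit is next free
    issued = []
    for _ in range(num_ops):
        # Pick the unit that becomes free first.
        unit = min(range(num_units), key=lambda u: free_at[u])
        start = free_at[unit]
        issued.append(start)
        free_at[unit] = start + resource_cycles
    return issued


def reciprocal_throughput(num_ops, resource_cycles, num_units=1):
    """Average number of cycles between consecutive issues."""
    issued = issue_cycles(num_ops, resource_cycles, num_units)
    return (issued[-1] - issued[0]) / (num_ops - 1)


if __name__ == "__main__":
    # Four back-to-back divides on one unit held for 12 cycles each.
    print(issue_cycles(4, 12))
    print(reciprocal_throughput(4, 12))
    # A fully pipelined op (unit held for 1 cycle) issues every cycle.
    print(issue_cycles(4, 1))
```

With a single unit held for 12 cycles the ops issue 12 cycles apart, which is the 1/12 throughput asked about above.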

1036

The A53/A55 scheduling models have tried to model that on a number of operations (but it hasn't always worked very well). Others use a line like this, although it won't distinguish between IM32 and IM64 (the optimization guide is never very clear about what "similar µOPs" means):

def : ReadAdvance<ReadIMA,     2, [WriteIM32, WriteIM64]>;

If it overrides the regex, it may need to add Reads in the same way as the Falkor processor does.

It is probably fine to leave it as a NOTE for the moment, although it would be nice to model eventually.
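
As a rough illustration of what a ReadAdvance like the one above does to the producer-to-consumer latency, here is a small Python sketch. The function name and the latency value of 4 are hypothetical and are not taken from the V2 model. The arithmetic (subtract the advance from the write latency, clamped at zero) is my understanding of how the operand latency is computed, so treat it as an assumption.

```python
def operand_latency(write_latency, read_advance=0):
    """Cycles a consumer waits for an operand produced by a write.

    Assumed rule: the consumer reads the operand `read_advance` cycles
    after it issues, so the effective latency is the write latency minus
    the advance, and never goes below zero.
    """
    if read_advance > 0 and read_advance > write_latency:
        return 0
    return write_latency - read_advance


if __name__ == "__main__":
    # Hypothetical multiply latency of 4 cycles.
    print(operand_latency(4))     # operand read without an advance
    print(operand_latency(4, 2))  # accumulator read with an advance of 2
```

In this sketch only the operand that carries the ReadAdvance (the accumulator, for ReadIMA) sees the shorter latency; the other operands still wait for the full write latency.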

llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-neon-instructions.s
400

Oh I see. I was expecting them to be in the neon tests, not the basic ones. Sounds good.

Harvin has been looking at writing a Cortex-A510 scheduling model recently, and noticed that most of the SVE instructions use pseudos through a lot of the pipeline instead of the real instructions, like ABS_ZPmZ_UNDEF_B and ADDHA_MPPZ_D_PSEUDO_D. You may find that, especially in pre-RA scheduling, the instructions do not match the real ones added here, and they need extra regexes to match the pseudos. (The same is likely true of the Neoverse-N2 scheduling model.)
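
A small Python sketch of the matching problem described above. The two patterns are hypothetical and are not the ones used in the patch; ABS_ZPmZ_B is assumed to be the name of the real instruction, and whole-name matching is assumed. Python's `re` is also not the regex engine TableGen uses, so this only shows the idea.

```python
import re


def matched(pattern, opcode_names):
    """Return the opcode names that the pattern matches in full."""
    return [name for name in opcode_names if re.fullmatch(pattern, name)]


# The real instruction (assumed name) plus the two pseudos named above.
OPCODES = ["ABS_ZPmZ_B", "ABS_ZPmZ_UNDEF_B", "ADDHA_MPPZ_D_PSEUDO_D"]

# Hypothetical pattern written with only the real instruction in mind.
REAL_ONLY = r"ABS_ZPmZ_[BHSD]"

# Hypothetical pattern extended to also cover the _UNDEF pseudo.
WITH_UNDEF = r"ABS_ZPmZ_(UNDEF_)?[BHSD]"

if __name__ == "__main__":
    print(matched(REAL_ONLY, OPCODES))
    print(matched(WITH_UNDEF, OPCODES))
```

A pseudo that the first pattern misses would have no V2-specific scheduling information attached, which is the concern raised for pre-RA scheduling.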

rjj updated this revision to Diff 530897. Jun 13 2023, 7:13 AM
rjj marked 3 inline comments as done and an inline comment as not done.
  • Match UNDEF instructions
  • Model forwarding

This revision also adds
llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-forwarding.s which contains
tests for instructions that support forwarding.

rjj added a comment. Jun 13 2023, 7:17 AM

> It is probably good to double check performance numbers after these (small) modifications.

As of the last revision, performance on SPEC2017 INT and FP is comparable to that obtained with the N2 model.

> Harvin has been looking at writing a Cortex-A510 scheduling model recently, and noticed that most of the SVE instructions use pseudos through a lot of the pipeline instead of the real instructions, like ABS_ZPmZ_UNDEF_B and ADDHA_MPPZ_D_PSEUDO_D. You may find that, especially in pre-RA scheduling, the instructions do not match the real ones added here, and they need extra regexes to match the pseudos. (The same is likely true of the Neoverse-N2 scheduling model.)

Thanks for the heads up, the UNDEFs should now be getting matched. I'm not sure how this could be tested though, do you have any ideas?

rjj marked 5 inline comments as done. Jun 13 2023, 7:20 AM
rjj added inline comments.
llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
1027–1028

Yep, sorry, I meant to ask whether 8 and 12 seem like reasonable latencies for DIVs, compared to the worst-case times of 12 and 20.

1036

Thanks, this should now be done, with tests in llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-forwarding.s.

rjj marked 2 inline comments as done. Jun 13 2023, 7:21 AM
Matt added a subscriber: Matt. Jun 13 2023, 12:51 PM
dmgreen accepted this revision. Jun 14 2023, 7:19 AM

Thanks. The new test looks good.

The details in the model look good from what I have checked. I think it is worth getting this in; we can iterate from there if needed.

LGTM Thanks.

llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
1027–1028

Honestly I'm not sure what the profile of latencies would look like, for common values of divides. I'm happy to stick with the current values if we don't see them causing problems, and we can adjust them in the future if we need to.

This revision is now accepted and ready to land. Jun 14 2023, 7:19 AM
rjj marked an inline comment as done. Jun 14 2023, 7:34 AM

> Thanks. The new test looks good.
>
> The details in the model look good from what I have checked. I think it is worth getting this in; we can iterate from there if needed.
>
> LGTM Thanks.

Perfect, thanks very much!

llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
1027–1028

Cool, I'll leave the worst-case ones then and we can revisit this later if need be.

This revision was landed with ongoing or failed builds. Jun 14 2023, 8:20 AM
This revision was automatically updated to reflect the committed changes.
rjj marked an inline comment as done.
llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-neon-instructions.s