This is an archive of the discontinued LLVM Phabricator instance.

[Aarch64] Add Cortex-A510 specific scheduling
ClosedPublic

Authored by harviniriawan on Jun 12 2023, 1:59 AM.

Details

Summary

Diff Detail

Unit TestsFailed

Event Timeline

harviniriawan created this revision.Jun 12 2023, 1:59 AM
harviniriawan requested review of this revision.Jun 12 2023, 1:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 12 2023, 1:59 AM
andreadb resigned from this revision.Jun 12 2023, 3:46 AM

Looks good. The CortexA510Write seems to work out well. I had a few comments/questions from looking through the software optimization guide.

Can you add a neon-instructions test file too? And update the sve test to include all the other instructions from the neoverse tests, which include more SVE2 instructions than the original A64FX test.

llvm/lib/Target/AArch64/AArch64SchedA510.td
10–13

This comment can be updated.

81

Is 3 better here? Or is a better value to use for 32bit muls? I see 4 was used in the A55 model too.

123

2 sounds quite low. Is that better than using 3?

Matt added a subscriber: Matt.Jun 12 2023, 10:26 AM

Address comments from David

harviniriawan marked 2 inline comments as done.Jun 12 2023, 1:36 PM
harviniriawan added inline comments.
llvm/lib/Target/AArch64/AArch64SchedA510.td
123

SWOG indicates that integer loads have latency of 2

Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?

llvm/lib/Target/AArch64/AArch64SchedA510.td
21

Integer loads take 2 cycles and that's the value that would be more sensible to use here. The comment should be improved too.

45

BufferSize should be in a block outside all these ProcResource lines.

57

You modeled the optional 128-bit wide VPU. In my opinion, it would be better to model for the worst case, the narrower 64-bit wide VPU. Thus code scheduled for the narrow VPU will typically run better on the wide VPU.

68

I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.

78

64-bit multiplication takes from 4 to 5 cycles. It's usually better to be early than late, so 5 would be a more sensible value here. It also takes up the resource for 2 or 3 cycles, with 2 being a sensible value for ResourceCycles.

rjj added inline comments.Jun 13 2023, 8:00 AM
llvm/lib/Target/AArch64/AArch64SchedA510.td
646–647

I think there are not as many UNDEF patterns as the ones matched here (in other words the regex could be tightened a bit). UNDEF patterns only seem to only exist for ABS|CNOT|NEG.

654

I don't think these UNDEFs exist.

660

No UNDEFs for ADD|SUB|SUBR.

663

These UNDEFs don't seem to exist.

664

Same as above.

There are a few other places where the UNDEF patterns don't seem to exist, but before going through them, any reason why you've included these extra UNDEFs here?

1102

Straggling comment?

harviniriawan marked 12 inline comments as done.

Addressed several comments

Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?

There's no performance degradation on SPEC2k17, and we get significant uplift on some SVE-specific workloads (about 4-5%) on the LITTLE core

llvm/lib/Target/AArch64/AArch64SchedA510.td
21

I think 3 is a good compromise, as in the future I'd like this to be the default scheduling for -mcpu=generic

57

I think we'd like to make the config that is preferred (2x128 bits) to be the default as SVE VL is a minimum of 128 bits.

68

I think it's best to let the hardware takes care of this, as it is internally able to do so

664

I added them because I see from the debug output of machine-scheduler that sometimes it will generate pseudoinstructions with UNDEF* on some benchmarks, which then will pick up the wrong latency information without adding them into the regex. Maybe in the future we should make UNDEF and PSEUDO to always be at the end of the instruction for easier regex matching.

rjj added inline comments.Jun 14 2023, 4:42 AM
llvm/lib/Target/AArch64/AArch64SchedA510.td
664

Yep, I totally understand that- I was not suggesting their complete removal! It's just that in a few places the regex was matching "fictitious" UNDEFs, which personally I try to avoid.

I agree, in the future UNDEF and PSEUDO could go at the end of the instruction to facilitate their matching.

dmgreen added inline comments.Jun 14 2023, 7:53 AM
llvm/lib/Target/AArch64/AArch64SchedA510.td
59

Perhaps unindent this

68

We haven't found the forwarding paths to be super useful in the past. They are difficult to model what is really going on in the core, and we have often found have caused more scheduling inaccuracies than they have helped with. The scheduler doesn't have a great way of modelling skewing.

123

Oh yeah. I think I was looking at the wrong optimization guide at the time.

664

I don't have a strong opinion on the loose regexes, but it would be good to model the UNDEF nodes if we can, as those will be the ones seen during scheduling. SVE seemed to be a place where this scheduling model can have a decent benefit.

harviniriawan marked 3 inline comments as done.Jun 14 2023, 9:55 AM
harviniriawan added inline comments.
llvm/lib/Target/AArch64/AArch64SchedA510.td
664

Perhaps we can revisit this in future patch. I'd also prefer not to add some cryptic UNDEF and PSEUDO if not necessarry

dmgreen accepted this revision.Jun 19 2023, 8:19 AM

OK, that sounds good. It would be good to have some tests for UNDEF nodes if we can, so we can add those separately. From what you have shown the performance has sounded pretty decent for SVE code, so scheduling of those UNDEF nodes should be quite important. Lets get this in and we can iterator from there.

LGTM, with a small Nit.

llvm/lib/Target/AArch64/AArch64SchedA510.td
9

You can add "This file defines the machine model for the ARM Cortex-A510 processors."

This revision is now accepted and ready to land.Jun 19 2023, 8:19 AM
harviniriawan edited the summary of this revision. (Show Details)Jun 20 2023, 2:38 AM
This revision was automatically updated to reflect the committed changes.

How about some test results, please?

llvm/lib/Target/AArch64/AArch64SchedA510.td
21

Methinks that 3 would be too short in general.

57

The VL may be at least 128 bits, but the implementation could still be 64 bits wide. Again, it's better to be early than late when scheduling, so it'd be better to model for 64 bits, which would run well in all A510.

68

That can be said of very super scalar pipelines. When the throughput is just 1, it's important to make it easier for the core to digest instructions as quickly as possible. After all, that's why there are scheduling models.

68

I agree that there are cases when they can hurt throughput, but only when it's greater than 1. The case that I have in mind is that it's better to have, say, a MUL execute in parallel with an unrelated MADD rather than sequentially with a dependent one. But that only applies when there, in this case, more than one multiplier. If there's only one multiplier, then it's important to schedule instructions whose combined latency is smaller if close together.

llvm/test/tools/llvm-mca/AArch64/Cortex/A510-neon-instructions.s