- Update the Cortex-A510 mcpu target to use A510 scheduling info instead of the A55's. The values are taken from the Cortex-A510 software optimisation guide: https://developer.arm.com/documentation/PJDOC-466751330-536816/latest
- Set the latency of most integer ops to 1: the CPU microarchitecture is able to resolve most integer ops in a single cycle (see the sketch below).
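A minimal sketch of the kind of entries involved, assuming illustrative resource names rather than the exact patch contents (the fragment would live inside the Cortex-A510 scheduling model definition):

```
// Helper write class carrying an explicit latency on a given unit
// (mirrors the CortexA510Write helper discussed in the review).
class CortexA510Write<int n, ProcResourceKind res> : SchedWriteRes<[res]> {
  let Latency = n;
}

// Simple integer ALU operations resolve in a single cycle on this core.
// CortexA510UnitALU is an assumed resource name, used here for illustration.
def : WriteRes<WriteI, [CortexA510UnitALU]> { let Latency = 1; }
```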
Differential Revision: https://reviews.llvm.org/D152688
Looks good. The CortexA510Write seems to work out well. I had a few comments/questions from looking through the software optimization guide.
Can you add a neon-instructions test file too? And update the sve test to include all the other instructions from the neoverse tests, which include more SVE2 instructions than the original A64FX test.
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 9–12 | This comment can be updated. |
| 80 | Is 3 better here, or is there a better value to use for 32-bit muls? I see 4 was used in the A55 model too. |
| 122 | 2 sounds quite low. Is that better than using 3? |
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 122 | The SWOG indicates that integer loads have a latency of 2. |
Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 21 | Integer loads take 2 cycles, and that's the value that would be more sensible to use here. The comment should be improved too. |
| 45 | BufferSize should be in a block outside all these ProcResource lines. |
| 57 | You modeled the optional 128-bit wide VPU. In my opinion, it would be better to model for the worst case, the narrower 64-bit wide VPU. Thus code scheduled for the narrow VPU will typically run better on the wide VPU. |
| 68 | I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less. |
| 78 | 64-bit multiplication takes from 4 to 5 cycles. It's usually better to be early than late, so 5 would be the more sensible value here. It also occupies the resource for 2 or 3 cycles, with 2 being a sensible value for ResourceCycles (see the sketch after this table). |
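A hedged sketch of what the two suggestions above could look like in the .td file; the unit names, resource counts, and the BufferSize value are assumptions made for illustration, not the actual file contents:

```
// Hoist the shared BufferSize into one let-block around the unit definitions
// instead of repeating it on every ProcResource line (value assumed).
let BufferSize = 0 in {
  def CortexA510UnitALU : ProcResource<3>;  // integer ALU pipes (count assumed)
  def CortexA510UnitMAC : ProcResource<1>;  // multiply/divide pipe (assumed)
}

// 64-bit multiply: model the longer 5-cycle latency ("better early than
// late") and keep the MAC pipe busy for 2 cycles via ResourceCycles.
def : WriteRes<WriteIM64, [CortexA510UnitMAC]> {
  let Latency = 5;
  let ResourceCycles = [2];
}
```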
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 646–647 | I think there are not as many UNDEF patterns as the ones matched here (in other words, the regex could be tightened a bit). UNDEF patterns only seem to exist for ABS\|CNOT\|NEG (see the sketch after this table). |
| 654 | I don't think these UNDEFs exist. |
| 660 | No UNDEFs for ADD\|SUB\|SUBR. |
| 663 | These UNDEFs don't seem to exist. |
| 664 | Same as above. There are a few other places where the UNDEF patterns don't seem to exist, but before going through them, is there any reason why you've included these extra UNDEFs here? |
| 1102 | Straggling comment? |
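For illustration, a tightened instregex along the lines being discussed might look as follows; the opcode and suffix naming, the latency, and the CortexA510UnitVALU resource name are assumptions rather than values from the patch:

```
// Hypothetical: accept the _UNDEF pseudos only for the opcodes that actually
// have them (ABS/CNOT/NEG), instead of appending the suffix to every
// alternative in the regex.
def : InstRW<[CortexA510Write<3, CortexA510UnitVALU>],
             (instregex "^(ABS|CNOT|NEG)_ZPmZ_[BHSD](_UNDEF)?$")>;
```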
There's no performance degradation on SPEC2k17, and we get a significant uplift on some SVE-specific workloads (about 4-5%) on the LITTLE core.
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 21 | I think 3 is a good compromise, as in the future I'd like this to be the default scheduling model for -mcpu=generic. |
| 57 | I think we'd like to make the preferred config (2x128 bits) the default, as the SVE VL is a minimum of 128 bits. |
| 68 | I think it's best to let the hardware take care of this, as it is internally able to do so. |
| 664 | I added them because I see from the debug output of machine-scheduler that it sometimes generates pseudoinstructions with UNDEF* on some benchmarks, which then pick up the wrong latency information unless they are added to the regex. Maybe in the future we should make UNDEF and PSEUDO always appear at the end of the instruction name for easier regex matching. |
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 664 | Yep, I totally understand that; I was not suggesting their complete removal! It's just that in a few places the regex was matching "fictitious" UNDEFs, which I personally try to avoid. I agree that in the future UNDEF and PSEUDO could go at the end of the instruction name to facilitate their matching. |
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 60 | Perhaps unindent this. |
| 68 | We haven't found the forwarding paths to be super useful in the past. It is difficult to model with them what is really going on in the core, and we have often found they cause more scheduling inaccuracies than they help with. The scheduler doesn't have a great way of modelling skewing. |
| 122 | Oh yeah. I think I was looking at the wrong optimization guide at the time. |
| 664 | I don't have a strong opinion on the loose regexes, but it would be good to model the UNDEF nodes if we can, as those will be the ones seen during scheduling. SVE seemed to be a place where this scheduling model can have a decent benefit. |
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 664 | Perhaps we can revisit this in a future patch. I'd also prefer not to add some cryptic UNDEF and PSEUDO entries if not necessary. |
OK, that sounds good. It would be good to have some tests for UNDEF nodes if we can, so we can add those separately. From what you have shown, the performance sounds pretty decent for SVE code, so scheduling of those UNDEF nodes should be quite important. Let's get this in and we can iterate from there.
LGTM, with a small nit.
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 10 | You can add "This file defines the machine model for the ARM Cortex-A510 processors." |
How about some test results, please?
| llvm/lib/Target/AArch64/AArch64SchedA510.td | |
|---|---|
| 21 | Methinks that 3 would be too short in general. |
| 57 | The VL may be at least 128 bits, but the implementation could still be 64 bits wide. Again, it's better to be early than late when scheduling, so it'd be better to model for 64 bits, which would run well on all A510s. |
| 68 | That can be said of very superscalar pipelines. When the throughput is just 1, it's important to make it easier for the core to digest instructions as quickly as possible. After all, that's why there are scheduling models. |
| 68 | I agree that there are cases when they can hurt throughput, but only when it's greater than 1. The case that I have in mind is that it's better to have, say, a MUL execute in parallel with an unrelated MADD rather than sequentially with a dependent one. But that only applies when there is, in this case, more than one multiplier. If there's only one multiplier, then it's important to schedule close together those instructions whose combined latency is smaller when they are adjacent (sketched below). |
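For reference, modelling the MUL-to-MADD accumulator forwarding being debated here would be a one-line ReadAdvance; the advance of 2 cycles is an assumption, not a value taken from the optimisation guide:

```
// Hedged sketch: let the accumulate operand of a multiply-accumulate read
// its input 2 cycles late, so a dependent MADD effectively sees a shorter
// MUL latency when the two are scheduled back to back.
def : ReadAdvance<ReadIMA, 2, [WriteIM32, WriteIM64]>;
```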