This is an archive of the discontinued LLVM Phabricator instance.

[SchedModels][CortexA55] Add ASIMD integer instructioins
ClosedPublic

Authored by kpdev42 on Jan 11 2022, 3:04 AM.

Details

Summary

Depends on D114642

Original review https://reviews.llvm.org/D112201

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg

Diff Detail

Event Timeline

kpdev42 created this revision.Jan 11 2022, 3:04 AM
kpdev42 requested review of this revision.Jan 11 2022, 3:04 AM

I'm not sure how much I love the predicate matching in the scheduler, as opposed to just matching instructions opcodes. There are quite a few instructions which narrow or enlarge vectorsm where register types are misleading.

Can you at least move the code so it doesn't look like this is bolted onto the end of the existing schedule :)
The SchedWriteRes should live near the other SchedWriteRes, and only the instruction definitions should be with the other instructions, etc.

llvm/lib/Target/AArch64/AArch64SchedA55.td
415

"01" in the dual issue tables means it must be the first item (slot 0). "10" would be EndGroup, and is mostly limited to certains branches and rets.

529

Does this add a lot? It's not really how COPYs work.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-neon-instructions.s
2507

What is the reasoning for the integer multiplies going down the FPMAC pipeline?

kpdev42 added inline comments.Feb 7 2022, 10:36 PM
llvm/lib/Target/AArch64/AArch64SchedA55.td
529

According to our experiments FPU copy (fmov) has latency of 1 cycle and throughput of 2 or 1 (Q-form). According to model integer ALU copy has 3 cycle latency. What would be correct model for COPY in your opinion?

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-neon-instructions.s
2507

I guess mla/mls (ASIMD multiply/accumulate) utilize NEON pipeline. For some reason 2 NEON pipelines of Cortex-A55 are modelled with 5 pipelines (2 x FPALU, 2 x FPMAC, 1 x FPDIV). What you think would be correct resource assignment for mla/mls?

kpdev42 updated this revision to Diff 406708.Feb 7 2022, 10:37 PM

Addressed review comments

dmgreen added inline comments.Feb 10 2022, 2:15 AM
llvm/lib/Target/AArch64/AArch64SchedA55.td
529

Yep - the vector mov latency and throughput sound good to me.

The issue is that a COPY is that post-ra scheduling they won't exist, they will already have been turned into either movs or removed because they were not needed. And pre-RA it is difficult to know if they will be deleted later, if they are just no-op copys. The assumption in a lot of places will be that they will be removed, so adding any scheduling info to them about resources can be incorrect.

Cross register bank copies can be more important, and won't be removed as easily. Those are the ones that transfer between gpr and fpr.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-neon-instructions.s
2507

I'm not entirely sure either way, to be honest. A lot of this has been around from long ago.

From what I can tell, the FPMAC is for floating point operations that are expected to take a long time (the ones that finish out of order in the optimization guide). There are 2 because of the way it splits 128bit operations into 2 64bit operations, and so that models the dual-issue. I'm not sure what FPDIV is. It models the hazards in fsqrt/fdiv maybe?

So I don't think that the integer mla's need to go onto the same FPMAC pipeline. They can go onto into FPALU I think (or maybe it doesn't matter which they go down, but FPALU sounds more correct to me).

kpdev42 updated this revision to Diff 407874.Feb 11 2022, 7:22 AM

Addressed review comments

dmgreen accepted this revision.Feb 14 2022, 1:06 AM

Thanks. LGTM

This revision is now accepted and ready to land.Feb 14 2022, 1:06 AM
This revision was landed with ongoing or failed builds.Feb 17 2022, 2:43 AM
This revision was automatically updated to reflect the committed changes.

Hello.

I'm getting a few reports of this making performance worse, especially on Cortex-A510 cpu's. I think that adding the forwarding paths present on A55, but not available in A510 are causing more hazards and the performance to drop significantly in places, because they are compiled for cpu=generic. The A510 generally has higher throughput, but also higher latencies in places.

We may need to back out some of these changes, even if it makes the A55 model less precise. At least in the short term. We might need to take the route of not hurting other cpus, providing it doesn't help the A55 performance much.

I have partially reverted this in 61b616755aced8ed7afc48ffd152f02194b9d201. I was trying not to undo the whole thing, but just removed the forwarding paths and some other parts that were making performance worse around the "L" instructions. The rest was honestly making some performance worse too, but some stuff was better and the parts removed seemed to be causing much of the change. We probably need to be more careful going forward that we benchmark on more cpu's, not just the Cortex-A55. The schedule is used by any -mcpu=generic compile, so even if it's a less accurate model of the A55, we may need to strike more of a balance between different cpus until we have a better option.