This feature enables the fusion of such operations on Cortex A57, as recommended in its Software Optimisation Guide, section 4.13, and on Exynos M1.
On A57, it improves the results of a proprietary benchmark by about 20%.
Differential D28491
[AArch64] Add new subtarget feature to fuse AES crypto operations
Authored by evandro on Jan 9 2017, 3:10 PM.
Event Timeline
The MacroFusion pass is currently being added before the RA runs. However, since the AArch64ExpandPseudo pass is run after the RA (in AArch64PassConfig::addPreSched2()), I wonder if it'd make more sense to run the MISched after the RA as well, and not before as it is now. Thoughts?

There are a number of benefits when running the scheduler before register allocation (for example we can still reduce register pressure). We already have the PostMachineScheduler for scheduling again after regalloc (it's based on the same MISched framework but added considerably later in the pipeline; see also TargetSubtargetInfo::enablePostRAScheduler()).

I'm asking this because, looking further at other pairs of instrs that A57 fuses, such as ADRP/ADD, they only appear in the instr stream after pseudo expansion.

Well if there is no reason to ever break the instructions apart, then using a Pseudo instruction and expanding that later may be the easier solution, is that the case for the AES instructions?

No, since they are pretty opaque. But the pseudo MOVaddr is expanded into the pair ADRP/ADD only after the RA. On A57, it's important to schedule them back to back, e.g., by running the MISched after the RA instead of before.

Or rather, I wonder why pseudo expansion is happening this late, when they are very simple instrs in AArch64. Methinks that expanding them sooner would expose them to more optimizations, yes?
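
To make the ordering concern in this thread concrete, here is a minimal toy sketch in C++. It is not the LLVM MachineInstr or MacroFusion API: `Stream`, `expandPseudos`, `isFusiblePair`, and `countFusiblePairs` are hypothetical names invented for illustration, and the AES pairs listed (AESE/AESMC, AESD/AESIMC) are an assumption about which operations the feature targets. It only shows that a pass looking for ADRP/ADD before MOVaddr is expanded has nothing to find, while the AES pair is visible either way.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction stream holding opcode names only (hypothetical model,
// not LLVM's MachineBasicBlock).
using Stream = std::vector<std::string>;

// Hypothetical stand-in for pseudo expansion: MOVaddr becomes ADRP + ADD.
Stream expandPseudos(const Stream &in) {
  Stream out;
  for (const std::string &op : in) {
    if (op == "MOVaddr") {
      out.push_back("ADRP");
      out.push_back("ADD");
    } else {
      out.push_back(op);
    }
  }
  return out;
}

// Hypothetical predicate for pairs that should stay back to back.
// The AES pairs are an assumption; ADRP/ADD comes from the discussion.
bool isFusiblePair(const std::string &first, const std::string &second) {
  return (first == "AESE" && second == "AESMC") ||
         (first == "AESD" && second == "AESIMC") ||
         (first == "ADRP" && second == "ADD");
}

// Count adjacent fusible pairs that a pass could see in this stream.
std::size_t countFusiblePairs(const Stream &s) {
  std::size_t n = 0;
  for (std::size_t i = 0; i + 1 < s.size(); ++i)
    if (isFusiblePair(s[i], s[i + 1]))
      ++n;
  return n;
}
```

Calling `countFusiblePairs` on a stream that still contains `MOVaddr`, and again on the result of `expandPseudos`, shows the difference: the ADRP/ADD candidate only exists in the expanded stream, which is why the position of the scheduler relative to pseudo expansion matters for that pair but not for the AES instructions.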