
[ARM] Add an optimization to avoid S-register forwarding hazards
Needs Review · Public

Authored by asavonic on Mar 5 2021, 3:30 AM.

Details

Reviewers
evgeny777
dmgreen
Summary

Register forwarding hazards might occur when one uop reads a D- or
Q-register operand that has recently been written with one or more
S-register results. This happens only in AArch32 state on Cortex-A57,
Cortex-A72, Cortex-A77 (and probably other processors as well).
See the Cortex-A72 Software Optimization Guide, section 4.4 "Register
Forwarding Hazards", for more details.
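
To make the condition concrete, here is a minimal sketch of the overlap check in C++. It is not taken from the patch: `RegClass`, `VfpReg` and `isForwardingHazardCandidate` are hypothetical names, and the sketch only models the architectural aliasing (S0-S31 overlay D0-D15, and Qn overlays D(2n) and D(2n+1)). It does not model how "recently" the write happened, which depends on the microarchitecture.

```cpp
#include <cassert>

// Hypothetical register descriptor for illustration; not an LLVM type.
enum class RegClass { S, D, Q };

struct VfpReg {
  RegClass cls;
  unsigned index; // S0-S31, D0-D31 or Q0-Q15
};

// Index of the first 32-bit slot covered by the register.
// Slot k corresponds to Sk for k < 32; D16-D31 have no S aliases.
unsigned firstSlot(VfpReg r) {
  switch (r.cls) {
  case RegClass::S: return r.index;
  case RegClass::D: return r.index * 2;
  case RegClass::Q: return r.index * 4;
  }
  return 0;
}

// Number of 32-bit slots covered by the register.
unsigned slotCount(VfpReg r) {
  switch (r.cls) {
  case RegClass::S: return 1;
  case RegClass::D: return 2;
  case RegClass::Q: return 4;
  }
  return 0;
}

// True if the two registers share at least one 32-bit slot.
bool overlaps(VfpReg a, VfpReg b) {
  unsigned aBegin = firstSlot(a), aEnd = aBegin + slotCount(a);
  unsigned bBegin = firstSlot(b), bEnd = bBegin + slotCount(b);
  return aBegin < bEnd && bBegin < aEnd;
}

// Candidate for the hazard described above: an S-register write
// followed by a read of an overlapping D- or Q-register.
bool isForwardingHazardCandidate(VfpReg written, VfpReg read) {
  return written.cls == RegClass::S && read.cls != RegClass::S &&
         overlaps(written, read);
}
```

For example, a write to S1 followed by a read of D0 is a candidate, while a write to S2 followed by a read of D0 is not, because S2 aliases the low half of D1.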

The pass replaces S-register writes with the corresponding scalar
writes to D-registers. If there is no suitable replacement, an
S-register is copied to a D-register scalar via a core register.
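
The "corresponding scalar" follows from the register aliasing: S(2n) is lane 0 of Dn and S(2n+1) is lane 1 of Dn. A minimal sketch of that mapping in C++, using hypothetical names (`DLane`, `sRegToDLane`) that are not part of the patch:

```cpp
#include <cassert>

// Hypothetical result type for illustration: a 32-bit lane of a D-register.
struct DLane {
  unsigned dReg; // D-register index, 0-15
  unsigned lane; // 0 = low 32 bits, 1 = high 32 bits
};

// Map an S-register index (0-31) to the D-register lane it aliases.
DLane sRegToDLane(unsigned sReg) {
  assert(sReg < 32 && "only S0-S31 exist");
  return DLane{sReg / 2, sReg % 2};
}
```

With this mapping, a write to S5 corresponds to a scalar write to D2[1].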

The pass is disabled by default. It can be enabled with the
-arm-subreg-write LLVM option at a non-zero optimization level.

With this optimization, llvm-test-suite/MultiSource/Benchmarks/Bullet
shows ~10% performance improvement on Cortex-A72.

The pass has also been tested on the Skia library. Skia's nanobench shows
~1.3% geomean improvement and ~10% improvement for some subtests on
Cortex-A72.

Event Timeline

asavonic created this revision. · Mar 5 2021, 3:30 AM
asavonic requested review of this revision. · Mar 5 2021, 3:30 AM
Herald added a project: Restricted Project. · Mar 5 2021, 3:30 AM
asavonic edited the summary of this revision. · Mar 9 2021, 4:25 AM
asavonic added reviewers: evgeny777, dmgreen.

Hello. There is the ExecutionDomainFix pass, which tries to mitigate the same issue, but only for mov instructions, I believe. There is also the A15SDOptimizer pass, which does something similar. Is it possible to reuse or consolidate any of these?

As far as I understand, Cortex-A76 and newer cores will only see this hazard between S and Q registers, not D registers, and even then it will be less of an issue.

Also, going via the integer registers sounds very slow. Are you sure that would be better than hitting the hazard? I was told that using vdup d0, d0[0] may be a way to mitigate it without resorting to integer registers.