This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Enable useAA() for the in-order Cortex-R52
ClosedPublic

Authored by dmgreen on Jun 12 2018, 4:59 AM.

Details

Summary

This option allows codegen (such as DAGCombine or MI scheduling) to use alias analysis information, which can help codegen on in-order CPUs. Here I have done things the same way as AArch64, adding a subtarget feature to enable this for specific cores.
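As a rough illustration of the approach (not the actual patch), the sketch below models a feature bit that is set only for selected cores and read back through a useAA() hook. The SubtargetSketch type and its CPU-name check are hypothetical stand-ins; in LLVM the bit would come from a tablegen subtarget feature attached to the processor definition rather than from a string comparison.

```cpp
#include <string>

// Hypothetical stand-in for a target subtarget class. It is not LLVM's
// ARMSubtarget; it only models the "feature bit gates useAA()" idea.
struct SubtargetSketch {
  // Feature bit: off by default, so other cores keep their old behaviour.
  bool UseAA = false;

  explicit SubtargetSketch(const std::string &CPU) {
    // Assumption for illustration: only this in-order core opts in.
    if (CPU == "cortex-r52")
      UseAA = true;
  }

  // Hook queried by codegen passes to decide whether to consult alias analysis.
  bool useAA() const { return UseAA; }
};
```

Keeping the default at false means the behaviour changes only for cores that explicitly carry the feature.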

I was going to enable this for A53 too, but since we do not have an AArch32 A53 schedule, it would be less useful there than on R52.

Diff Detail

Repository
rL LLVM

Event Timeline

dmgreen created this revision. Jun 12 2018, 4:59 AM

Requires D48029 to survive a bootstrap, but that looks like a more generic error than having to use this option. Otherwise I believe this is safe.

dmgreen edited reviewers, added: t.p.northover, rengolin; removed: hfinkel. Jun 12 2018, 5:09 AM
dmgreen added a reviewer: hfinkel.

LGTM but will wait for others to comment as well before accepting

javed.absar accepted this revision. Jun 14 2018, 12:45 AM
This revision is now accepted and ready to land. Jun 14 2018, 12:45 AM

I'm generally not a fan of having features like this, which have widespread implications, turned on only for certain target CPUs; it tends to make it much harder to find bugs, since the code gets little testing. But I guess this is okay for now.

Yes, I can see that. I would have liked to turn this on for more in-order cores, but without enough scheduling information to at least say that a load takes multiple cycles, I didn't feel I had a great justification. For the record, these were the changes I saw on an A53 with useAA returning true (units are time, so lower is better; only changes of more than 2% are listed):

SingleSource/Benchmarks/BenchmarkGame/n-body -14.38%
SingleSource/Benchmarks/Shootout/Shootout-lists -6.40%
SingleSource/Benchmarks/Misc-C++/Large/ray -6.20%
MultiSource/Applications/ALAC/encode/alacconvert-encode -5.44%
MultiSource/Benchmarks/McCat/17-bintr/bintr -3.27%
SingleSource/Benchmarks/CoyoteBench/huffbench -3.15%
MultiSource/Benchmarks/SciMark2-C/scimark2 -2.97%
MultiSource/Benchmarks/Bullet/bullet -2.50%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -2.33%
SingleSource/Benchmarks/Misc/richards_benchmark -2.20%
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 +4.60%
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 +9.66%

They don't look too bad, but there are some performance decreases. enc-pc1 is genuinely worse; yacr2 might be noise. And without instruction scheduling, they may be getting lucky. Compile time increase was roughly 0.25% on CT-mark (may not be statistically significant, but there were enough alternating runs to make me think it's probably close).

I tried it on the A72 too, on both T32 and A64, with more varied results, both showing several large increases in places (including a memcpy benchmark). This option, as far as I can tell, should give more freedom to the DAG, but that may not be used in the best way all the time.
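To make the "more freedom" point concrete, here is a small sketch of the kind of code where alias information matters. The addOffset function is hypothetical, and __restrict is a compiler extension (accepted by Clang, GCC and MSVC), not standard C++. If the backend can prove that dst and src do not overlap, the second load is free to move above the first store; without that information it has to stay behind it. Whether any given scheduler actually takes advantage of this is not something the sketch shows.

```cpp
#include <cstddef>

// Hypothetical kernel. __restrict promises the compiler that dst and src
// never overlap, which is the kind of fact alias analysis can supply.
void addOffset(int *__restrict dst, const int *__restrict src, std::size_t n) {
  // Process two elements per iteration so a store is followed by a load.
  for (std::size_t i = 0; i + 1 < n; i += 2) {
    dst[i] = src[i] + 1;          // store to dst[i]
    dst[i + 1] = src[i + 1] + 1;  // load of src[i + 1] may be hoisted above
                                  // the store when no aliasing is known
  }
  // Handle the last element when n is odd.
  if (n % 2 != 0)
    dst[n - 1] = src[n - 1] + 1;
}
```

The result is the same whichever order the loads and stores are issued in; only the set of legal schedules changes.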

This revision was automatically updated to reflect the committed changes.