Avoid generating indexed vector instructions for Exynos. This is needed for fmla/fmls/fmul/fmulx.
For example, the instruction fmla v0.4s, v1.4s, v2.s[1] is less efficient than the instructions dup v2.4s, v2.s[1] ; fmla v0.4s, v1.4s, v2.4s
Paths
| Differential D21571
[AArch64] Avoid generating indexed vector instructions for Exynos ClosedPublic Authored by az on Jun 21 2016, 3:06 PM.
Details Summary Avoid generating indexed vector instructions for Exynos. This is needed for fmla/fmls/fmul/fmulx. For example, the instruction fmla v0.4s, v1.4s, v2.s[1] is less efficient than the instructions dup v2.4s, v2.s[1] ; fmla v0.4s, v1.4s, v2.4s
Diff Detail Event Timelineaz updated this object. flyingforyou added inline comments.
evandro retitled this revision from Avoid generating indexed vector instructions for Exynos to [AArch64] Avoid generating indexed vector instructions for Exynos.Jun 22 2016, 12:57 PM Comment Actions Hi Evandro, Unless Exynos chips can't handle indexed lanes at all, this looks like a case for the cost model, not CPU flags. cheers,
az edited edge metadata. Comment ActionsRedone this work as an optimization instead of modifying TableGen. The optimization is based on the latency of instructions and only the latencies for Exynos enable this optimization. This is mainly tested on intrinsic code.
Comment Actions This looks better now, taking the scheduling costs into account. Thanks! Though, I'm dubious as to whether AArch64InstrInfo.cpp is really the right place to land this. This looks like a job for a new pass... Comment Actions What is posted here is a sub-pass in the backend peephole optimizer. The code implementing this optimization is in AArch64/AArch64InstrInfo.cpp. This is very similar to the optimizations optimizeCondBranch(), or optimizeCmpInstr () where both are sub-passes of peephole and the main implementation is in AArch64/AArch64InstrInfo.cpp. I can leave it mainly as is with minor modification such as moving the code to some other file or even a new file. I am also perfectly fine with creating a new pass. This optimization does not deserve, in my opinion, a new pass but it may be good idea to create a new pass so that we can include future similar optimizations in there. Let me know what you think. Thanks for looking at this. Comment Actions This version takes the previous patch functionality and code and puts it into a new AArch64 standalone optimization pass. Comment Actions
I'm not a big fan of adding optimisations to AArch64InstrInfo because that's a place for more generic codegen. I agree with you, this is a bit heavy for such a small gain on a single core, but it's also a big piece of unrelated code to live inside InstrInfo. It's possible that we could fuse all those passes into one, so we don't have to iterate over everything multiple times. But maybe this is not a job for this commit. As it stands, I'm happy either way, as long as we make this more sensibly in due time. @t.p.northover, any comments on that? However, my inline comments are still pertinent, no matter the choice, since that code will land on a new file or in AArch64InstrInfo anyway. cheers,
az added inline comments.
Comment Actions
It seems like @rengolin is ok with this change.
Comment Actions Perhaps I missed this, but how is the pass being enabled for *only* Exynos? AFAICT, it's enabled for all AArch64 targets.
Comment Actions
I think Sebastian answered my question indirectly; the instructions cost of Exynos-M1 triggers the transform. I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets. Comment Actions
Thanks Chad for catching this.
Comment Actions Added a simple check to exit this pass early on so that no analysis is done for targets that do not need it.
Comment Actions LGTM. This revision is now accepted and ready to land.Sep 30 2016, 3:02 PM
az edited edge metadata. Comment ActionsRemoved the EnableVectorByElement flag as it is not that useful for now. rengolin edited edge metadata. Comment Actions
Just to make sure your question is answered: earlyExitVectElement() does that job in a generic way. :) Closed by commit rL283663: [AArch64] Avoid generating indexed vector instructions for Exynos (authored by spop). · Explain WhyOct 8 2016, 7:18 AM This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 72841 llvm/lib/Target/AArch64/AArch64.h
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
llvm/lib/Target/AArch64/CMakeLists.txt
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
|
Quick question: what's the point of this flag? It's hidden, enabled b default and we're not testing it...