This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Avoid generating indexed vector instructions for Exynos
ClosedPublic

Authored by az on Jun 21 2016, 3:06 PM.

Details

Summary

Avoid generating indexed vector instructions for Exynos. This is needed for fmla/fmls/fmul/fmulx.

For example, the instruction fmla v0.4s, v1.4s, v2.s[1] is less efficient than the instructions dup v2.4s, v2.s[1] ; fmla v0.4s, v1.4s, v2.4s

Diff Detail

Repository
rL LLVM

Event Timeline

az updated this revision to Diff 61447.Jun 21 2016, 3:06 PM
az retitled this revision from to Avoid generating indexed vector instructions for Exynos.
az updated this object.
az added a reviewer: evandro.
az added a subscriber: llvm-commits.
flyingforyou added inline comments.
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
451 ↗(On Diff #61447)

Is this line necessary?

evandro retitled this revision from Avoid generating indexed vector instructions for Exynos to [AArch64] Avoid generating indexed vector instructions for Exynos.Jun 22 2016, 12:57 PM
evandro added reviewers: rengolin, t.p.northover.
rengolin edited edge metadata.Jun 22 2016, 1:00 PM

Hi Evandro,

Unless Exynos chips can't handle indexed lanes at all, this looks like a case for the cost model, not CPU flags.

cheers,
--renato

evandro added inline comments.Jun 22 2016, 1:13 PM
llvm/lib/Target/AArch64/AArch64.td
96 ↗(On Diff #61447)

The ISA manual refers to such operations as "vector by element", so I'd prefer something like 's/VectorIndexing/VectorByElement/'.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
311 ↗(On Diff #61447)

s/Indexing/ByElement/

llvm/lib/Target/AArch64/AArch64Subtarget.h
82 ↗(On Diff #61447)

s/Indexing/ByElement/

189 ↗(On Diff #61447)

s/Indexing/ByElement/

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
2 ↗(On Diff #61447)

Refrain from setting the CPU in tests. Rather, use the feature that you added with "-mattr=no-vector-instruction-indexing", or rather, "-mattr=no-vector-by-element"

az updated this revision to Diff 66395.Aug 1 2016, 4:26 PM
az edited edge metadata.

Redone this work as an optimization instead of modifying TableGen. The optimization is based on the latency of instructions and only the latencies for Exynos enable this optimization. This is mainly tested on intrinsic code.

az added inline comments.Aug 1 2016, 4:35 PM
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
2 ↗(On Diff #66395)

It is not a feature anymore. It is now an optimization that is currently triggered for Exynos only.

This looks better now, taking the scheduling costs into account. Thanks!

Though, I'm dubious as to whether AArch64InstrInfo.cpp is really the right place to land this. This looks like a job for a new pass...

az added a comment.Aug 2 2016, 3:00 PM

What is posted here is a sub-pass in the backend peephole optimizer. The code implementing this optimization is in AArch64/AArch64InstrInfo.cpp. This is very similar to the optimizations optimizeCondBranch(), or optimizeCmpInstr () where both are sub-passes of peephole and the main implementation is in AArch64/AArch64InstrInfo.cpp. I can leave it mainly as is with minor modification such as moving the code to some other file or even a new file. I am also perfectly fine with creating a new pass. This optimization does not deserve, in my opinion, a new pass but it may be good idea to create a new pass so that we can include future similar optimizations in there. Let me know what you think. Thanks for looking at this.

az updated this revision to Diff 67039.Aug 5 2016, 3:37 PM

This version takes the previous patch functionality and code and puts it into a new AArch64 standalone optimization pass.

This optimization does not deserve, in my opinion, a new pass but it may be good idea to create a new pass so that we can include future similar optimizations in there. Let me know what you think. Thanks for looking at this.

I'm not a big fan of adding optimisations to AArch64InstrInfo because that's a place for more generic codegen. I agree with you, this is a bit heavy for such a small gain on a single core, but it's also a big piece of unrelated code to live inside InstrInfo.

It's possible that we could fuse all those passes into one, so we don't have to iterate over everything multiple times. But maybe this is not a job for this commit.

As it stands, I'm happy either way, as long as we make this more sensibly in due time. @t.p.northover, any comments on that?

However, my inline comments are still pertinent, no matter the choice, since that code will land on a new file or in AArch64InstrInfo anyway.

cheers,
--renato

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
83 ↗(On Diff #67039)

Nit: "replaceInstruction" sounds like it actually replaces the instruction. This should be named "shouldReplaceInstruction".

257 ↗(On Diff #67039)

You only use Src2IsKill to set getKillRegState(Src2IsKill), maybe you should cache the result of getKillRegState(Src2IsKill) instead? Same to others.

263 ↗(On Diff #67039)

Why are you setting DupDest in reuseDUP, and re-setting it here?

Same for 4 ops below.

267 ↗(On Diff #67039)

Coding style: no newline between "}" and "else". All other cases, too.

284 ↗(On Diff #67039)

This looks odd and may be prone to future errors. Please change to:

} else {
  return false;
}
315 ↗(On Diff #67039)

This seems like an odd pattern. Can't you just add the MIs to a list when they change and delete all of them in the end?

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
387 ↗(On Diff #67039)

What about the dups?

az updated this revision to Diff 69921.Aug 31 2016, 4:35 PM

Addressed all the comments (sorry for the delay as I was away)

az marked 5 inline comments as done.Aug 31 2016, 4:46 PM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
264 ↗(On Diff #69921)

If DupDest is set in reuseDup (i.e. we found a dup instruction that we can reuse), then we are not re-setting it here given that reuseDup returns true. Otherwise, DupDest is set here. I added more comments in this new revision but I can rewrite this function to make things more clear if needed.

az marked an inline comment as done.Sep 12 2016, 3:43 PM

ping

sebpop added a subscriber: sebpop.Sep 19 2016, 7:44 AM

As it stands, I'm happy either way, as long as we make this more sensibly in due time. @t.p.northover, any comments on that?

It seems like @rengolin is ok with this change.
@t.p.northover could you please review as well? Thanks!

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
285 ↗(On Diff #69921)

@az please address this comment as well.

az updated this revision to Diff 71889.Sep 19 2016, 4:29 PM

addressed a couple of overlooked comment

az marked 4 inline comments as done.Sep 19 2016, 4:35 PM
az added a comment.Sep 28 2016, 7:35 AM

Ping again! We need this patch.

az updated this revision to Diff 72841.Sep 28 2016, 8:54 AM

fix some formatting issues.

Perhaps I missed this, but how is the pass being enabled for *only* Exynos? AFAICT, it's enabled for all AArch64 targets.

sebpop added inline comments.Sep 28 2016, 9:11 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
12 ↗(On Diff #72841)

Let's add a bit more description of what this pass does, and an example from below...

131 ↗(On Diff #72841)

Remove the else as there is a return stmt in then clause.

167 ↗(On Diff #72841)

... this example, and the comment at the top of the file can contain some of the text above.

177 ↗(On Diff #72841)

Remove this empty stmt.

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
2 ↗(On Diff #72841)

Let's also add a comment here saying that we need the instructions cost of Exynos-M1 to trigger the transform.

Perhaps I missed this, but how is the pass being enabled for *only* Exynos? AFAICT, it's enabled for all AArch64 targets.

I think Sebastian answered my question indirectly; the instructions cost of Exynos-M1 triggers the transform.

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

Thanks Chad for catching this.
There seems to be another compile time improvement that we could do: see comment inline.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
106 ↗(On Diff #72841)

Let's move this init() call together with ST and TII outside the for(BB) for(Insn) loops: we can call it in runOnFunction().

az updated this revision to Diff 72920.Sep 28 2016, 4:06 PM

Added a simple check to exit this pass early on so that no analysis is done for targets that do not need it.
Since there was a push against adding any target feature when this patch was first introduced, the check is mainly comparing the latency of indexed fmla with its replacement.

az marked 6 inline comments as done.Sep 28 2016, 4:14 PM
rengolin added inline comments.Sep 28 2016, 5:25 PM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
463 ↗(On Diff #72920)

Unnecessary white space change

flyingforyou added inline comments.Sep 28 2016, 5:49 PM
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
2 ↗(On Diff #72920)

I think more proper check-prefix is EXYNOSM1, if future exynos core has a possibility that can be different from now.

sebpop added inline comments.Sep 28 2016, 9:49 PM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
127 ↗(On Diff #72920)

Please move the above 10 lines of code out of the loops to...

341 ↗(On Diff #72920)

... move the code here.

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
2 ↗(On Diff #72920)

As EXYNOS is only a marker for FileCheck, I think it is not important to change to EXYNOSM1: it is clear from the -mcpu flag that these checks are for Exynos-M1.

3 ↗(On Diff #72920)

s/triggers/trigger/

sebpop added inline comments.Sep 28 2016, 10:06 PM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
171 ↗(On Diff #72920)

This only checks for 1 pair of the current 12 pairs of instructions that may be replaced.
I think we should put all the 12 pairs in a vector and iterate through all of them.

224 ↗(On Diff #72920)

In that case the 12 cases currently handled can be refactored like this:

(DupMCID, MulMCID) = find(MI.getOpcode())

Adding new transform patterns would be by adding them to the vector.

az updated this revision to Diff 73069.Sep 30 2016, 9:30 AM

Moved some code outside a loop for better compiler time.

az marked 5 inline comments as done.Sep 30 2016, 9:39 AM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
171 ↗(On Diff #72920)

Since all the instructions of concern, so far, are closely related and are done by the same hardware unit, then they are most likely showing the same behavior. No need to increase compile time for now.

sebpop accepted this revision.Sep 30 2016, 3:02 PM
sebpop added a reviewer: sebpop.

LGTM.
Before commit, let's wait for @t.p.northover or another maintainer to accept as well.
@evandro maybe you can approve this patch as you are the maintainer of Exynos-M1.

This revision is now accepted and ready to land.Sep 30 2016, 3:02 PM
rengolin added inline comments.Sep 30 2016, 6:57 PM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
127 ↗(On Diff #73069)

Quick question: what's the point of this flag? It's hidden, enabled b default and we're not testing it...

az updated this revision to Diff 73500.Oct 4 2016, 9:52 AM
az edited edge metadata.

Removed the EnableVectorByElement flag as it is not that useful for now.

az marked an inline comment as done.Oct 4 2016, 9:53 AM
az added inline comments.
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
127 ↗(On Diff #73069)

Removed it.

rengolin accepted this revision.Oct 4 2016, 11:47 AM
rengolin edited edge metadata.

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

Just to make sure your question is answered: earlyExitVectElement() does that job in a generic way. :)

evandro accepted this revision.Oct 7 2016, 1:55 PM
evandro edited edge metadata.
This revision was automatically updated to reflect the committed changes.