Page MenuHomePhabricator

[Target][ARM] Adding MVE VPT Optimisation Pass
ClosedPublic

Authored by Pierre-vh on Mar 24 2020, 8:26 AM.

Details

Summary

This patch adds a pass called "MVE VPT Optimisations", which does a few optimisations before register allocation.
The goal of this pass is to maximize the size of the VPT blocks created by the MVE VPT Block Insertion pass.

Currently, this pass:

  • Replaces VPCMPs with VPNOTs when possible.
    • The instruction selector in its current state doesn't generate VPNOTs very often. Instead, it generates a VCMP with the operands swapped and the condition reversed. This pass spots those VCMPs and transforms them into VPNOTs.
    • Why generate more VPNOTs? So the MVE VPT Block Insertion pass can use them (& remove them) to create larger/more complex VPT blocks (e.g. TEET, TETE, etc.)
  • Replaces usages of old VPR values with VPNOTs when inside a block of predicated instructions.
    • This is done to avoid overlapping lifetimes of different VPR values, reducing the chance that a spill/reload occurs.
    • Why ? Spill/reloads of VPR are particularly harmful to the MVE VPT Block Insertion Pass: it prevents it from creating large VPT blocks.

Diff Detail

Event Timeline

Pierre-vh created this revision.Mar 24 2020, 8:26 AM

The goal of this pass is to maximize the size of the VPT blocks created by the MVE VPT Block Insertion pass.

Just a general question first, why a separate pass? Why e.g. not just doing this in the MVE VPT Block Insertion pass?

Pierre-vh added a comment.EditedMar 24 2020, 1:48 PM

The goal of this pass is to maximize the size of the VPT blocks created by the MVE VPT Block Insertion pass.

Just a general question first, why a separate pass? Why e.g. not just doing this in the MVE VPT Block Insertion pass?

The first optimisation (VCMPs into VPNOTs) could technically be done in the MVE VPT Block insertion pass, but I think it's better to do it in this new pass instead of overloading the block Insertion pass too much.
The second optimisation (VPNOT Insertion for "spill prevention") can't be done in the Block Insertion pass as it needs to be done before register allocation (= before spill/reload instructions are emitted).
Overall, since both optimisations deal with VPNOTs and aren't directly related to VPT Block insertion/creation, I felt that it was a good idea to put them in a separate pass.

Is it possible to split this into two patches? The pass and "replaces VCMPs with VPNOTs when possible" part, then the second part to replace the re-use with the not. I think that would make each part easier to review, more manageable.

The first optimisation (VCMPs into VPNOTs) could technically be done in the MVE VPT Block insertion pass, but I think it's better to do it in this new pass instead of overloading the block Insertion pass too much.
The second optimisation (VPNOT Insertion for "spill prevention") can't be done in the Block Insertion pass as it needs to be done before register allocation (= before spill/reload instructions are emitted).
Overall, since both optimisations deal with VPNOTs and aren't directly related to VPT Block insertion/creation, I felt that it was a good idea to put them in a separate pass.

Plus (I think) in order to do #2, it's easier if you have already done #1.

llvm/lib/Target/ARM/ARMTargetMachine.cpp
490

Perhaps put this into the below getOptLevel() != CodeGenOpt::None block? As it is an optimisation

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
8

Can you add a file comment here like in other files, explaining what the pass does, like you have in the commit message.

55

Could you just use VCMPOpcodeToVPT, and check the return value isn't 0? To save this extra list being needed here.

llvm/test/CodeGen/Thumb2/mve-vcmpf.ll
700 ↗(On Diff #252329)

What's going on here? I think it needs to be checking q1 <= q0 && q0 < q1. ord means ordered, which means no NaN's. (Floating point compares can be tricky like that).

llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir
2

I think this test would be simpler if there was a number of test functions inside it, each testing one thing (or maybe a small collection of things). Otherwise they run into each other a bit, and it will be hard to tell what's wrong if one of them starts to fail.

Pierre-vh marked 5 inline comments as done.Mar 25 2020, 12:46 AM

Is it possible to split this into two patches? The pass and "replaces VCMPs with VPNOTs when possible" part, then the second part to replace the re-use with the not. I think that would make each part easier to review, more manageable.

The first optimisation (VCMPs into VPNOTs) could technically be done in the MVE VPT Block insertion pass, but I think it's better to do it in this new pass instead of overloading the block Insertion pass too much.
The second optimisation (VPNOT Insertion for "spill prevention") can't be done in the Block Insertion pass as it needs to be done before register allocation (= before spill/reload instructions are emitted).
Overall, since both optimisations deal with VPNOTs and aren't directly related to VPT Block insertion/creation, I felt that it was a good idea to put them in a separate pass.

Plus (I think) in order to do #2, it's easier if you have already done #1.

Unfortunately, I've written this pass in a single commit, so there is no easy way for me to split this patch in 2.
I can do it if you want, it's not impossible, but it's going to take me a while to get right. Also, the second optimisation is a relatively small part of this patch, so patch #1 would still be a large patch.

llvm/test/CodeGen/Thumb2/mve-vcmpf.ll
700 ↗(On Diff #252329)

I didn't know that. In which case is it not "safe" to replace a float VCMP even when the conditions are met?
What would be the correct behaviour in this case?
Should I disable this optimisation for all float VCMPs?

llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir
2

Would putting them all in different basic blocks be enough, or do you prefer functions?
The optimisation is done per basic-block, so it'd behave the same as function (but the test would be shorter).

Unfortunately, I've written this pass in a single commit, so there is no easy way for me to split this patch in 2.
I can do it if you want, it's not impossible, but it's going to take me a while to get right. Also, the second optimisation is a relatively small part of this patch, so patch #1 would still be a large patch.

Sure. If it's relatively small it should hopefully be simple enough to pull out into a separate commit.

Also can you make sure that "register" and "zero" variants of the VCMP's are tested. The inverting login around those might be different to a normal compare.

llvm/test/CodeGen/Thumb2/mve-vcmpf.ll
700 ↗(On Diff #252329)

Umm. I think for floats you cannot swap the operands. That only works for integers. You can still use the opposite condition code (lt <> ge, for example). Have a look at getOppositeCondition and isValidMVECond in ISelLowering for how it's done there.

llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir
2

Separate functions would be more canonical. You can hopefully simplify the test quite a bit to make it so that's not too verbose.

10–16

Functions don't actually need bodies. They can just contain unreachable. And if the test below doesn't bare any resemblance to the code here, that is probably for the better.

21–23

I think a lot of this can probably be removed, if you replace +mve.fp it with command line options.

28–65

I think you can remove a lot of this.

samparker added inline comments.Mar 25 2020, 7:36 AM
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
128

Maybe you could the MachineInstr method 'definesRegister' instead?

Pierre-vh updated this revision to Diff 252835.Mar 26 2020, 7:17 AM
Pierre-vh marked 5 inline comments as done.

Refactoring the patch as requested + fixing a few issues.

Pierre-vh updated this revision to Diff 252845.Mar 26 2020, 7:36 AM

Adding newline at the end of the file and refactoring canHaveOperandsSwapped: It has been renamed to CanHaveSwappedOperands and it now returns true for everything except for all VCMPr and VCMPf16/f32 instructions.

dmgreen added inline comments.Mar 29 2020, 3:08 PM
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
142

How much does this function add? It doesn't seem to do a huge amount.

159–160

Do you have any tests for this bit?

180–181

You can do BuildMI(...).add(Instr.getOperand(0)).addReg(PrevVCMPResultReg)...

183–184

Operand 4 and 5 are always None and noreg? If so you can use addUnpredicatedMveVpredNOp, which makes it more obvious what the operands are expected to be.

llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir
155–156

I was expecting these registers to be virtual, given where this is in the pipeline. Will they be physical instead?

Pierre-vh marked 7 inline comments as done.Mar 29 2020, 11:31 PM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
142

Sure, it doesn't add much, I'll remove it.

159–160

Not yet, I'll add some.

llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir
155–156

In the pipeline, they'd be virtual.
Should I replace all $vpr/$q0/$q1 here with virtual registers ? It won't make much of a difference testing wise but I can understand that virtual registers would be preferred.

Pierre-vh updated this revision to Diff 253847.Mar 31 2020, 5:18 AM
Pierre-vh marked 2 inline comments as done.
  • Fixed issues found in review (comments marked as done)
  • Added "ARM" in front of the pass's name.
  • Changed the test so it uses virtual registers everywhere.

Looking good.

Can you add some extra ll tests, made from intrinsics that show different blocks being created, but testing the entire backend. All kinds of things from simple to complicated if you can. A good collection of VCMP, conditional VCMP, other conditional instructions and VPNOT's in different positions. Then add --verify-machine-instr to it. It doesn't matter if they are not all optimal yet, they will show what is working well and what isn't yet.

Pierre-vh updated this revision to Diff 254521.Apr 2 2020, 7:35 AM
    • "rebased" the patch - I renamed IsWritingToVCCRorVPR to IsWritingToVCCR in the child revision, so I renamed it here as well. I also removed the line that checked if ARM::VPR was used, as it was useless outside of tests.
    • I added a new test, mve-vpt-blocks.ll that uses some intrinsics to attempt to generate all possible VPT blocks from LLVM IR. In this patch, it doesn't generate every block successfully due to spill/reloads, but that's fixed in the child revision.
  • Finally, I fixed a bug related to isKill flags on register and added a test for it.
dmgreen added inline comments.Apr 2 2020, 11:53 PM
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
193

Should we be clearing this value at some points? Setting it back to nullptr?

Pierre-vh marked 2 inline comments as done.Apr 3 2020, 12:01 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
193

Sure, it should be cleared after. I'll see if I can add a test for that as well.

Pierre-vh updated this revision to Diff 254716.Apr 3 2020, 12:52 AM
Pierre-vh marked an inline comment as done.

PrevVCMPResultKiller is now correctly reset back to nullptr, but I didn't add a test for it as it was not useful (there was nothing to test).
It's pretty much an NFC, the behaviour is the exact same as before, but it's indeed more correct to reset it once a VCMP has been replaced by a VPNOT.

The reason why this change doesn't really impact the behaviour of the pass is that PrevVCMPResultKiller is either:

    • nullptr, so nothing happens.
  • Contains the "VCMP Result Killer" for the current PrevVCMP (Which is what we want)
  • Contains the "VCMP Result Killer" for *a previous* PrevVCMP, which is not correct but doesn't cause issues, it's just going to call setIsKill(false) twice on the same operand.
    • This case would only happen if 2 VCMPs are replaced with VPNOTs in the same basic block, but the first one set PrevVCMPResultKiller while the second one didn't.

Correctly resetting it to nullptr after use just removes the third possibility.

dmgreen accepted this revision.Apr 3 2020, 2:10 AM

Nice one. LGTM

This revision is now accepted and ready to land.Apr 3 2020, 2:10 AM
This revision was automatically updated to reflect the committed changes.