This patch adds another optimisation to the MVE VPT Optimisations pass introduced in the previous patch.
This optimisation replaces uses of old VPR (VCCR) values within predicated instruction blocks with VPNOTs in order to avoid spills/reloads.
Those VPNOTs can then be removed by the MVE VPT Block Insertion pass, resulting in clean, compact VPT blocks such as TEET or TETE instead of lots of small blocks with spills/reloads in between.
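As a rough illustration, here is a hand-written MIR sketch of the kind of rewrite this enables (not taken from the patch's tests; it reuses the operand layout of the examples further down in this review):

    // Before: a later use of %2 forces it to stay live across the uses of its
    // inverse %3, which tends to cause a VPR spill/reload.
    %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
    %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
    %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %3:vccr, undef %4:mqpr
    %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %5:mqpr

    // After: the later use of %2 is rewritten to use a fresh VPNOT of %3, so the
    // VCCR lifetimes no longer overlap and the block insertion pass can fold the
    // VPNOTs into a single VPT block instead of emitting a spill/reload.
    %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
    %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
    %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %3:vccr, undef %4:mqpr
    %6:vccr = MVE_VPNOT %3:vccr, 0, $noreg
    %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %6:vccr, undef %5:mqpr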
Event Timeline
I'm surprised not to see more test changes; does this really have no effect on any existing tests?
I don't think so; I ran check-llvm-codegen and it was green (with the changes in this patch and the parent patch).
OK. Well, how about a couple more tests with differently ordered instructions too, then? Like inserting the unpredicated VORR in between the predicated ones? And maybe some larger tests that would generate blocks with multiple instructions on a predicate before performing the inversion?
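To make that concrete, a hypothetical test input along those lines could look roughly like this (an illustrative MIR sketch, not taken from the patch; the unpredicated-VORR operand form is an assumption based on the predicated examples later in this review):

    %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
    %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
    %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %3:vccr, undef %4:mqpr
    // Unpredicated VORR in between the predicated ones.
    %5:mqpr = MVE_VORR %0:mqpr, %0:mqpr, 0, $noreg, undef %5:mqpr
    %6:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %6:mqpr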
This kind of code can come up from intrinsics a lot, but not necessarily from the relatively simple codegen tests we have. Someone can write intrinsics knowing that VPNOTs should be folded into a single VPT block, only for LLVM to come along and nicely optimize them all away, producing IR that ends up with worse codegen. This should help us get back to better assembly.
More testing does sound good.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 167–168: addUnpredicatedMveVpredNOp
  Line 233: Should this not be replacing _all_ uses of the old condition with the value from the VPNOT? As in, if we see input code like the pattern in the sketch below, could it not be generally beneficial to insert a VPNOT between the use of b and the second use of a? We could also probably have any amount of code between the predicated uses and it might still be beneficial, considering the cost of spills/reloads vs a single VPNOT.
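(The sketch below is a hypothetical illustration of the pattern being described, not the original snippet from the review; "a" is the original predicate and "b" its inverse.)

    %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg            // a
    %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg                           // b
    %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %2:vccr, undef %4:mqpr   // first use of a
    %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %3:vccr, undef %5:mqpr   // use of b
    %6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %2:vccr, undef %6:mqpr   // second use of a

Predicating the last VORR on a VPNOT of b inserted just before it would keep only one VCCR value live at that point.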
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 233: I believe it's already doing that. I'll add more tests to show it.

      // Stop as soon as we leave the block of predicated instructions
      if (getVPTInstrPredicate(*Iter) == ARMVCC::None)
        break;
- Fixed issues found during review (comments marked as done)
- Improved the pass:
  - It can now move VPNOTs to just before their first user when needed. This currently only happens when another VCCR value is used between the VPNOT and its first user.
  - Instead of inserting VPNOTs before VPNOTs, it now replaces the existing VPNOT with a COPY.
- Added more tests, and they now use virtual registers instead.
- Fixed bugs related to isKill flags in multiple places. I now correctly change the isKill flags when needed, and I added tests for that.
- Rebased the patch
- Added the mve-vpt-blocks.ll test.
Sorry for the delay.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 175: if at there -> if there
  Line 240: Does this only find the first pair in a basic block? What would happen if there are multiple (potential) VPT blocks in a larger basic block, and several of them can be optimized? Can this be done with a linear scan through the block that attempts to find and hold VPNOTs and the original value, converting any uses of the original to the VPNOT? Or does that not work very well for some reason? Maybe using something like MRI->use_instructions would also help.
  Line 256: Can you explain the advantage of moving the VPNOT?
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 240:
    Unfortunately it'd only optimize the first one, as it'd only pick up the first pair of VCMP/VPNOT and not the second.
    I don't think it'd work well, but I can certainly try it and see if it works. I'll also look at use_instructions, but I'm not sure it'll help.
  Line 256: Sometimes you can have code like this (taken from the bottom test):

      %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
      %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
      %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %2:vccr, undef %4:mqpr
      %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %5:mqpr
      %6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %3:vccr, undef %6:mqpr

    %3 is not used directly: the original VCCR value, %2, is used before it, so their lifetimes overlap. If I didn't move the VPNOT further down in such situations, the pass would insert a double VPNOT, like this:

      %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
      %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
      %foo:vccr = MVE_VPNOT %3:vccr
      %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %foo:vccr, undef %4:mqpr
      %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %foo:vccr, undef %5:mqpr
      %bar:vccr = MVE_VPNOT %foo:vccr
      %6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %bar:vccr, undef %6:mqpr

    But since I now move the VPNOT further down, we get this instead, which is of course much better:

      // VPNOT moved down, no more overlapping VCCR lifetimes, no double VPNOTs.
      %2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
      %4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %2:vccr, undef %4:mqpr
      %5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %5:mqpr
      %3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
      %6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %3:vccr, undef %6:mqpr

    That transformation is of course only done if a use of the original VCCR value is found before the first use of the VPNOT's result.
- Added support for optimising multiple VPT blocks in the same Basic Block, and added a test to show it
- Now VCCRValue is only set for VCMPs. (It doesn't really make a difference, but makes it clear that only VCMPs are supported by this optimisation)
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 250: We don't do this for predicated VCMPs?
  Line 257: findRegisterUseOperandIdx != -1
  Line 269: What makes us not do this for other opcodes? (I guess vcmp is the vast majority of cases?)
  Line 286: The second loop -> This second loop, just to be a little clearer.
  Line 302: Do we need to add the copy? We could just use LastVPNOTResult directly? Or would that be more complex?
  Line 305: Set Modified = true here too, I think.
- Minor refactoring of the patch
- The pass is no longer limited to VCMPs for VCCRValue; it can now use any instruction that writes to VPR (e.g. VMSR)
- The pass no longer replaces VPNOTs with copies; it just removes the VPNOT and replaces all of its uses.
- Other minor fixes
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 250: Currently, no, to keep things simple.

      %0:vccr = vpt ...
      %1:vccr = vcmp /* predicated on %0 */
      %2:vccr = vcmp /* unpredicated, opposite of %1 */

    If we replace the second vcmp with vpnot %1:vccr, what happens if the first vcmp isn't executed?
  Line 269: It's just that vcmp is the vast majority of cases. There's no issue with enabling it for other instructions that write to VPR.
LGTM. Thanks.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
  Line 250: It's not about "not executing": a predicated vcmp acts as an "and %0, <newcond>". Overlapping ranges of different predicates certainly does sound more complex, though. Let's leave that for other patches if we find cases where it's needed.
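In lane-mask terms (a rough sketch of the semantics described above, writing %0 for the entering predicate and cond for the compare's lane-wise result):

    %1 = %0 & cond                          (T-predicated vcmp)
    %2 = ~cond                              (unpredicated vcmp, opposite condition)
    vpnot(%1) = ~(%0 & cond) = ~%0 | ~cond

so vpnot(%1) only matches %2 when %0 is all-true; handling this correctly would have to account for the overlapping predicate ranges mentioned above.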