Page MenuHomePhabricator

[Target][ARM] Replace re-uses of old VPR values with VPNOTs
ClosedPublic

Authored by Pierre-vh on Mar 26 2020, 7:21 AM.

Details

Summary

This patch is adds another optimisation to the MVE VPT Optimisations pass introduced in the previous patch.
This optimisation replaces usages of old VPR (VCCR) values (within predicated instruction blocks) with VPNOTs in order to avoid spill/reloads.
Those VPNOTs can then be removed by the MVE VPT Block Insertion pass, resulting in clean/compact VPT blocks such as TEET, TETE, etc. instead of lots of small blocks with spill/reloads in-between.

Diff Detail

Event Timeline

Pierre-vh created this revision.Mar 26 2020, 7:21 AM

I'm surprised not to see more test changes, does this really have no effect on any existing tests?

I'm surprised not to see more test changes, does this really have no effect on any existing tests?

I don't think so, I ran check-llvm-codegen and it was green (with the changes in this patch and the parent patch).

Ok. Well, how about a couple more tests with differently ordered instructions too then? Like inserting the unpredicated VORR inbetween the predicated ones? And maybe some larger tests that would generate blocks with multiple instructions on a predicate before performing the inversion?

I'm surprised not to see more test changes, does this really have no effect on any existing tests?

This kind of code can come up from intrinsics a lot, but not necessarily from the relatively simple codegen tests we have. Someone can write intrinsics knowing that VPNOT's should be folded into a single VPT block, only for llvm to come along and nicely optimize them all away, producing IR that we end up with worse codegen for. This should help us get back to better assembly.

More testing does sound good.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
160–161

addUnpredicatedMveVpredNOp

437

Should this not be replacing _all_ uses of the old condition with the value from the VPNOT?

As in, if we see input code like:
a = VCMP ..
use(a)
b = VPNOT a
..
use(b)
..
use(a)

Could it not be generally beneficial to insert a VPNOT between use of b and the second use of a? We could also probably have any amount of code between the predicated uses and it might still be beneficial, considering the costs of spills/reloads vs a single VPNOT.

Pierre-vh marked 3 inline comments as done.Mar 29 2020, 11:51 PM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
437

I believe it's already doing that. I'll add more tests to show it.
I can certainly do the optimisation even outside blocks of predicated instruction, I just need to remove these lines and add a few tests:

// Stop as soon as we leave the block of predicated instructions
   if (getVPTInstrPredicate(*Iter) == ARMVCC::None)
     break;
Pierre-vh updated this revision to Diff 253848.Mar 31 2020, 5:23 AM
Pierre-vh marked an inline comment as done.
  • Fixed issues found during review (comments marked as done)
  • Improved the pass:
    • It can now move VPNOTs before their first user when needed. This currently only happens when another VCCR value is used between the VPNOT and its first user.
    • Instead of inserting VPNOTs before VPNOTs, it instead replaces the existing VPNOT with a COPY.
  • Added more tests, and they now use virtual registers instead.
Pierre-vh updated this revision to Diff 254524.Apr 2 2020, 7:37 AM
  • Fixed bugs related to isKill flags in multiple places. I now correctly change the isKill flags when needed, and I added tests for that.
  • Rebased the patch - Added the mve-vpt-blocks.ll test.
Pierre-vh updated this revision to Diff 257268.Apr 14 2020, 3:43 AM
  • Rebasing the patch because I inserted D77798 between this and D76709
    • D77798 also added a new test in mve-pred-not and this pass improves it as well.

Sorry for the delay.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
175

if at there -> if there

240

Does this only find the first pair in a basic block? What would happen if there are multiple (potential) vpt blocks in a larger basic block, and several of them can be optimized?

Can this be done with a linear scan through the block that attempts to find and hold VPNOT's and the original value, converting any uses of the original to the VPNOT. Or does that not work very well for some reason? Maybe using something like MRI->use_instructions would also help.

256

Can you explain the advantage of moving the VPNOT?

Pierre-vh marked 3 inline comments as done.Apr 16 2020, 7:11 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
240

What would happen if there are multiple (potential) vpt blocks in a larger basic block, and several of them can be optimized?

Unfortunately it'd only optimize the first one as it'd only pick up the first pair of VCMP/VPNOT and not the second.
I can try to rewrite the function so it handles those cases, I'll do that tomorrow.

Can this be done with a linear scan through the block that attempts to find and hold VPNOT's and the original value, converting any uses of the original to the VPNOT. Or does that not work very well for some reason?

I don't think it'd work well, but I can certainly try it and see if it works. I'll also look at use_instructions, but I'm not sure it'll help.

256

Sometimes, you can have code like this (taken from the bottom test):

%2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
%3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
%4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %2:vccr, undef %4:mqpr
%5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %5:mqpr
%6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %3:vccr, undef %6:mqpr

%3 is not used directly: the original VCCR value, %2 is used before it, so their lifetimes overlap.

If I didn't move the VPNOT further down in such situations, the pass would insert a double VPNOT, like this:

%2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
%3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
%foo:vccr = MVE_VPNOT %3:vccr
%4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %foo:vccr, undef %4:mqpr
%5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %foo:vccr, undef %5:mqpr
%bar:vccr = MVE_VPNOT %foo:vccr
%6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %bar:vccr, undef %6:mqpr

But, since I now move the VPNOT further down, we get this instead, which is of course much better.

// VPNOT moved down, no more overlapping VCCR lifetimes, no double VPNOTs.
%2:vccr = MVE_VCMPs32 %0:mqpr, %1:mqpr, 10, 0, $noreg
%4:mqpr = MVE_VORR %0:mqpr, %1:mqpr, 1, %2:vccr, undef %4:mqpr
%5:mqpr = MVE_VORR %1:mqpr, %0:mqpr, 1, %2:vccr, undef %5:mqpr
%3:vccr = MVE_VPNOT %2:vccr, 0, $noreg
%6:mqpr = MVE_VORR %4:mqpr, %5:mqpr, 1, %3:vccr, undef %6:mqpr

That transformation is of course only done if a use of the orginal VCCR value is found before the first use of the VPNOT's result.

Pierre-vh updated this revision to Diff 258290.Apr 17 2020, 4:41 AM
  • Added support for optimising multiple VPT blocks in the same Basic Block, and added a test to show it
  • Now VCCRValue is only set for VCMPs. (It doesn't really make a difference, but makes it clear that only VCMPs are supported by this optimisation)
Pierre-vh marked 2 inline comments as done.Apr 17 2020, 4:42 AM
dmgreen added inline comments.Mon, May 11, 1:24 AM
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
250

We not do this for predicated VCMP's?

257

findRegisterUseOperandIdx != -1

269

What makes us not do this for other opcodes? (I guess vcmp is the vast majority of cases?)

286

The second loop -> This second loop, just to be a little clearer.

302

Do we need to add the copy? We could just use LastVPNOTResult directly? Or would that be more complex?

305

Set Modified = true here too I think.

Pierre-vh updated this revision to Diff 263153.Mon, May 11, 5:53 AM
Pierre-vh marked 7 inline comments as done.
  • Minor refactoring of the patch
  • The pass is no longer limited to VCMPs for VCCRValue, it can now use any instruction that writes to VPR (e.g. VMSR)
  • The pass no longer replaces VPNOT with copies - it just removes the VPNOT and replaces all of its uses.
  • Other minor fixes
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
250

Currently, no, to keep things simple.
I think that allowing predicated vcmps would require a bit more analysis to ensure that we don't make things worse by accident, and I'm not even sure it'd be worth it (are there any cases where it could be beneficial?)

%0:vccr = vpt ...
%1:vccr = vcmp /* predicated on %0 */
%2:vccr = vcmp /* unpredicated, opposite of %1 */

If we replace the second vcmp with vpnot %1:vccr, what happens if the first vcmp isn't executed?

269

It's just that vcmp is the vast majority of cases. There's no issue with enabling it for other instructions that write to VPR.
I've removed this restriction and added a test with vmsr.

dmgreen accepted this revision.Tue, May 12, 2:16 AM

LGTM. Thanks.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
250

It's not about "not executing", a predicated vcmp acts as a "and %0, <newcond>".

Overlapping ranges of different predicates certainly does sound more complex though. Lets leave that for other patches if we find cases where it's needed.

This revision is now accepted and ready to land.Tue, May 12, 2:16 AM
This revision was automatically updated to reflect the committed changes.