Page MenuHomePhabricator

[Target][ARM] Make Low Overhead Loops coexist with VPT blocks
ClosedPublic

Authored by Pierre-vh on Apr 15 2020, 6:43 AM.

Details

Summary

Previously, the LowOverheadLoops pass couldn't handle VPT blocks that used the vpt instruction, or loops containing multiple identical VCTPs.
This patch improves the LowOverheadLoops pass so it can handle those cases

I'm still unsure about the changes in this patch, so comments/suggestions are welcome.

This patch will also need a follow-up ARMTargetTransformInfo change to work because the TTI, in its current state, won't allow the vectorizer to do tail-predication for loops bigger than 1 basic block, and loops containing compare instructions, and, as VPT blocks are generated from comparisons (which create the predicate), they never make it to this pass in the current state of things.

However, with the right changes to the TTI and the right compiler options, you can generate this kind of code with these changes:

// C++
void test(int* A, int n, int x)  {             
    for(int i = 0; i < n; i++)  
      if (A[i] < x && A[i] > -x)
        A[i] = 0;               
}
// assembly
	dlstp.32	lr, r1
.LBB0_1:                                @ %vector.body
                                        @ =>This Inner Loop Header: Depth=1
	vldrw.u32	q1, [r0]
	vptt.s32	lt, q1, r2
	vcmpt.s32	gt, q1, r3
	vstrwt.32	q0, [r0], #16
	letp	lr, .LBB0_1

Diff Detail

Event Timeline

Pierre-vh created this revision.Apr 15 2020, 6:43 AM
  • removing a one-line-change that didn't belong to this patch (a TTI change)
samparker added inline comments.Apr 17 2020, 6:30 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
186

Something's a miss here! This should have been caught by a test.

830

isIdentical is not going to do what you're hoping for here. Use RDA is figure out if both VCTPs are operating on the same value (not just the same register). Also needs a test. I'm wondering whether we should also assert that the VCTP, if predicated, is 'Then' predicated. An example you had (last week?) confused me because I'm not sure why an Else predicate VCTP should appear here and how it maps the current idea that all predicates are ANDed.

838

Following on from my above comment... the logic here was assuming that each value in VPR.P0 is rooted at the VCTP and then subsequent instructions (only VCMP) can modify the VPR and those instructions have to predicated on the VCTP. This should result in a predicate along the lines of VCTP && VCMP. Now we're trying to allow VPTs, which create a predicate and cannot be predicated by an existing VPT block - so how do we know that the resulting predicate is still rooted at VCTP and not 'disjoint'?
Everything in this pass assumes that any instruction using VPR is predicated on the VCTP and this is the place where we need to make that guarantee. So a couple of things that I can think of now are:

  • that the isVectorPredicated function actually means 'isPredicatedOnVCTP' and that will no longer be true.
  • the notion of 'disjoint' has become more complicated, though a VPT doesn't use the VPR value defined by VCTP, it could be using arguments that are dependent upon the VCTP - and I have the feeling this is the only case that we can support.

These nuances will also require more testing.

842

Format this differently so this statement is attached to the rest of the conditional blocks.

1220–1221

Document the new case(s).

Pierre-vh updated this revision to Diff 258689.Apr 20 2020, 3:04 AM
Pierre-vh marked 6 inline comments as done.

Updated the patch following review.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
186

This function was unused, which is why the mistake was not caught in a test - I fixed it now.

830

Also needs a test. I'm wondering whether we should also assert that the VCTP, if predicated, is 'Then' predicated. An example you had (last week?) confused me because I'm not sure why an Else predicate VCTP should appear here and how it maps the current idea that all predicates are ANDed.

I found an example where a else-predicated VCTP is generated and I added it to the tests, it comes from this C++ code:

void test(int * data, int N, int T)
{
  for(int i = 0; i < N; i++) {
    int d = data[i];
    if (d < T || d > -T)
      data[i] = 0;
  }
}
838

the notion of 'disjoint' has become more complicated, though a VPT doesn't use the VPR value defined by VCTP, it could be using arguments that are dependent upon the VCTP - and I have the feeling this is the only case that we can support.

I added a check in ValidateMVEInst - It'll now refuse VPTs if none of their operands is defined by a predicated instruction, and I added a test for this. Is this enough?

Sorry, I forgot about this. Thanks for that example, I now see how the VCTP can be Else predicated, but I obviously don't quite understand how VPT predication works! The code below is what is generated from the example and now I don't understand why the VSTR is Else predicated. Is it because it needs the same inverted predicate as the VCTP, coming from the VCMP, as well as being ANDed with the VCTP? My intuition wanted it to be Then predicated after the VCTP has already done an inversion.

vctp.32 r1
vpst
vldrwt.u32      q1, [r0]
vptee.s32       ge, q1, r2
vcmpt.s32       le, q1, r3
vctpe.32        r1
vstrwe.32       q0, [r0], #16
subs    r1, #4
le      lr, .LBB0_1

Sorry, I forgot about this. Thanks for that example, I now see how the VCTP can be Else predicated, but I obviously don't quite understand how VPT predication works! The code below is what is generated from the example and now I don't understand why the VSTR is Else predicated. Is it because it needs the same inverted predicate as the VCTP, coming from the VCMP, as well as being ANDed with the VCTP? My intuition wanted it to be Then predicated after the VCTP has already done an inversion.

vctp.32 r1
vpst
vldrwt.u32      q1, [r0]
vptee.s32       ge, q1, r2
vcmpt.s32       le, q1, r3
vctpe.32        r1
vstrwe.32       q0, [r0], #16
subs    r1, #4
le      lr, .LBB0_1

The VPT Blocks pass changes the predicate to E when there is a VPNOT, so if it generated this code, there was a VPNOT between the VCTP/VCMP. This is why the last 2 instructions are else predicated.
The VSTR is else predicated as well because we want it to be executed only when the VCTP is. If it were then predicated, it wouldn't be the case - it would be executed even when both conditions evaluate to true (and the VCTP would be skipped).
(This is kind of related to D77798, which was a miscompilation issue that occured because we initially thought that we had to change the predicate back to a "then" after VPR is written to.)

Ah, thanks. Well, I guess I'm glad I'm not the only one to be confused by this! I'm carry on reviewing the patch now.

samparker added inline comments.May 4 2020, 12:44 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
244

I'm still pondering about this... I think this should be checking that both operands are either predicated q-regs, scalar or are loop invariant, that way should ensure that the result will be the same after the transform. This method also restricts us to instructions that are directly predicated, so it would be good to add a TODO here as well.

809

You don't have to call id() on the register.

1224

Does the block have to contain a vctp? Do we check that this is true?

1285

I might be viewing the nesting incorrectly here, but isn't this handled on line 1250?

1301

Surely we should have discovered all of these during the loop above?

Pierre-vh marked 2 inline comments as done.May 4 2020, 1:39 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
244

I'm still pondering about this... I think this should be checking that both operands are either predicated q-regs, scalar or are loop invariant, that way should ensure that the result will be the same after the transform.

If I understand correctly, this function currently only handles the first case (predicated q-regs), but it should be expanded to allow any r-reg and loop invariants (registers that aren't defined inside any of the loop's basic blocks) ?

This method also restricts us to instructions that are directly predicated, so it would be good to add a TODO here as well.

What do you mean by "directly predicated"? Can you please give me an example of an indirectly predicated instruction in this context?

1224

It doesn't have to contain a VCTP. I'll change this comment to make it clear.

1285

The if at line 1250 is inside the if at line 1244. That else-if is from the if at line 1244.

  • if (1244)
    • if (1250)
  • else if (1299)
1301

The loop above catches the VCTP inside VPT blocks, but not the other ones.

So for instance, if you have this (from /test/CodeGen/Thumb2/LowOverheadLoops/vctp-in-vpt-2.mir)

MVE_VPST 4, implicit $vpr
renamable $vpr = MVE_VCTP32 renamable $r2, 1, killed renamable $vpr
renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv24, align 4)
renamable $vpr = MVE_VCTP32 renamable $r2, 0, $noreg
renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 4, 14, $noreg

The first VCTP will be removed by the loop above, but the "secondary" one below won't - it'll be removed by this loop.

Pierre-vh updated this revision to Diff 261760.May 4 2020, 2:12 AM
Pierre-vh marked 3 inline comments as done.
samparker added inline comments.May 4 2020, 2:50 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
244

By directly predicated, I mean in a VPT block. This pass also tries to understand which instructions are indirectly predicated too, by looking at their use-def chain in LowOverheadLoop::ValidateLiveOuts. Just checking q-regs is fine at the moment, but is it really acceptable is only one operand is predicated?

Pierre-vh added inline comments.May 4 2020, 3:16 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
244

Right, I see, I'll change this so it requires that both operands are either a predicated q-reg, a r-reg, or a loop invariant.
Can I accept every r-reg, or only the loop invariants ones? Can I accept loop-invariant non-predicated q-regs as well?

samparker added inline comments.May 4 2020, 4:07 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
244

I would say predicated or invariant for both register classes. I think should should guarantee what we need.

Pierre-vh updated this revision to Diff 262059.May 5 2020, 4:08 AM

Change IsAcceptableVPT and adding GetReachingDefs

samparker added inline comments.May 6 2020, 7:36 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
277

I feel like this will only be safe if we're still guaranteeing that any predicated instruction is predicated upon the VCTP. I'm concerned about some code that may look llike this:

loop.ph:
  a = ...
  b = ...
  c = ...
  n = ...
  z = ...
  DLS

loop:
  VCTP
  VPTT z, n
  VSTRT
  VPTTTT a, b
  VLDRT d
  VLDRT e
  VADDT z, e, d
  LE loop

Is it possible now to have a VCTP in the loop with nothing actually predicated upon it?

827

I doubt CurrentPredicate is correct here as a VPT would be generating its own, maybe it should be cleared..?

Pierre-vh marked 2 inline comments as done.May 7 2020, 5:32 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
277

I did this with the assumption that VPT blocks that aren't related to the VCTP would be rejected.

Is it possible now to have a VCTP in the loop with nothing actually predicated upon it?

I think it's possible if something clears VPR.P0, or overwrites it, right after the VCTP, like so:

VCTP
VMSR VPR.P0 R0
...

(Of course I don't know if that ever happens)
Other instructions, such VCMP/VPT, "and" their result with VPR.P0 so those should be fine I believe.

What do you think I should do here ? Should I check that the Def is inside a VPT block that we know? (e.g. by saving all predicated Defs in a set and checking whether Def is in the set).

827

According to the Armv8-M Architecture Reference Manual, VPT is similar to VCMP, it and's the result of the comparison with the current element mask (see page 1174) , it doesn't overwrite VPR.P0 completely, so I don't think this needs to be cleared.

samparker added inline comments.May 18 2020, 5:33 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
277

I did this with the assumption that VPT blocks that aren't related to the VCTP would be rejected.

That's the idea, but I suspect that this code now no longer does. I don't mind how you go about it, but yes, somehow we need to add more information into the predicate tracking and we can't accept anything that we cannot reason about.

Pierre-vh updated this revision to Diff 264880.May 19 2020, 6:22 AM
  • Added note explaining how instructions that write to VPR.P0 work
  • Removed the logic to check VPTs, they're acceptable whenever a VCMP is
  • Adding VPT before VCTP test case
    • It was causing a crash, which I fixed by adding a null-pointer check at line 404
  • Added new debug message at line

Is there a test for the case for something like this?

VPTTT
VLDRT
VCTPT
VSTRT

I just want to guarantee that we reject this.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
398

Looking at this again, I guess the logic should be if (Block.HasNonUniformPredicate() && !isVCTP(...)) because then we won't try to dereference a nullptr.

Pierre-vh updated this revision to Diff 265190.May 20 2020, 2:53 AM
Pierre-vh marked an inline comment as done.
  • Allow VCMPs before the VCTP
  • Add test case as requested
  • Revert the null-pointer check I added in the previous patch & refactor the condition as suggested
samparker accepted this revision.May 20 2020, 3:29 AM

Good stuff, thanks for all your work on this. LGTM.

This revision is now accepted and ready to land.May 20 2020, 3:29 AM
This revision was automatically updated to reflect the committed changes.