This is an archive of the discontinued LLVM Phabricator instance.

[Target][ARM] Add a fix for an LSR Pattern that can't be tail-predicated
AbandonedPublic

Authored by Pierre-vh on Apr 28 2020, 4:23 AM.

Details

Summary

The LSR pass can generate an undesired pattern which hurts tail predication in some cases.
However, fixing LSR directly isn't easy and could break other targets, so instead of changing LSR, we decided to fix it in the MVETailPredicationPass.

This patch improves the MVETailPredication pass so it can detect the undesirable pattern and rewrite it in a tail-predication-friendly form.

Here is an example of the IR that LSR can generate

loopbody:
  %lsr.iv = phi i32 [ %lsr.iv.next, %loopbody ], [ %42, %pred ]
  %44 = add i32 %lsr.iv, -4
  %45 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %44) #5
  ; ... etc
  %lsr.iv.next = add nsw i32 %lsr.iv, -4

That can't be tail-predicated because the VCTP's operand is defined inside the loop, so this patch will rewrite it like this:

pred:
  %44 = add i32 %42, -4
loopbody:
  %lsr.iv = phi i32 [ %lsr.iv.next, %loopbody ], [ %42, %pred ]
  %lsr.fixed = phi i32 [ %lsr.iv.next, %loopbody ], [ %44, %pred ]
  %45 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %lsr.fixed)
  ; ... etc
  %lsr.iv.next = add nsw i32 %lsr.iv, -4

That IR is functionally the same as before, but causes no issue for tail-predication.

Diff Detail

Event Timeline

Pierre-vh created this revision.Apr 28 2020, 4:23 AM
samparker added inline comments.Apr 28 2020, 7:19 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
199

I think getLoopPreheader() would be preferable.

326

We haven't even inserted the VCTP yet! Your test shouldn't contain the intrinsic. If TryConvert returns the VCTP, then you could use it directly and not have to iterate through the loop doing your add search.

Pierre-vh marked an inline comment as done.Apr 29 2020, 4:40 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
326

I think this pattern (generated by LSR) can only happen when there is a VCTP intrinsic already present.

The reproducer I had was using it, and I can't think of a way to trigger the same issue without the VCTP intrinsic.
Also, if I were to implement this by using the result of TryConvert, it wouldn't fix the original issue as it wouldn't catch existing VCTP intrinsics.

I should make this clearer in the patch, or perhaps this fix is just misplaced? (Not in the right pass?)
What do you think?

samparker added inline comments.Apr 30 2020, 4:55 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
326

Ah okay, then yes, this is not the right place to fix this issue. I think it makes sense to do it in ARMLowOverheadLoops before we decide that we can't do tail predication.

dmgreen added inline comments.Apr 30 2020, 11:05 AM
llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tp-lsr-patterns.ll
2

I think that if we have what is essentially a phase-ordering problem between two passes in the backend, it's worth running an llc test to check the whole thing.

Bonus points for adding the test before and just showing the diffs in the patch.

90

This test can probably be cleaned up a bit too? These attribute are often unnecessary if you get the args correct.

Pierre-vh marked an inline comment as done.May 1 2020, 12:03 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
326

It looks like fixing it there will be a bit harder.

The MIR going into ARMLowOverheadLoops looks like this:

  renamable $r0, dead $cpsr = nsw tADDi3 renamable $r1, 2, 14, $noreg
  ; ...
  renamable $lr = nuw nsw t2ADDrs renamable $r9, killed renamable $r1, 19, 14, $noreg, $noreg
  ; ...
  t2DoLoopStart renamable $lr
bb.3 (%ir-block.51, align 2):
  ; ...
  renamable $r0, dead $cpsr = tSUBi8 killed renamable $r0(tied-def 0), 4, 14, $noreg
  ; ...
  renamable $vpr = MVE_VCTP32 renamable $r0, 0, $noreg

I'd need to prove that $r0 == $lr to remove the $r0 definitions and replace the uses with $lr. I don't know how challenging that is?
I can try to do it if you want, but I think it's easier to do this in an IR pass because the pattern is easier to find and rewrite there.

samparker added inline comments.May 1 2020, 12:33 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
326

$lr is the iteration count, while $r0 is the element count so they shouldn't be the same value. We need to ensure that r0 (element count) only changes by -4 for each iteration and that we can replicate the tSUB into the preheader, while sinking the original past the VCTP.

Pierre-vh marked an inline comment as done.May 1 2020, 12:57 AM
Pierre-vh added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
326

What would the final MIR look like? I can't just move the sub down and add another one in the header, else the result will be different as I'll be decrementing $r0 twice on the first iteration, so something else has to change too, right?

Pierre-vh abandoned this revision.May 5 2020, 3:43 AM

Closing this for now as we plan to fix this in LSR instead.