This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Transforming memset to Tail predicated Loop
ClosedPublic

Authored by malharJ on Apr 13 2021, 7:16 PM.

Details

Summary

This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).

The llvm.memset intrinsic is converted to a TP loop for both
constant and non-constant input sizes (of llvm.memset).

Depends on D99723

Diff Detail

Event Timeline

malharJ created this revision.Apr 13 2021, 7:16 PM
malharJ requested review of this revision.Apr 13 2021, 7:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2021, 7:16 PM

Can you base this on top of D99723? Some of the code may be able to be shared, even if there will be differences.

llvm/lib/Target/ARM/ARMISelLowering.cpp
11078

It might be worth creating a dup in EmitTargetCodeForMemset and having this use the vector value it produced.

malharJ updated this revision to Diff 338382.Apr 18 2021, 9:02 AM
malharJ marked an inline comment as done.

Moved vdup creation to inside EmitTargetCodeForMemset()

malharJ updated this revision to Diff 338385.Apr 18 2021, 9:43 AM
malharJ marked 2 inline comments as not done.
  • Combined test files of this patch and parent patch (D99723)
  • Corrected some minor formatting errors
malharJ marked an inline comment as done.Apr 18 2021, 9:58 AM
malharJ added inline comments.
llvm/test/CodeGen/Thumb2/mve-phireg.ll
194

@dmgreen ,

This seems to be happening because of a 16-byte spill ( line 169 ).

And it's being generated after I updated the code to create the VDUP in
EmitTargetCodeForMemset() instead of during the MIR level transform.

I'm not sure why the spill is happening (not really familiar with RA), but do you
think it's worth investigating ? It doesnt seem to happen in the simpler tests.

dmgreen added inline comments.Apr 19 2021, 2:06 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
11210

This and genTPLoopBodyMemcpy are very similar. Is it possible to combine them more?

11211

Formatting.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
22

One option for both is probably fine.

299

OptSize?

301

It's best to create a shuffle vector or build vector, not a ARMISD::VDUP directly. That may optimize better in places.

Is the input always an i8?

302

Does it need Src.getValue(0)?

llvm/test/CodeGen/Thumb2/mve-phireg.ll
194

Yeah, this test might be difficult like that, it's designed to spill a lot. It's probably fine so long as we don't see it happening in other places.

malharJ updated this revision to Diff 339253.EditedApr 21 2021, 8:38 AM
malharJ marked 7 inline comments as done.

Addressed review comments:

  • replaced vdup node with buildSplatVector
    • looks like it gets converted to vdup.8 in most cases, apart from the memclr case where it become vmov.i32 <vec_reg> #0x0
  • combined TP loop body generation functions into one
  • added conditions for generating inline TP loop for memset, similar to memcpy case
  • minor formatting changes
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
301

I've made this update with a build vector. Have 2 minor queries:

  • Why would shuffle vector be appropriate here (given that all we want to create is a vector of constants) ?
  • Even though I'm not utilising vdup directly, just to understand better, why would the input need to be i8 to generate a v16i8 vector ... can the vector not be generated by a i32 source register ?
dmgreen added inline comments.Apr 25 2021, 12:19 AM
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
133

Maybe shouldGenerateInlineTPLoop would be a more descriptive name?

145

for -> For, probably with a full stop on the previous line.

150

Make sure you clang-format the patch.

301

Do you mean "Why use a shuffle as opposed to a ARMISD::VDUP"? Normal codegen will usually start off by producing a shuffle, which will then be optimized and eventually turned into a VDUP. Because it works that way around there are not a lot of optimizations on VDUP directly, as can be seen with the constant becoming a VMOVimm. We could theoretically add them, but its added complexity and it's easier to just use the optimizations that are already present by starting from a shuffle.

The input should be able to be an i32, but the types would need to be correct.

Do we know the type of the value is always an i8?

malharJ updated this revision to Diff 340399.Apr 25 2021, 4:23 PM
malharJ marked 2 inline comments as done.

Minor formatting updates to address review comments.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
301

Ok. Thanks for clarifying about the shuffle vector.

Do we know the type of the value is always an i8?

I have a truncate operation on line 298 which ensures the input is i8.

malharJ marked an inline comment as done.Apr 25 2021, 4:24 PM
dmgreen added inline comments.Apr 26 2021, 6:41 AM
llvm/lib/Target/ARM/ARMISelLowering.h
54

.. for tail predicated loops.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
301

Yep, but if the value isn't an i8 it will discard some bits it should not. Something like a @llvm.memset.p0i8.i32 or @llvm.memset.p0i32.i32, if they are valid.

Is it possible to add an assert at least?

malharJ added inline comments.Apr 27 2021, 10:32 AM
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
301

So having a look at the language ref for llvm.memset suggests that the Src value is always an i8 (It may get zero extended before reaching here but that's not a problem) ..

malharJ updated this revision to Diff 341075.Apr 27 2021, 10:30 PM
malharJ marked an inline comment as done.

Rebased on top of latest parent patch to get the fix for clang-format issues.
Minor formatting update.

dmgreen added inline comments.Apr 30 2021, 3:52 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
11279

Why has this changed from false to true?

What happens if CPSR is live across the MVE_MEMCPYLOOPINST? WhileLoopStart and LoopEnd both clobber that physical register, incase they get reverted to subs; bne. Do we need to add the same clobber to these new MVE mem loop instructions?

malharJ updated this revision to Diff 343139.EditedMay 5 2021, 12:06 PM

Rebased patch to get latest updates to parent patch.

llvm/lib/Target/ARM/ARMISelLowering.cpp
11279

Im a bit unclear on this, so I've still left this as UpdateLiveIns=true in the latest patchset.

Inside splitAt(), I can see this relevant piece of code where:

  • LiveRegs.addLiveOuts() will add the live outs of the original block.
  • The for-loop will add the liveins for any physical register uses by the instructions in the newly split block.
LivePhysRegs LiveRegs;
if (UpdateLiveIns) {
    ...
    LiveRegs.addLiveOuts(*this);
    for (auto I = rbegin(), E = Prev.getReverse(); I != E; ++I)
      LiveRegs.stepBackward(*I);
}

...

  if (UpdateLiveIns)
    addLiveIns(*SplitBB, LiveRegs);

The condition uses I != E (where E is the memcpy/set pseudo) ... so I suppose any liveness information
from it is not added ? and in case it gets reverted and clobbers CPSR in a later pass, shouldn't that take care of updating the liveness information of successive blocks ?

dmgreen accepted this revision.May 6 2021, 3:12 AM

OK. Sounds good. This looks good to me. Lets get this in, it's off by default and we can make adjustments as needed in-tree whilst getting some extra testing.

LGTM. Thanks

This revision is now accepted and ready to land.May 6 2021, 3:12 AM
This revision was automatically updated to reflect the committed changes.