This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Transforming memcpy to Tail predicated Loop
ClosedPublic

Authored by malharJ on Apr 1 2021, 6:02 AM.

Details

Summary

This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).

From an implementation point of view, the patch

  • adds an ARM specific SDAG Node (to which the llvm.memcpy intrinsic is lowered to, during first phase of ISel)
  • adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter, on matching the above node.
  • Adds a custom inserter function that expands the pseudo instruction into MIR suitable to be (by later passes) into a WLSTP loop.

Diff Detail

Event Timeline

malharJ created this revision.Apr 1 2021, 6:02 AM
malharJ requested review of this revision.Apr 1 2021, 6:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2021, 6:02 AM
lebedev.ri retitled this revision from Transforming memcpy to Tail predicated Loop to [ARM] Transforming memcpy to Tail predicated Loop.Apr 1 2021, 6:07 AM

I know you've worked on this for a while and investigated different strategies, but I think we also need to argue here why we would like to emit a memcpy loop instead of e.g. having optimised versions in the clib. In other words, is this the best we can do for all different alignments, sizes, etc.?

malharJ updated this revision to Diff 334684.Apr 1 2021, 7:26 AM

Added some comments to better illustrate transform.
Also renamed some variables to maintain consistency.

dmgreen added inline comments.Apr 1 2021, 8:18 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
11078

Can you look into all these clang-tidy errors. LLVM usually uses CamelCase for variable names, and the MBB_ looks a little odd. They should start with a capital and I would drop the "t2", that's not adding much.

llvm/lib/Target/ARM/ARMInstrMVE.td
6873

Can you improve this formatting. If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.

llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21

Why does this not use r3 directly?

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
36

What is this generating r3 for? I thought those should be removed.

56

Why is this using printf? It looks like an execution test, not a unit test. Is it testing anything specifically? If so it can probably use any call, not a variadic version of printf.

64

Remove hidden and local_unnamed_addr #0

malharJ updated this revision to Diff 335218.EditedApr 5 2021, 1:38 AM
malharJ marked 5 inline comments as done.
  1. Addressed comments (review comments + clang-tidy and clang-format fixes)
  1. Updated transform to not generate preHeader block due to issues with phi-node-elimination pass placing copy/movs in the generated preHeader.

Details provided in comment below.

malharJ added inline comments.Apr 5 2021, 1:40 AM
llvm/lib/Target/ARM/ARMInstrMVE.td
6873

Had clang formatted it but yeah doesnt look good.
Updated now.

llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21

This seems to be an issue with generating a preHeader during the transform ..

The phi-node-elimination pass is lowering the phi instructions (in the TP loopBody) as COPY operations (into the PreHeader).
In this instance, the copy/mov can be seen below on line 32:

mov r7, r3

I've fixed this issue as of now by not generating an extra preHeader during the transform ..
so the mov ends up above the t2WhileLoopStartLR and overall it seems to work.

Please see my comment about the latest changes for more details on this.

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
36

Good spot,

I think this is because DCE was not happening for the instructions calculating iterationCount.

I had a quick look at ARMLowOverheadLoops::IterationCountDCE( ), and it seems
that the expectation from the generated MIR is:

//   $lr = big-itercount-expression
//   ..
//   $lr = t2DoLoopStart/t2WhileLoopStartLR renamable $lr
//   vector.body:

I've updated the final expression (in the generated MIR) calculating iteration count to now return
the result in LR (earlier it was returning in one of the rGPRs) and these are now getting removed.

56

So the intent of the test was just to check whether code surrounding memcpy call site is properly transformed. I simply used printf to prevent the code from getting optimized away (seems like a poor way now that I think about it).

I've removed this test now since the transformation involving nested loops (in memcall.ll) is already testing the mentioned intent.

malharJ added a comment.EditedApr 5 2021, 1:53 AM

So I've updated transform to not generate a preHeader block as there seems to be an issue
when generating a preHeader during the transform:

The issue:

The phi-node-elimination pass introduces COPY operations (for each PHI instruction in the TP loopBody) into the preHeader.

While most of them get removed by simple-register-coalescing pass, one copy in particular is not
getting removed. This is the one involving memcpy transfer size/vector element count. Regarding
why the register coalescing is unable to get rid of this particular copy/mov, I had a look at the
llc --debug output and it seems that it cant remove the mov/copy because the liveness range of
element count register intersects with liveness range of the target of the copy/mov.

An example of the generated (incorrect) assembly is shown below:

Relevant MIR:

TP Entry
         ...
	lr = t2WhileLoopStartLR r4 (r4 may be holding something other than element count)

TP preHeader
	...
	mov r4, r2 (assume r2 holds element count)
	...
TP body
	...
	VCTP r4
	...

Existing logic:

So this value (r4 above) feeds into the loopBody PHI nodes and then the VCTP receives it (which is fine).
But when the ARMLowOverHeadsLoop pass tries to use element count operand of VCTP to feed back to t2WhileLoopStartLR,
it is providing r4 (which is incorrect because the mov is happening after the t2WhileLoopStartLR).

So I tried to see if I could fix this by looking into LowOverheadLoop::ValidateTailPredicate(),
as it defines the "TPNumElements" variable. There is some logic there that handles the case for
local redefinitions of the elementCount physical register, by moving it forward/backward using ReachingDefAnalysis.
But in this instance, we have a redefinition (the mov) in a different BasicBlock so that code doesn't seem to fix this.


I'm not entirely certain if it's acceptable to not generate the preHeader, but unless there is a reasonably
simple fix for the above issue, I can't see another way.

malharJ updated this revision to Diff 335222.Apr 5 2021, 2:51 AM

Fixed some more clang-format errors.

malharJ edited the summary of this revision. (Show Details)Apr 7 2021, 6:40 PM

I'm a little worried that WLSTP is going to cause problems, with it not used anywhere else. Lets at least add an option for disabling it needed.

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗(On Diff #335222)

Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it, it must have been lost in a refactoring).

llvm/lib/Target/ARM/ARMISelLowering.cpp
11282

When will this happen?

11301

-> "for a more natural layout"?

I think there may be benefits from getting the order roughly correct at this stage, if we are relying on WLS branches. They can be fixed up later, but if we get them more correct at this point, that can only help.

llvm/lib/Target/ARM/ARMISelLowering.h
336

Don't format any of this - it's unrelated.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
141

Can we add an option that turns this inline memcpy on/off. If the option is true, we always use the MEMCPYLOOP, if it's false we never do, and if it's unset we use this default logic.

Also consider pulling the if logic into a lambda for readability.

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10

Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.

llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
137

Some of this can be removed, to help keep the test smaller.

malharJ updated this revision to Diff 336495.Apr 9 2021, 9:23 AM
malharJ marked 4 inline comments as done.

addressed some of the review comments:

  • added a cli option for generation of TP loop for memcpy
  • simplified the mir test

I'm a little worried that WLSTP is going to cause problems ...

Would it better to use DLSTP in that case ? or perhaps a command line option
for choosing between DLSTP/WLSTP implementations ?

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗(On Diff #335222)

Do you mean just this single line update as a separate review ?

llvm/lib/Target/ARM/ARMISelLowering.cpp
11282

This happens if MVE_MEMCPYLOOPINST is the only instruction in the block.
splitAt() returns the same block if there is nothing after the instruction at which the split is done..

This happens when for loops are implicitly converted to memcpys, the memcpy call
ends up being the only instruction in the preheader.

There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir

llvm/lib/Target/ARM/ARMISelLowering.h
336

I had to fix it because patch was failing on clang-format error.

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10

Do you mean something like:

define void @test(i8* noalias %X, i8* noalias %Y, i64 %n){
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
    ret void
}

I get an error when I try to generate the assembly. Since i64 is illegal,
what is the expectation here ?

As a side note, if I generate the IR from C code, the IR always truncates the memcpy size variable to a i32
before calling llvm.memcpy( )

dmgreen added inline comments.Apr 12 2021, 6:28 AM
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗(On Diff #335222)

Yeah... it should preferably be a separate patch. Do you have a test case? Some reason why you changed it?

llvm/lib/Target/ARM/ARMISelLowering.cpp
11282

OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.

llvm/lib/Target/ARM/ARMISelLowering.h
336

That's fine. We can ignore the precommit bot where it's more noisy than helpful.

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10

Yeah sure. I think they can still come up, from other places creating memcpy calls.

You can probably use DAG.getZExtOrTrunc(Size, MVT::i32), instead of using the size directly.

malharJ updated this revision to Diff 337096.Apr 13 2021, 3:56 AM
malharJ marked 6 inline comments as done.

Addressed remaining review comments:

  • Separated a change into it's own patch and added as dependency
  • minor formatting updates
  • added a testcase with size of type other than i32
malharJ marked an inline comment as done.Apr 13 2021, 3:57 AM
malharJ added inline comments.
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗(On Diff #335222)

Ok, I've now created a separate patch for this: https://reviews.llvm.org/D100376

dmgreen added inline comments.Apr 14 2021, 1:22 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
11257

Great comment by the way. Is it possible to make the loop MBB look like a bit like a loop? To show there is a backedge too.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
140

[=] ->[&] is more standard, even if it doesn't make a lot of difference here.

142

Probably better as:

if (DAG.getMachineFunction().getFunction().hasOptNone())
  return false;
if (!ConstantSize && (Alignment >= Align(4))
  return true;
if (...)
 ...

The EnableMemcpyTPLoop logic could be in here too, as it's just returning true/false at the right time.
What do we do for -Oz and -Os?

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
2

Shouldn't have -O1 or cpu, use the -mtriple from other similar tests. The test can be called llvm/test/CodeGen/Thumb2/mve-tp-loop.ll.

malharJ updated this revision to Diff 337669.EditedApr 15 2021, 2:06 AM
malharJ marked 4 inline comments as done.

Addressed review comments:

  • renamed test files
  • disabled inline memcpy for optimize size cases (-Os, -Oz) and added tests for the same
  • also added tests for constant size inputs to ensure the threshold values are tested as well.
  • minor formatting changes
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
142

Ok.

And I guess it would be best to disable in the case of -Os/-Oz in case
there are multiple memcpys in the source. I've made the update and added tests as well.

dmgreen added inline comments.Apr 15 2021, 2:44 AM
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
142

func -> Func. Or maybe just F, which is quite common in LLVM.

149
if (EnableMemcpyTPLoop == cl::BOU_FALSE)
  return false;

Is probably better, if it works as I expect. That keeps the indenting down, and the last if currently isn't in the block it looks like it should be.

Oh, and move EnableMemcpyTPLoop above the OptSIze/OptNone, in case we want to try and force it. (Even if OptNone doesn't work, using that combo is unlikely to be useful at any rate.)

154

Add a return false at the end?

malharJ updated this revision to Diff 337684.Apr 15 2021, 3:08 AM
malharJ marked 3 inline comments as done.

Addressed review comments:

  • moved cli option (when set) to be of higher priority than optNone/optSize
  • minor formatting updates.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
149

Ok, my bad there with the braces.

I've moved the cases when the cli option is set to be of higher priority than the optNone/optSize cases ...
but the unset case is of Lower priority than (the optNone/optSize) since the user is no longer passing the cli option.
Hopefully that sounds sensible.

154

yep, had missed that out.

malharJ updated this revision to Diff 338345.EditedApr 17 2021, 3:49 PM

Rebased patch and removed the dependency as it has been closed.

dmgreen accepted this revision.Apr 19 2021, 1:23 AM

Yeah thanks, but this is for a different architecture. On M class we have access to MVE tail predicated loops that can be much more efficient for emitting inline memcpys. A-Class Arm with Neon will be very different.

This looks good to me now, with a couple of extra nits.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1817

Remove newline.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.h
19

Is this needed here? Can it be in the cpp file?

This revision is now accepted and ready to land.Apr 19 2021, 1:23 AM
malharJ updated this revision to Diff 340370.Apr 25 2021, 10:34 AM
malharJ marked 2 inline comments as done.
malharJ edited the summary of this revision. (Show Details)

Minor formatting updates.

Thanks. Can you rebase and make sure the patch is clang-formatted?

malharJ updated this revision to Diff 340828.Apr 27 2021, 7:04 AM

Rebased patch + minor formatting updates.

malharJ updated this revision to Diff 342948.EditedMay 4 2021, 10:06 PM

Fix for bug spotted by dmgreen (thank you):

Added an update to ensure that the block containing memcpy pseudo is always
split using splitAt().

An example case where this is important is when updating
phi instructions in successive blocks, which is taken care of by splitAt()
which calls transferSuccessorsAndUpdatePHIs() internally.
A test has been added for the same.

malharJ added inline comments.May 4 2021, 10:11 PM
llvm/test/CodeGen/Thumb2/mve-tp-loop.ll
241–250 ↗(On Diff #342948)

Not entirely sure why this isn't a TP loop, might need to check ArmLOL pass as to why it's being reverted..

Thanks. It looks like the arm low overhead loop pass doesn't like that two loops have the same preheader. Which makes sense, I don't like that either.

What do you think about committing this with the flag off for the time being and flipping the switch when we have sorted out some of the problems this is running into? memset especially seems to come up in a lot of cases, and can run into problem with so many low overhead loops together.

malharJ updated this revision to Diff 343066.EditedMay 5 2021, 8:59 AM

Changed cli option for conversion of memcpy to TP loop to be disabled by default.
The disable may be temporary, and will be removed after some more testing.

A custom enum replaces the cl::boolOrDefault to implement the required functionality.

dmgreen accepted this revision.May 5 2021, 12:37 PM

Thanks. LGTM

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
143

Perhaps use == TPLoop::ForceDisable to make it clear.

This revision was landed with ongoing or failed builds.May 6 2021, 1:39 AM
This revision was automatically updated to reflect the committed changes.

Thanks a lot for the review !

malharJ reopened this revision.May 6 2021, 4:58 AM
This revision is now accepted and ready to land.May 6 2021, 4:58 AM
malharJ updated this revision to Diff 343364.EditedMay 6 2021, 5:06 AM

Fix for MachineVerifier error during Buildbot failure
https://lab.llvm.org/buildbot/#/builders/16/builds/10462

The failure is happening during testing because the NoPHIs property is being
set to true by MIRParserImpl::computeFunctionProperties( ) as there are No phis (prior to transformation),
but during the transform phis are generated.
This results in an error during MachineVerifier, since the function is labelled
with NoPHIs=true while there are phi insructions in the code.

This fix resets the property to false during the transform.

Ah yes. Sorry I didn't suggest adding that to the tests - it can be useful.

Setting NoPHIs seems a bit odd. It's a side effect of the mir test having no PHI's as it's loaded but them being added here. I don't have a better suggestion for fixing it though, other than adding existing PHI's to the mir test which needlessly complicates it.

This sounds like a good fix to me.

This revision was landed with ongoing or failed builds.May 6 2021, 3:26 PM
This revision was automatically updated to reflect the committed changes.