This is an archive of the discontinued LLVM Phabricator instance.

[ARM] MVE Tail Predication
ClosedPublic

Authored by samparker on Aug 7 2019, 8:20 AM.

Details

Summary

The MVE and LOB extensions of Armv8.1m can be combined to enable 'tail predication' which removes the need for a scalar remainder loop after vectorization. Lane predication is performed implicitly via a system register. The effects of predication is described in Section B5.6.3 of the Armv8.1-m Arch Reference Manual, the key points being:

  • For vector operations that perform reduction across the vector and produce a scalar result, whether the value is accumulated or not.
  • For non-load instructions, the predicate flags determine if the destination register byte is updated with the new value or if the previous value is preserved.
  • For vector store instructions, whether the store occurs or not.
  • For vector load instructions, whether the value that is loaded or whether zeros are written to that element of the destination register.

This patch implements a pass that takes a hardware loop, containing masked vector instructions, and converts it something that resembles an MVE tail predicated loop. Currently, if we had code generation, we'd generate a loop in which the VCTP would generate the predicate and VPST would then setup the value of VPR.PO. The loads and stores would be placed in VPT blocks so this is not tail predication, but normal VPT predication with the predicate based upon a element counting induction variable. Further work needs to be done to finally produce a true tail predicated loop.

Because only the loads and stores are predicated, in both the LLVM IR and MIR level, we will restrict support to only lane-wise operations (no horizontal reductions). We will perform a final check on MIR during loop finalisation too.

Another restriction, specific to MVE, is that all the vector instructions need operate on the same number of elements. This is because predication is performed at the byte level and this is set on entry to the loop, or by the VCTP instead.

Diff Detail

Event Timeline

samparker created this revision.Aug 7 2019, 8:20 AM

Hi Sam, nice one!

A first scan from my side, just some questions and nits.

Further work needs to be done to finally produce a true tail predicated loop.

Can you elaborate a little bit on this? I guess you mean instruction selection patterns to finally produce code? Can this work be committed without that being ready? I guess the question rephrased is: can you outline the plan?

You nicely state assumptions in the description, e.g.:

Because only the loads and stores are predicated, in both the LLVM IR and MIR level, we will restrict support to only lane-wise operations ..
..
Another restriction, specific to MVE, is that all the vector instructions need operate on the same number of elements.

I think it would be good to add that to the code too as comments.

lib/Target/ARM/Thumb2TailPredication.cpp
1

nit: MVETailPredication.cpp -> Thumb2TailPredication.cpp?

19

nit: typo "can generate as"?

292

Just wondering if this utility function is interesting and generic enough to be moved to e.g. LoopInfo.

303

nit: magic constant?

328

nit: newline

369

nit: this nested-if looks a bit funny, perhaps just getting rid of the brackets helps

First up, patch context! Everyone seems to be forgetting recently :)

Why does the llvm_arm_vctp32 not return a <4xi1> directly? I didn't understand the "from a software and hardware perspective, is a 16-bit scalar", or at least why that would matter in IR. Would it not just be represented as a vector of i1's? As it already is for cmps/selects for example. What does having to use the separate conversion help with?

include/llvm/IR/IntrinsicsARM.td
785

This, at least in isel, is called a PREDICATE_CAST. (It used to be called a REINTERPRET_CAST, but I thought that name was overly broad). I think from IR a better name might be int_arm_mve_vmsr / int_arm_mve_vmrs, to represent that it is converting from a predicate to the scalar, as the instruction does.

samparker edited the summary of this revision. (Show Details)Aug 8 2019, 12:16 AM
samparker marked an inline comment as done.Aug 8 2019, 12:42 AM

Why does the llvm_arm_vctp32 not return a <4xi1> directly?

The vctp family are defined like that because the ACLE specifies that they return a mve_pred16_t and I'm assuming this is a scalar - but I can't find a definition! I think that all the user facing predicate generators will produce a scalar and we will need to do the conversion to make it nice and LLVMy.

Can you elaborate a little bit on this? I guess you mean instruction selection patterns to finally produce code? Can this work be committed without that being ready? I guess the question rephrased is: can you outline the plan?

The overall plan is as follows:

  • Enable vectorizer.
  • Teach the vectorizer to produce a predicated loop, using a pragma and a target profitability hook. Think we also need to specify that we support masked load/store intrinsics too.
  • This pass converts some vector icmps to vctp intrinsics.
  • A bit of isel for vctp and final bits for other MVE instructions.
  • The loop is finalised in the LowOverheadLoop pass where we check for validity and remove the VCTP and VPST instructions, as well as their VPT blocks.
  • Beer.
include/llvm/IR/IntrinsicsARM.td
785

Sounds good.

samparker updated this revision to Diff 214078.Aug 8 2019, 12:43 AM

Updated with context.

Why does the llvm_arm_vctp32 not return a <4xi1> directly?

The vctp family are defined like that because the ACLE specifies that they return a mve_pred16_t and I'm assuming this is a scalar - but I can't find a definition! I think that all the user facing predicate generators will produce a scalar and we will need to do the conversion to make it nice and LLVMy.

Sure, the ACLE intrinsic needs to return an i16, but does that mean the IR intrinsic needs to? It could be expanded to two instructions, llvm_arm_vctp32 and llvm_arm_vmrs, with the i16 coming from the vmrs. This kind of thing sounds like it would be useful already for things like masked loads. i.e I'm saying can we invert where the conversion happens?

So if we started with acle:

mve_pred16_t pred = vctp8q(i)
l = vldrbq_z_s8(a, pred)

It would get expanded to become:

// vctp8q
<4 x i1> t1 = llvm.arm.vctp(i)
i16 pred = llvm.arm.vmrs(t1)
// vldrbq_z_s8
<4 x i1> t2 = llvm.arm.vmsr(pred)
l = llvm.masked.load(a, t2)

And you could use instcombine to fold out the converts (vmsr(vmrs(a)) == a), into

t1 = llvm.arm.vctp(i)
llvm.masked.load(a, t1)

It would work even better for compares that already have predicate that llvm knows about. They whole thing would just become llvm IR and we can let it optimise away. This is getting a bit much into intrinsic design, though, with isn't this patches problem!

Sam's suggestion to me for the ACLE intrinsics was that there should be an IR intrinsic that converts the i16 provided by the user into an <n x i1> for whatever n makes sense. In my unpushed (and unpolished) draft implementation there's also one that converts back again, which the ACLE intrinsics will need for the return value of vcmp. So it could be used here as well if that's useful.

The overall plan is as follows:

  • Enable vectorizer.
  • Teach the vectorizer to produce a predicated loop, using a pragma and a target profitability hook. Think we also need to specify that we support masked load/store intrinsics too.
  • This pass converts some vector icmps to vctp intrinsics.
  • A bit of isel for vctp and final bits for other MVE instructions.
  • The loop is finalised in the LowOverheadLoop pass where we check for validity and remove the VCTP and VPST instructions, as well as their VPT blocks.
  • Beer.

Thanks for this! That's quite a lot of moving parts!
Thinking out loud what the strategy could be so that we properly test the whole flow..... One would be to start with the ground work: enabling the vectorizer is flipping a switch, we can already produce a predicated loop with a pragma (target profitability hook is missing), and a bit of isel should also be easy. With the small(er) and easier groundwork in, we then have the 2 biggies remaining: this patch, and the LowOverheadLoop pass modifications. But when we have these 2 ready, we can test the flow and check if we haven't missed anything (that could change the design/flow).
Another option is to commit this once we're happy with it as it won't be enabled/triggering. Then we will have to see how it behaves when tail predication support is ready in LowOverheadLoop (and the other bits and pieces).

Sure, the ACLE intrinsic needs to return an i16, but does that mean the IR intrinsic needs to? It could be expanded to two instructions, llvm_arm_vctp32 and llvm_arm_vmrs, with the i16 coming from the vmrs. This kind of thing sounds like it would be useful already for things like masked loads. i.e I'm saying can we invert where the conversion happens?

It would work even better for compares that already have predicate that llvm knows about. They whole thing would just become llvm IR and we can let it optimise away.

I'm not following your suggestions here. I'm don't see how either one approach is better for the IR, we're just talking about two intrinsics that convert a scalar and vector, and I don't know why we'd need to get instcombine involved. I don't mind how these intrinsics end up getting implemented, as long as we have normal vectors in the end, and for this patch it makes sense for me to have vctp looking like the acle intrinsic.

Thinking out loud what the strategy could be so that we properly test the whole flow.....

I think it will be okay to land this when we can, it's disabled and the vectorizer won't be generating suitable input anyway. The finalisation can happen last and with no real rush - isel should still generate a valid loop, the vctp instructions will be more efficient that doing the arithmetic with vectors in the loop. So really, isel for masked load/stores and the vctp is the next most important bit because then we at least run llc. Then we can prod the vectorizer into generating masked loops and we can find all the bugs :)

You had me convinced until the last line. I think it's probably simpler for the vctp to produce a v4i1, as we don't need the convert at all in this patch (unless I'm missing something).

Essentially, the vctp can either look more like the acle intrinsic (produce a i16, makes the acle->IR simpler, but the IR->instruction needs to match on a convert(vctp)), or more like the VCTP instruction (produce a v4i1, makes the IR->Instruction simpler). I would go with the second option, but Simon is the expert on all things Intrinsics. Go with whatever he thinks is OK!

lib/Target/ARM/CMakeLists.txt
61

I would go with MVETailPredicationPass.cpp

SjoerdMeijer added a comment.EditedAug 12 2019, 11:37 PM
In D65884#1621209, @SjoerdMeijer wrote:

Thinking out loud what the strategy could be so that we properly test the whole flow.....

I think it will be okay to land this when we can, it's disabled and the vectorizer won't be generating suitable input anyway. The finalisation can happen last and with no real rush - isel should still generate a valid > loop, the vctp instructions will be more efficient that doing the arithmetic with vectors in the loop. So really, isel for masked load/stores and the vctp is the next most important bit because then we at least run llc. > Then we can prod the vectorizer into generating masked loops and we can find all the bugs :)

Ok, sounds like a good plan to me.

SjoerdMeijer added inline comments.Aug 28 2019, 5:39 AM
lib/Target/ARM/Thumb2TailPredication.cpp
96

nit: perhaps slightly more informative function name, something with "LoopIterations" in it?

135

"Wrong Subtarget" -> perhaps better/clearer to say that Subtarget does not support TP.

139

Perhaps better to move this message to the next block as it might not be really running at this point (if there's no preheader)?

146

Bit of a nit, but perhaps setting 'Setup' can be a function, so that we don't have the if (!Setup) check below twice, but simply a return when we found it.

153

nit: extra whitespace in "pre- preheader"

188

I am not sure, but I'm wondering if this function is misnamed? I think I would expect isTailPredicate to work on a Loop, not a Instruction/Value.

205

I need to read this function again, but it isn't clear to me yet why this is then a tail-predicated loop and not some other vectorized loop.

samparker marked 5 inline comments as done.Aug 28 2019, 7:54 AM
samparker added inline comments.
lib/Target/ARM/Thumb2TailPredication.cpp
188

TPCandidate has to be constructed with a loop and will only operate upon that loop. This method needs to answer whether V is the value that produces the tail predicate. The tail predicate being defined below in the comments. I will make it explicit that we're answering whether V == %pred.

205

It's only answering whether the given value, V, is equivalent to a vctp instruction. So we're not talking about MVE tail predication yet, we're just matching vctp. I expect that most of these loops will then be converted to a tail predicated form.

samparker marked 2 inline comments as done.Aug 29 2019, 2:06 AM
samparker updated this revision to Diff 217801.Aug 29 2019, 2:10 AM
samparker edited the summary of this revision. (Show Details)

Hopefully addressed all the comments. VCTP intrinsics now return a vector so I've removed the predication conversion intrinsics as I don't need them here.

SjoerdMeijer added inline comments.Aug 29 2019, 6:00 AM
lib/Target/ARM/MVETailPredication.cpp
364 ↗(On Diff #217801)

What we are doing here, is looking at generated code to get %Elems. But the vectorizer has generated this code, so somewhere this knowledge is present. The only question is, I guess, how easily accessible. Can we query SCEV or something else to avoid this pattern matching?

SjoerdMeijer added inline comments.Aug 29 2019, 7:57 AM
lib/Target/ARM/MVETailPredication.cpp
301 ↗(On Diff #217801)

A few nits I forgot to mention in my previous message:

Replace constant 128 with TTI.getRegisterBitWidth(true)?

And I don't think I understand the Lanes == 128 part.

samparker marked an inline comment as done.Aug 30 2019, 5:45 AM
samparker added inline comments.
lib/Target/ARM/MVETailPredication.cpp
364 ↗(On Diff #217801)

I've had a play with some other parts of SCEV, the helpers for delinearization, to see if it could answer the size of the array accessed by the loop. I can't seem to get this to work with the GEPs in the tests as they're not be represented by clean AddRecExprs. I presume it's not unreasonable for SCEV to not be that useful after vectorization..? It does looks like it could be used though, but I don't think I have the knowledge to do it safely! I'll add a TODO / FIXME.

samparker updated this revision to Diff 218073.Aug 30 2019, 5:53 AM

Added a couple of comments.

SjoerdMeijer added inline comments.Aug 30 2019, 5:56 AM
lib/Target/ARM/MVETailPredication.cpp
364 ↗(On Diff #217801)

Alright, fair enough.
Last question then, last crazy idea, could this information be attached to the loop as metadata?

samparker marked an inline comment as done.Sep 3 2019, 1:08 AM
samparker added inline comments.
lib/Target/ARM/MVETailPredication.cpp
364 ↗(On Diff #217801)

That doesn't seem to be the way metadata is used. It would require it to be kept up-to-date by any transform, which is unreasonable. Loop properties are provided by analysis passes, so this could live in one of them... but it seems a bit niche and should probably be done in a better way :)

SjoerdMeijer added inline comments.Sep 3 2019, 5:10 AM
lib/Target/ARM/MVETailPredication.cpp
364 ↗(On Diff #217801)

Okay, that's also fair enough.
I really cannot admit I am a big fan of this pattern matching here, which is a bit of understatement, but I guess it is what it is for now. I.e., your TODO captures it well for me. The pattern matching is extremely 'focused', and looking into SCEV and friends could be a project on itself...and so while that is considered, this is okay'ish.

I will now continue reviewing the remaining bits I haven't looked at.

If I am not mistaken, there is no test with a nested-loop. Would probably be good to test that too.

lib/Target/ARM/MVETailPredication.cpp
177 ↗(On Diff #218073)

For me, personally, creating a TPCandidate class doesn't add an abstraction that is really useful or clearer as literally the only thing it captures is 1 list, i.e. MaskedInsts, and we don't have multiple candidates and the lifetime of this object is extremely short. So, a simplification/clean-up in my opinion, is to get rid of the class and then we just have 3 local helper functions, and one variable that we pass thought. This is probably a nit, and/or a style preference.

301 ↗(On Diff #217801)

thanks for adding the comment, I still think you can use TTI.getRegisterBitWidth(true) here

samparker updated this revision to Diff 218654.Sep 4 2019, 5:18 AM

Cheers!

  • Now using TTI.
  • Removed TPCandidate.
  • Added some nested loop tests.
SjoerdMeijer accepted this revision.Sep 4 2019, 8:52 AM

Thanks, this now looks reasonable to me as an initial commit.
It is disabled by default, and we can now experiment with this and iterate on this if required.

This revision is now accepted and ready to land.Sep 4 2019, 8:52 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 6 2019, 1:23 AM

I think this might have caused a regression (my git-bisect is nearly complete):

FAIL: LLVM :: CodeGen/ARM/O3-pipeline.ll (24373 of 51236)
******************** TEST 'LLVM :: CodeGen/ARM/O3-pipeline.ll' FAILED ********************
Script:
--
: 'RUN: at line 1';   /tmp/_update_lc/t/bin/llc -mtriple=arm -O3 -debug-pass=Structure < /home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll -o /dev/null 2>&1 | grep -v "Verify generated machine code" | /tmp/_update_lc/t/bin/FileCheck /home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll
--
Exit Code: 1

Command Output (stderr):
--
/home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll:55:15: error: CHECK-NEXT: is not on the line after the previous match
; CHECK-NEXT: Safe Stack instrumentation pass
              ^
<stdin>:65:2: note: 'next' match was here
 Safe Stack instrumentation pass
 ^
<stdin>:61:25: note: previous match ended here
 Hardware Loop Insertion
                        ^
<stdin>:62:1: note: non-matching line after previous match is here
 Scalar Evolution Analysis
^

--

********************
Testing Time: 64.36s
********************
Failing Tests (1):
    LLVM :: CodeGen/ARM/O3-pipeline.ll

  Expected Passes    : 50558
  Expected Failures  : 168
  Unsupported Tests  : 509
  Unexpected Failures: 1