Page MenuHomePhabricator

[RFC] Intrinsics for Hardware Loops
Needs ReviewPublic

Authored by samparker on May 20 2019, 3:29 AM.

Details

Summary

Arm have recently announced the v8.1-M architecture specification for
our next generation microcontrollers. The architecture includes
vector extensions (MVE) and support for low-overhead branches (LoB),
which can be thought of a style of hardware loop. Hardware loops
aren't new to LLVM, other backends (at least Hexagon and PPC that I
know of) also include support. These implementations insert the loop
controlling instructions at the MachineInstr level and I'd like to
propose that we add intrinsics to support this notion at the IR
level; primarily to be able to use scalar evolution to understand the
loops instead of having to implement a machine-level analysis for
each target.

The attached prototype implementation contains intrinsics that are
currently Arm specific, but I hope they're general enough to be used
by all targets. The Arm v8.1-m architecture supports do-while and
while loops, but for conciseness, here, I'd like to just focus on
while loops. There's two parts to this RFC: (1) the intrinsics
and (2) a prototype implementation in the Arm backend to enable
tail-predicated machine loops.

  1. LLVM IR Intrinsics

In the following definitions, I use the term 'element' to describe
the work performed by an IR loop that has not been vectorized or
unrolled by the compiler. This should be equivalent to the loop at
the source level.

void @llvm.arm.set.loop.iterations(i32)

  • Takes as a single operand, the number of iterations to be executed.

i32 @llvm.arm.set.loop.elements(i32, i32)

  • Takes two operands:
    • The total number of elements to be processed by the loop.
    • The maximum number of elements processed in one iteration of the IR loop body.
  • Returns the number of iterations to be executed.

<X x i1> @llvm.arm.get.active.mask.X(i32)

  • Takes as an operand, the number of elements that still need processing.
  • Where 'X' denotes the vectorization factor, returns an array of i1 indicating which vector lanes are active for the current loop iteration.

i32 @llvm.arm.loop.end(i32, i32)

  • Takes two operands:
    • The number of elements that still need processing.
    • The maximum number of elements processed in one iteration of the IR loop body.

The following gives an illustration of their intended usage:

entry:

%0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)
%1 = icmp ne i32 %0, 0
br i1 %1, label %vector.ph, label %for.loopexit

vector.ph:

br label %vector.body

vector.body:

%elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]
%active = call <4 x i1> @llvm.arm.get.active.mask(i32 %elts, i32 4)
%load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)
tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x i1> %active)
%elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)
%cmp = icmp sgt i32 %elts.rem, 0
br i1 %cmp, label %vector.body, label %for.loopexit

for.loopexit:

ret void

As the example shows, control-flow is still ultimately performed
through the icmp and br pair. There's nothing connecting the
intrinsics to a given loop or any requirement that a set.loop.* call
needs to be paired with a loop.end call.

  1. Low-overhead loops in the Arm backend

Disclaimer: The prototype is barebones and reuses parts of NEON and
I'm currently targeting the Cortex-A72 which does not support this
feature! opt and llc build and the provided test case doesn't cause a
crash...

The low-overhead branch extension can be combined with MVE to
generate vectorized loops in which the epilogue is executed within
the predicated vector body. The proposal is for this to be supported
through a series of pass:

  1. IR LoopPass to identify suitable loops and insert the intrinsics proposed above.
  2. DAGToDAG ISel which makes the intrinsics, almost 1-1, to a pseduo instruction.
  3. A final MachineFunctionPass to expand the pseudo instructions.

To help / enable the lowering of of an i1 vector, the VPR register has
been added. This is a status register that contains the P0 predicate
and is also used to model the implicit predicates of tail-predicated
loops.

There are two main reasons why pseudo instructions are used instead
of generating MIs directly during ISel:

  1. They gives us a chance of later inspecting the whole loop and confirm that it's a good idea to generate such a loop. This is trivial for scalar loops, but not really applicable for tail-predicated loops.
  2. It allows us to separate the decrementing of the loop counter with the instruction that branches back, which should help us recover if LR gets spilt between these two pseudo ops.

For Armv8.1-M, the while.setup intrinsic is used to generate the wls
and wlstp instructions, while loop.end generates the le and letp
instructions. The active.mask can just be removed because the lane
predication is handled implicitly.

I'm not sure of the vectorizers limitations of generating vector
instructions that operate across lanes, such as reductions, when
generating a predicated loop but this needs to be considered.

Diff Detail

Event Timeline

samparker created this revision.May 20 2019, 3:29 AM

Hi Sam, many thanks for the detailed RFC and prototype!

Of course I need some more time to digest this, but just a first nitpick of something I noticed:

To help / enable the lowering of of an i1 vector, the VPR register has been added. This is a status register that contains the P0 predicate and is also used to model the implicit predicates of tail-predicated loops.

Loop tail predication and VPT block predication use different mechanism, architecturally. The former uses FPSCR.LTPSIZE, and the latter VPR, right? But I don't think it matters or changes anything for the rest of your story.

Loop tail predication and VPT block predication use different mechanism, architecturally. The former uses FPSCR.LTPSIZE, and the latter VPR, right? But I don't think it matters or changes anything for the rest of your story.

Indeed, the register is only used for code generation here. I expect that we will need separate registers in the compiler for each in the final implementation as both architecture registers are used at runtime if there's a VPT block within a tail-predicated loop.

samparker updated this revision to Diff 200268.May 20 2019, 6:33 AM

Changed the loop.end icmp to use ne instead of sgt.

markus added a subscriber: markus.May 21 2019, 12:25 AM

This is interesting. Our (Ericsson's) out-of-tree target has hardware loops and we currently do a similar thing i.e.

  1. Use SCEV on IR to determine trip count and insert a hint intrinsic.
  2. Iselect the hint 1:1.
  3. Custom machine function pass picks it up and inserts a hardware loop instruction if sufficient conditions are satisfied.

So that said we would be happy to see generic support upstream and would likely convert to that once available.

Great! I'll make start on a target-independent framework, combining this and the PPC implementation.

psnobl added a subscriber: psnobl.May 21 2019, 7:18 AM

I've implemented this out of tree (for Graphcore), based loosely on the PPC implementation. IR pass based on SCEV inserts intrinsics, SDag patches them up a little, MIR pass picks appropriate instructions or falls back to a decrement and branch loop.

As the example shows, control-flow is still ultimately performed through the icmp and br pair.

This is interesting. We've also gone with a pair of intrinsics in IR - one in the loop preheader that takes an integer for trip count (backedge + 1), one in the body which returns an i1 that goes to brcond. Ideally I'd have liked to use an intrinsic to represent the control flow. An opaque intrinsic returning a boolean for brcond is approximately the same. An integer is threaded between the two in order to keep a GP register live until the back end where the lowering to ISA may need it, but that's probably unique to our arch.

Folding the icmp behaviour into the intrinsic instead stands a reasonable chance of making the induction variable dead (well, effectively hidden in the intrinsics) with subsequent IR simplification. Is the advantage to keeping it explicit in the interaction with other loop passes?

In SDag for our target, each IR intrinsic turns into two pseudo instructions. One representing the branch, one representing the arithmetic. It seems you've continued with two intrinsics, using a specific register and side effects to represent the combination. For example, the 'decrement arithmetic' pseudo turns into either a subtract one or into no code, depending on the instructions chosen. Keeping it separate from the terminator seems to help position it in a reasonable place in the BB for the decrement. I'm not sure which approach is better.

Would your implementation make sense without the masked load/store support? I'm wondering if that's a reasonable place to split the patch.

Thanks for the diff!

Hi Jon,

I used a (br (icmp (intrinsic))) combination just because we don't have native i1 support in the Arm backend, I've been lazy to get a quick prototype together. I want these intrinsics to return an i1 to be used directly the brcond. We also want to keep a GP live too (our loop counter lives in the link register) so it sounds very similar to you.

I've focused on supporting masked load/stores because it is the more involved side of this transform so I wanted a proof-of-concept. I'm currently converting the PPC pass to a target independent pass and that appears to handle the 'normal' loops fine. I'll try to post that up tomorrow and if people are reasonably happy with the approach, I'll break it up again and leave the predicate support until last.

Thanks for taking a look.

samparker updated this revision to Diff 200890.EditedMay 23 2019, 2:15 AM
samparker added a reviewer: hfinkel.

Introduced a target independent pass to insert intrinsics to generate hardware loops, which is based upon PPCCTRLoops. A hook has been added to TTI to decide what loops should be converted:

bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                                                      AssumptionCache &AC,
                                                      TargetLibraryInfo *LibInfo,
                                                      HardwareLoopInfo &HWLoopInfo);

HardwareLoopInfo is introduced to allow the backend to describe the properties of the loop:

struct HardwareLoopInfo {
      HardwareLoopInfo(Loop *L) : L(L) { }
      Loop *L                 = nullptr;
      BasicBlock *ExitBlock   = nullptr;
      BranchInst *ExitBranch  = nullptr;
      const SCEV *ExitCount   = nullptr;
      Instruction *Predicate  = nullptr; // Value controlling masked ops.
      IntegerType *CountType  = nullptr;
      bool PerformTest        = false;   // Can guard loop entry.
      bool IsNestingLegal     = false;   // Can HW loops be nested.
      bool InsertPHICounter   = false;   // Keep the loop counter in reg?
      unsigned NumElements    = 1;       // Max number of elements
                                         // processed in an iteration.
};

The pass can insert four different intrinsics for setting up a loop:

  • int_set_loop_iterations: Takes an integer trip count.
  • int_test_set_loop_iterations : Takes an integer trip count and tests whether the loop should be entered.
  • int_set_loop_elements : Takes two integers, (1) the total elements to be processed by the loop and (2) the maximum number of elements processed in each iteration.
  • int_test_set_loop_elements : Same as above but also tests whether the loop should be entered.

PowerPC codegen tests are still passing and the hacks in the arm backend allow by previous example to still work.

My plan now is:

  • Introduce the target independent with just the features (set_loop_iterations and loop_dec) to enable PowerPC and remove PPCTRLoops.
  • Add Arm support for the previous two intrinsics.
  • Introduce the testing form of set_loop_iterations with Arm support.
  • Introduce the 'element' versions of the intrinsics with Arm support.

The Arm support will be dependent on getting the initial architecture support upstream.

jsji added a subscriber: shchenz.May 23 2019, 7:14 AM

Would it be an idea if we start splitting this up now? Because it looks like there is a lot of support for this, there's consensus on the direction (there are only minor implementation differences). We've got at least 3 patches here: the target independent hwloop pass, the changes to the PPC backend, the changes to the ARM backend, and perhaps a 4th patch is the intrinsics. That's a lot of code, and then we can start reviewing things separately.

dmitry added a subscriber: dmitry.May 27 2019, 11:09 AM

Created the initial patch to convert the PowerPC pass into something generic: https://reviews.llvm.org/D62604

Patch to enable Arm code generation for do-while loops: https://reviews.llvm.org/D63476

Introduce an intrinsic, and generic support, so that we can set the loop counter and also test that it is not zero: https://reviews.llvm.org/D63809