Page MenuHomePhabricator

[LoopVectorizer] Inloop vector reductions

Authored by dmgreen on Feb 24 2020, 11:17 AM.



Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128bit vectors, sign extend the inputs to i32, multiplying them together and sum the result into a 32bit general purpose register. So taking 16 i8's as inputs, they can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32bit registers together treated as a 64bit integer (even though MVE does not have a plain 64bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths, and to vectorize things we previously could not.

In order to do that we need a way to represent that the "reduction" operation, specified with a llvm.experimental.vector.reduce when vectorizing for Arm, occurs inside the loop not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe representing the reductions that the original chain of reduction operations is replaced by. Cost modelling is currently just done through a "prefersInloopReduction" TTI hook (which follows in a later patch).

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dmgreen marked 36 inline comments as done.May 20 2020, 4:09 PM
dmgreen edited the summary of this revision. (Show Details)

Round one. Lots of renaming, plenty of cleanup. I've tried to add Min/Max handling too, and added/cleaned up some more test. Hopefully addressed most of the comments.

dmgreen added inline comments.May 20 2020, 4:45 PM

Yeah, that direction makes it a lot simpler. Thanks.


I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).

MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are always not-slower-enough to warrant the inloop reduction (and have other advantages like handling higher type sizes and folding in more instructions.)

My point is that min/max, like some of the other fadd/mul/and/etc might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if it's not used straight away.


Right. We look from the phi down (now), so can't go through the 'and' we would need to detect the different bitwidth. That would leave 'and' reduction, which I think would not got through that codepath to create type promoted phis?

I've put an explicit check in for it to be sure, and added some testing for those cases.


This does look a little strange here on it's own. The followup patch to add the TTI hook makes it look like:

if (!PreferInloopReductions &&
      !TTI.preferInloopReduction(Opcode, Phi->getType(),

Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?

It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and clearer what is going on?


I've changed it to be stored there. It does mean multiple things are holding TTI. Let me know what you think.

47 ↗(On Diff #263027)

Ah. Old code no longer needed.


This testcase is new for this patch (although that wasn't obvious from how I presented it). I was just trying to show the diffs here and the option didn't exist before this patch, hence it wasn't already there.

Are you suggesting we should still have these same test for without -force-inloop-reductions? I think those should already be tested elsewhere, but I can add it if you think it's useful.


(unnecessary trounces stripped)?

The const case is going to look a little odd until there is a instcombine type fold for constant reductions. I'll see about adding that.

Otherwise it is now hopefully a bit leaner. Even if I have added a few extra tests.

Ayal added a comment.Jun 5 2020, 3:38 PM

The proposed RecurrenceDescriptor::getReductionOpChain() method identifies horizontal reductions, a subset of the vertical reductions identified by RecurrenceDescriptor::isReductionPHI(). Being two totally independent methods, it's unclear what the latter supports that the former does not. Would it be better to have isReductionPHI() also take care of recognizing horizontal reductions, and record them as in CastsToIgnore? See also TODO commented inline.


Perhaps the above "better way" would also help recognize and record horizontal reductions?


|| !Cur->hasNUses(ExpectedUses) ?

nit: can alternatively let getNextInstruction check its result and return only valid ones, e.g.:

bool RedOpIsCmp = (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp);
unsigned ExpectedUses = RedOpIsCmp ? 2 : 1;

auto getNextInstruction = [&](Instruction *Cur) {
  if (!Cur->hasNUses(ExpectedUses))
    return nullptr;
  auto *FirstUser = Cur->user_begin();
  if (!RedOpIsCmp)
    return FirstUser->getOpcode() == RedOp ? FirstUser : nullptr;
  // Handle cmp/select pair:
  auto *Sel = dyn_cast<SelectInst>(*FirstUser) ||
  if (SelectPatternResult::isMinOrMax(matchSelectPattern(Sel, LHS, RHS).Flavor))
    return Sel;
  return nullptr;

for (auto *Cur = getNextInstruction(Phi); Cur && Cur != LoopExitInstr; Cur = getNextInstruction(Cur))

... along with their associated chains of reduction operations,
in program order from top (PHI) to bottom?


Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?


'and' reduction may or may not be performed in a smaller type, iinm.


Check in-loop reduction along with VF == 1, instead of modifying VF, as in other places? E.g.,

bool ScalarPHI = (VF == 1) || Cost->isInLoopReduction(cast<PHINode>(PN));

Type *VecTy =
    ScalarPHI ? PN->getType() : VectorType::get(PN->getType(), VF);



The "else" clause will be empty in non-debug mode.
Can hoist the LLVM_DEBUG's, e.g.,

bool InLoop = !ReductionOperations.empty();
LLVM_DEBUG(dbgs() << "LV: Using " << (InLoop ? "inloop" : "out of loop") << " reduction for phi: " << *Phi << "\n");

Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().

(Better early-exit via "if (!CM.isInLoopReduction(Reduction.first)) continue;")


Add a comment that this is the compare operand of the select, whose recipe will also be discarded.


Consider outlining this part, unless it can be shortened.


Checking also if hasOuterloopReductions() is an independent optimization that should be done separately of this patch.


Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().


ChainOp can be simply set:

VPValue *ChainOp = Plan->getVPValue(Chain);

leaving only VecOp to figure out which operand index to use. Can preset the options before the loop:

RecurrenceDescriptor::RecurrenceKind Kind = RdxDesc.getRecurrenceKind();
bool InMinMax = (Kind == RecurrenceDescriptor::RK_IntegerMinMax || Kind == RecurrenceDescriptor::RK_FloatMinMax);
FirstOpId = IsMinMax ? 1 : 0;
SecondOpId = FirstOpId + 1;

and then do

auto *FirstOp = R->getOperand(FirstOpId);
auto *SecondOp =  R->getOperand(SecondOpId);
VPValue *VecOp = Plan->getVPValue(FirstOp == Chain ? SecondOp : FirstOp);

ok, understood; no need to retain non-existing tests.

dmgreen marked 15 inline comments as done.
dmgreen added inline comments.

Hmm. The reason I kept them separate was that this method is already pretty complex. I was trying to keep thing simpler. Adding the ability to detect a single chain of operations from Phi to LoopExitValue that can be used for horizontal reductions looks.. difficult. And error prone. :) If you think it's worth it then I can certainly give it a go! I like the separation of concerns in keeping them separate though.

The extra things that AddReductionVar will detect that getReductionOpChain will not are:

  • Phi/select predicated reductions like in if-conversion-reductions.ll and if-reduction.ll. These would need some form of predicated reduction intrinsic.
  • Narrow bitwidths. This one I could add.
  • Subs/FSubs are treated like Adds/FAdds for vertical reductions.

This is the loop exit instr, so can have as many uses as it likes I believe.


-Os sounds like a good plan. It will take some backend work to make it efficient enough first though. And predicated reductions?


The check for lookThroughAnd in AddReductionVar is guarded by isArithmeticRecurrenceKind, so at least currently it cannot be an And reduction.


I believe that InloopReductionChains would not iterate in a deterministic order, which is why I avoided it.

Perhaps that would not matter here? The reductions should be independent anyway. Seems safer to try and use deterministic ordering anyway if we can.


Ah sorry. I split it off but apparently didn't upstream the other part yet. Now in D81415.

dmgreen updated this revision to Diff 269300.Jun 8 2020, 11:38 AM
dmgreen marked 2 inline comments as done.


bmahjour removed a subscriber: bmahjour.Jun 22 2020, 6:47 AM
Ayal added inline comments.Jul 6 2020, 1:15 PM

Would be good to see if above TODO can be addressed - providing the set of all instructions that take part in the reduction. This set could then be used for checking in-loop reductions. Hopefully this could help simplify both, and keep them in some sync. But could be done later, possibly with another TODO..


Is treating sub as an add reduction something in-loop reduction could support as a future extension?


Ahh, ok. (It should have ExpectedUses+1 users being in lcssa.)



Is Subs the only issue?.
Can check this earlier, before traversing the chain, although it is pushed back last, here.


Thanks. Worth updating the comment.


Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in tests below), so could presumably lead to code of smaller size? At-least for in-loop chains of a single link.

And predicated reductions?

These are yet to be handled in-loop, right?


Then better placed above right after defining Phi?


Agreed it would be better to use deterministic ordering. How about letting InloopReductionChains be a MapVector and iterate over
for (auto &Reduction : CM.getInloopReductions())?
The number of reductions is expected to be small, w/o removals.


This is the other potential use of for (auto &Reduction : CM.getInloopReductions()).

dmgreen updated this revision to Diff 275979.Jul 7 2020, 3:41 AM
dmgreen marked 12 inline comments as done.

Thanks for taking another look.


Hmm. I don't want to say never. A normal inloop reduction looks like:

p = PHI(0, a)
l = VLDR (..)
a = VADDVA(p, l)

Where the VADDV is an across-vector reductions, and the extra A means also add p. Reducing a sub would need to become:

p = PHI(0, a)
l = VLDR (..)
a = VADDV(l)
p = SUB(p, a)

With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than a out-of-loop reduction.

But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that would previously not be vectorized.. that may lead to a gain overall to overcome the extra cost of adding the sub to the loop. It will require some very careful costing I think. And maybe the ability to create multiple vplans and cost them against one another :)


I believe sub is the only issue, unless I am forgetting something.


And predicated reductions?

These are yet to be handled in-loop, right?

Yep. It will need a predicated reduction intrinsic. A vecreduce that takes a mask. That will allow us to tail-fold the reductions with trip counts that do not divide the vector factor, which will make them look a lot better under -Os. And nice in general I think once it all starts being tail predicated.

The backend work I was mentioning was that we need to more efficiently transform

x = min(vecreduce.min(z), y)


x = VMINV(y, z)

Where y is (confusingly) accumulated in the case (even though the instruction doesn't have an A suffix). We currently generate

x = min(VMINV(UINT_MAX, z), y)

Once that is sorted out then, yep, using these for Os sounds like a good plan.


I can put it somewhere sensible in this patch and move it in the next :)


MapVector sounds good. I've changed it to use that and tried to use that in a few more places. Let me know what you think.

Ayal added inline comments.Jul 7 2020, 12:33 PM

An original sub code, say, acc -= a[i], can be treated as acc += (-a[i]). This could be in-loop reduced by first negating a[i]'s, at LV's LLVM-IR level, presumably lowered later to something like

p = PHI(0, a)
l = VLDR (..)
s = VSUBV (zero, l)
a = VADDVA(p, s)

, right?


rephrase sentence so it parses


Ahh, this should actually be capitalized IsInLoopReductionPhi


Re: predicated reductions - could they be handled by replacing masked-off elements with Identity using a select prior to reduction? To be potentially folded later by suitable targets into a predicated reduction operation which they may support.
Somewhat akin to "passthru" values of masked loads.


Uses as MapVector look good to me, thanks. Can also retain isInLoopReduction(PHI).


recode >> record

dmgreen updated this revision to Diff 276198.Jul 7 2020, 2:07 PM
dmgreen marked 4 inline comments as done.

Reinstate isInLoopReduction and fixup some wording/typos/capitalisation.


Yep. We would have the option to trading a scalar instruction for a vector instruction + an extra register (to hold the 0, we only have 8 registers!)

Unfortunately both would be slower than in out-of-loop reduction unless we were vectorizing at a higher factor, though.


Oh. So select

s = select m, a, 0
v = vecreduce.add s

to a predicated vaddv?

Yeah, sounds interesting. I'll look into that as an alternative to predicated intrinsics. Nice suggestion.

Ayal added inline comments.Jul 7 2020, 3:05 PM

ok, so sub's can be handled in-loop, but doing so is expected to be more costly than out-of-loop, at-least if a horizontal add operation is to be used rather than a horizontal subtract; probably worth a comment.
If a reduction chain has only sub's, they could all sink - negating the sum once after the loop, using VADDVA inside. Doing so however will retain the middle block, i.e., w/o decreasing code size.

dmgreen updated this revision to Diff 276364.Jul 8 2020, 3:52 AM

Update a FIXME with sub info.

gilr added inline comments.Jul 8 2020, 6:42 AM

It seems that TTI is only used later for deciding whether to use a shuffle sequence or an intrinsic based on data available during planning. If so, then it would be best if the Planner calls TTI->useReductionIntrinsic() and records that boolean decision in the Recipe. This is also required in order to estimate in-loop reduction cost. This could be done separately.

dmgreen marked an inline comment as done.Jul 9 2020, 10:15 AM
dmgreen added inline comments.

Do you mean to change the interface to createTargetReduction, to take a bool instead? Yeah I think that sounds good. I'd prefer to do it as a separate review as it does involve changing the interface. I will put a patch together.

I was imagining that we would change the cost to use getArithmeticReductionCost, which hopefully handles the details of how the target lowers reductions. I haven't looked deeply into the details yet though. That is on the list of things to do.

Ayal accepted this revision.Sun, Jul 12, 7:35 AM

This looks good to me, thanks! with last couple of nits.


nit: "... that leads from the loop exit value back.." - chain is now found top-down.


nit: can record the recipe of Phi first, just to follow chain order.


nit: can assert CompareRecipe->getVPRecipeID()

This revision is now accepted and ready to land.Sun, Jul 12, 7:35 AM

Thanks. I will update the patch, but I will wait until at least after the branch before I commit it.

D83646 is an attempt at changing the interface to createTargetReduction, so that we don't need to store TTI in the reduction recipe.

dmgreen updated this revision to Diff 278689.Fri, Jul 17, 3:00 AM
dmgreen marked 2 inline comments as done.

Updates for those last nitpicks.

This revision was automatically updated to reflect the committed changes.

Thanks. Sorry about that, it appears clang doesn't like this code as much as gcc did and then my internet went out whilst I was trying to figure out what was wrong.

I will try to run some extra testing using different compilers.