This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Inloop vector reductions
ClosedPublic

Authored by dmgreen on Feb 24 2020, 11:17 AM.

Details

Summary

Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128-bit vectors, sign-extend the inputs to i32, multiply them together and sum the result into a 32-bit general purpose register. So taking 16 i8s as inputs, it can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32-bit registers treated together as a 64-bit integer (even though MVE does not have a plain 64-bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths and to vectorize things we previously could not.
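
For a concrete picture, the kind of scalar reduction loop this targets looks roughly like the following (an illustrative IR sketch, not taken from the patch; all names are made up):

  loop:
    %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
    %a = load i8, i8* %pa, align 1
    %b = load i8, i8* %pb, align 1
    %ea = sext i8 %a to i32
    %eb = sext i8 %b to i32
    %mul = mul nsw i32 %ea, %eb
    %acc.next = add nsw i32 %acc, %mul
    ...

VMLAVA.s8 can cover 16 such lanes per vector iteration while keeping the accumulator in a scalar register.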

In order to do that, we need a way to represent that the "reduction" operation, specified with an llvm.experimental.vector.reduce intrinsic when vectorizing for Arm, occurs inside the loop, not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe that represents the reduction and replaces the original chain of reduction operations. Cost modelling is currently just done through a "prefersInloopReduction" TTI hook (which follows in a later patch).
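
To make the shape concrete, here is a rough sketch of the two forms at the IR level (block and value names are invented, not taken from the patch's tests). The usual out-of-loop form keeps a vector accumulator phi and reduces once in the middle block:

  vector.body:
    %vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %vec.add, %vector.body ]
    %wide.load = load <4 x i32>, <4 x i32>* %ptr, align 4
    %vec.add = add <4 x i32> %vec.phi, %wide.load
    ...
  middle.block:
    %rdx = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %vec.add)

The in-loop form instead keeps a scalar accumulator phi and calls the reduction intrinsic every iteration:

  vector.body:
    %acc = phi i32 [ 0, %vector.ph ], [ %acc.next, %vector.body ]
    %wide.load = load <4 x i32>, <4 x i32>* %ptr, align 4
    %part = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %wide.load)
    %acc.next = add i32 %acc, %part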

Diff Detail

Event Timeline

dmgreen created this revision.Feb 24 2020, 11:17 AM
fhahn added a comment.Feb 28 2020, 1:42 PM

This patch does not work like that yet though, as I am unsure how that should really look, and it did not seem simple to create one vplan from another yet. The def-use relations are usually not in place, for example, and neither is the costing of vplans against one another, which makes sense.

That's correct. Work on modelling the def-use relationships exclusively in VPlan is currently underway, but it probably does not make sense to hold this patch until that transition is complete.

Some small high-level comments:

  • I think it's best to move the TTI hook into a separate change
  • Do you know how 'experimental' the reduction intrinsics still are? It would be good to move them out of experimental at some point, especially when making use of them in LV.

I'm currently traveling, but I'll try to take a closer look next week or the week after (please ping in case I forget ;))

Thanks for taking a look!

I noticed a comment in VPValue that says:

DESIGN PRINCIPLE: Access to the underlying IR must be strictly limited to
the front-end and back-end of VPlan so that the middle-end is as
independent as possible of the underlying IR. We grant access to the
underlying IR using friendship. In that way, we should be able to use VPlan
for multiple underlying IRs (Polly?) by providing a new VPlan front-end,
back-end and analysis information for the new IR.

That's the part that confused me about optimising vplans in-place. Does that mean that we have to replicate all the opcode/flags data into vplan too, not just the use-defs, so it knows what is an Add and what is a Mul (and whether they are nsw, etc.)? What would you say counts as "middle-end" for vplan? I would presume the cost modelling and creating vplans from one another, but I didn't follow vplan's early days very closely.

I think it's best to move the TTI hook into a separate change

Can do. I have some other changes that I was trying out too about adding a VPReductionPhi. Not sure it makes sense yet though.

Do you know how 'experimental' the reduction intrinsics still are? It would be good to move them out of experimental at some point, especially when making use of them in LV.

As far as I understand, on architectures that request it (AArch64 and Arm MVE now), we have been using the "experimental" reduction intrinsics since they were added, in the code generated after the vector body. There is a useReductionIntrinsic hook which the targets can override to prefer them over the IR equivalents.

I would like to add a mask to the reduction intrinsics. That will be important for getting tail-predicated inloop reductions to work. Whether this is done by adjusting the existing ones or adding new ones, I'm not sure. Obviously some architectures do not handle the mask, only handling an all-ones mask. After that is sorted out, I agree it would be nice to drop the "experimental".

dmgreen updated this revision to Diff 247848.Mar 3 2020, 4:06 AM

Taken out the TTI change and simplified the tests, just showing the diffs here now.

dmgreen updated this revision to Diff 263027.May 9 2020, 10:55 AM

Rebase and a bit of a cleanup. Now uses VPValues for the VPRecipe operands.

Ping. Anyone have any thoughts/comments about this one? Does it at least look like it's going in the right direction, or does anyone have some better suggestions?

Ayal added a comment.May 19 2020, 4:12 PM

Ping. Anyone have any thoughts/comments about this one? Does it at least look like it's going in the right direction, or does anyone have some better suggestions?

Yes, this direction does look right!
Would be good to clarify that the unrolled Parts are still reduced out of loop, and document the current limitations / future TODO extensions, including min/max, conditional bumps, truncate/extend.
Added various minor comments.

llvm/lib/Analysis/IVDescriptors.cpp
814

Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?

Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
269

"Force" >> "Prefer"? If a reduction cannot be handled in loop, it is vectorized out of loop, rather than bailing out.

1082

collectInLoopReductions()? Perhaps worth holding a map of in loop reduction phi's with their chains.

1352

isInLoopReduction(Phi)?

1357

Simply call InLoopReductions.empty() directly?

1360

hasOutOfLoopReductions()? Outerloop stands for a non-innermost loop in a nest.

3829

isInLoopReductionPhi

3847

This is dead code if cmp/select chains are not recognized yet, as noted above.

3925

Checking if !UseInLoopReductions is redundant, as current in loop reductions have phi's of the same type as the recurrence; right?
May leave it as an assert, until in loop reductions handle truncation/extension.

3976–3978

Worth adding a comment that in loop reductions create their target reductions inside the loop using a recipe, rather than here in the middle block.

6696

Move this invariant check out as another early-exit?

7451

Iterate over in loop reductions?

7584

In loop reductions are currently disabled in fold tail. Checking also if hasOuterloopReductions() is an independent optimization that should be done separately from this patch.

7586

Iterate over in loop reductions?
Comment that they are ordered top-down, starting from the phi.

7589

This is currently dead code, no in loop reductions under fold tail.

7803

NewChain, NewAcc >> PrevInChain, NextInChain?

llvm/lib/Transforms/Vectorize/VPlan.h
240

Too bad this requires passing TTI through the State everywhere.
Perhaps storing TTI in the recipe would be somewhat better.

1037

Comment is redundant.

1039

private is redundant.

llvm/lib/Transforms/Vectorize/VPlanValue.h
47 ↗(On Diff #263027)

Why is this friendship needed?

llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll
43 ↗(On Diff #263027)

Optimizing this comparison away is independent of this patch.

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
2

It is indeed good to examine the diff in this review, but existing non -force-inloop-reductions RUN should be retained; e.g., by cloning the file for a separate -force-inloop-reductions RUN.

Would be good to also test with -force-vector-interleave greater than 1.

Thanks for the review! You rock. I'll update this as per the comments.

fhahn added inline comments.May 20 2020, 3:55 AM
llvm/lib/Analysis/IVDescriptors.cpp
801

nit: would be good to document what this function does

804

nit: would be good to add a message

llvm/lib/Transforms/Vectorize/VPlan.cpp
805

The ::print methods do not need to emit the " +\n" at the start and the "\\l\" parts after 4c8285c750bc

llvm/lib/Transforms/Vectorize/VPlan.h
43

Would a forward decl of ReductionDescriptor be sufficient or is there another need to include LoopUtils.h?

1041

would be good to document the members.

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
6–52

I think it would be also good to add a few additional simpler test cases, e.g. reductions with only 1 or 2 reduction steps in the loop, some with constant scalar operands. And with some of the (unnecessary trounces stripped). Also the test diff can be slightly reduced by having constant trip counts.

It's good to have the more complex ones, but it would also be good to start with simpler cases. That way it is probably easier to follow what's going on. And it would reduce the diff quite a bit if the tests for the various different opcodes were a bit more compact.
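
For example, a minimal single-step add reduction with a constant trip count (an illustrative sketch only, not one of the tests actually added to the patch) could look like:

  define i32 @reduction_sum(i32* %a) {
  entry:
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %sum = phi i32 [ 0, %entry ], [ %sum.next, %loop ]
    %gep = getelementptr inbounds i32, i32* %a, i64 %iv
    %val = load i32, i32* %gep, align 4
    %sum.next = add i32 %sum, %val
    %iv.next = add i64 %iv, 1
    %cmp = icmp ult i64 %iv.next, 256
    br i1 %cmp, label %loop, label %exit
  exit:
    ret i32 %sum.next
  }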

fhahn added inline comments.May 20 2020, 3:58 AM
llvm/lib/Analysis/IVDescriptors.cpp
814

Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only looking at a single use chain?

simoll added a subscriber: simoll.May 20 2020, 5:18 AM
dmgreen updated this revision to Diff 265371.May 20 2020, 4:09 PM
dmgreen marked 36 inline comments as done.
dmgreen edited the summary of this revision. (Show Details)

Round one. Lots of renaming, plenty of cleanup. I've tried to add Min/Max handling too, and added/cleaned up some more tests. Hopefully I've addressed most of the comments.

dmgreen added inline comments.May 20 2020, 4:45 PM
llvm/lib/Analysis/IVDescriptors.cpp
814

Yeah, that direction makes it a lot simpler. Thanks.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3847

I've added the code to handle minmax too (but have not tested it a lot yet; I will try that now).

MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds, on the other hand, are never slow enough to make the inloop reduction not worthwhile (and have other advantages like handling higher type sizes and folding in more instructions).

My point is that min/max, like some of the other fadd/mul/and/etc., might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if they're not used straight away.

3925

Right. We look from the phi down (now), so we can't go through the 'and' that we would need in order to detect the different bitwidth. That would leave the 'and' reduction, which I think would not go through that codepath to create type-promoted phis?

I've put an explicit check in for it to be sure, and added some testing for those cases.

6696

This does look a little strange here on its own. The followup patch to add the TTI hook makes it look like:

if (!PreferInloopReductions &&
      !TTI.preferInloopReduction(Opcode, Phi->getType(),
                                 TargetTransformInfo::ReductionFlags()))
    continue;
7451

Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?

It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and makes it clearer what is going on?

llvm/lib/Transforms/Vectorize/VPlan.h
240

I've changed it to be stored there. It does mean multiple things are holding TTI. Let me know what you think.

llvm/lib/Transforms/Vectorize/VPlanValue.h
47 ↗(On Diff #263027)

Ah. Old code no longer needed.

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
2

This testcase is new for this patch (although that wasn't obvious from how I presented it). I was just trying to show the diffs here and the option didn't exist before this patch, hence it wasn't already there.

Are you suggesting we should still have these same tests without -force-inloop-reductions? I think those should already be tested elsewhere, but I can add it if you think it's useful.

6–52

(unnecessary trounces stripped)?

The const case is going to look a little odd until there is an instcombine type fold for constant reductions. I'll see about adding that.

Otherwise it is now hopefully a bit leaner. Even if I have added a few extra tests.

Ayal added a comment.Jun 5 2020, 3:38 PM

The proposed RecurrenceDescriptor::getReductionOpChain() method identifies horizontal reductions, a subset of the vertical reductions identified by RecurrenceDescriptor::isReductionPHI(). Being two totally independent methods, it's unclear what the latter supports that the former does not. Would it be better to have isReductionPHI() also take care of recognizing horizontal reductions, and record them as in CastsToIgnore? See also TODO commented inline.

llvm/lib/Analysis/IVDescriptors.cpp
439–441

Perhaps the above "better way" would also help recognize and record horizontal reductions?

850

|| !Cur->hasNUses(ExpectedUses) ?

nit: can alternatively let getNextInstruction check its result and return only valid ones, e.g.:

bool RedOpIsCmp = (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp);
unsigned ExpectedUses = RedOpIsCmp ? 2 : 1;

auto getNextInstruction = [&](Instruction *Cur) -> Instruction * {
  if (!Cur->hasNUses(ExpectedUses))
    return nullptr;
  auto *FirstUser = cast<Instruction>(*Cur->user_begin());
  if (!RedOpIsCmp)
    return FirstUser->getOpcode() == RedOp ? FirstUser : nullptr;
  // Handle cmp/select pair: whichever of the two users is the select.
  auto *Sel = dyn_cast<SelectInst>(FirstUser);
  if (!Sel)
    Sel = dyn_cast<SelectInst>(*std::next(Cur->user_begin()));
  Value *LHS, *RHS;
  if (Sel && SelectPatternResult::isMinOrMax(
                 matchSelectPattern(Sel, LHS, RHS).Flavor))
    return Sel;
  return nullptr;
};

for (auto *Cur = getNextInstruction(Phi); Cur && Cur != LoopExitInstr;
     Cur = getNextInstruction(Cur))
  ReductionOperations.push_back(Cur);
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1495

... along with their associated chains of reduction operations,
in program order from top (PHI) to bottom?

3847

Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?

3925

'and' reduction may or may not be performed in a smaller type, if I'm not mistaken.

4276

Check in-loop reduction along with VF == 1, instead of modifying VF, as in other places? E.g.,

bool ScalarPHI = (VF == 1) || Cost->isInLoopReduction(cast<PHINode>(PN));

Type *VecTy =
    ScalarPHI ? PN->getType() : VectorType::get(PN->getType(), VF);

?

6714

The "else" clause will be empty in non-debug mode.
Can hoist the LLVM_DEBUG's, e.g.,

bool InLoop = !ReductionOperations.empty();
LLVM_DEBUG(dbgs() << "LV: Using " << (InLoop ? "inloop" : "out of loop") << " reduction for phi: " << *Phi << "\n");
7451

Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().

(Better early-exit via "if (!CM.isInLoopReduction(Reduction.first)) continue;")

7460

Add a comment that this is the compare operand of the select, whose recipe will also be discarded.

7582

Consider outlining this part, unless it can be shortened.

7584

Checking also if hasOuterloopReductions() is an independent optimization that should be done separately from this patch.

7586

Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().

7613

ChainOp can be simply set:

VPValue *ChainOp = Plan->getVPValue(Chain);

leaving only VecOp to figure out which operand index to use. Can preset the options before the loop:

RecurrenceDescriptor::RecurrenceKind Kind = RdxDesc.getRecurrenceKind();
bool IsMinMax = (Kind == RecurrenceDescriptor::RK_IntegerMinMax ||
                 Kind == RecurrenceDescriptor::RK_FloatMinMax);
unsigned FirstOpId = IsMinMax ? 1 : 0;
unsigned SecondOpId = FirstOpId + 1;

and then do

auto *FirstOp = R->getOperand(FirstOpId);
auto *SecondOp =  R->getOperand(SecondOpId);
VPValue *VecOp = Plan->getVPValue(FirstOp == Chain ? SecondOp : FirstOp);
llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
2

ok, understood; no need to retain non-existing tests.

dmgreen marked 15 inline comments as done.
dmgreen added inline comments.
llvm/lib/Analysis/IVDescriptors.cpp
439–441

Hmm. The reason I kept them separate was that this method is already pretty complex. I was trying to keep things simpler. Adding the ability to detect a single chain of operations from Phi to LoopExitValue that can be used for horizontal reductions looks... difficult. And error prone. :) If you think it's worth it then I can certainly give it a go! I like the separation of concerns in keeping them separate though.

The extra things that AddReductionVar will detect that getReductionOpChain will not are:

  • Phi/select predicated reductions like in if-conversion-reductions.ll and if-reduction.ll. These would need some form of predicated reduction intrinsic.
  • Narrow bitwidths. This one I could add.
  • Subs/FSubs are treated like Adds/FAdds for vertical reductions.
850

This is the loop exit instr, so it can have as many uses as it likes, I believe.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3847

-Os sounds like a good plan. It will take some backend work to make it efficient enough first though. And predicated reductions?

3925

The check for lookThroughAnd in AddReductionVar is guarded by isArithmeticRecurrenceKind, so at least currently it cannot be an And reduction.

7451

I believe that InloopReductionChains would not iterate in a deterministic order, which is why I avoided it.

Perhaps that would not matter here? The reductions should be independent anyway. Still, it seems safer to use deterministic ordering if we can.

7584

Ah sorry. I split it off but apparently didn't upstream the other part yet. Now in D81415.

dmgreen updated this revision to Diff 269300.Jun 8 2020, 11:38 AM
dmgreen marked 2 inline comments as done.

Updates

bmahjour removed a subscriber: bmahjour.Jun 22 2020, 6:47 AM
Ayal added inline comments.Jul 6 2020, 1:15 PM
llvm/lib/Analysis/IVDescriptors.cpp
439–441

Would be good to see if the above TODO can be addressed - providing the set of all instructions that take part in the reduction. This set could then be used for checking in-loop reductions. Hopefully this could help simplify both, and keep them in some sync. But this could be done later, possibly with another TODO.

814

Is treating sub as an add reduction something in-loop reduction could support as a future extension?

850

Ahh, ok. (It should have ExpectedUses+1 users, the extra one being in lcssa.)

"instruciton"

858

Is Subs the only issue?
Can check this earlier, before traversing the chain, although it is pushed back last, here.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1082

Thanks. Worth updating the comment.

3847

Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in tests below), so could presumably lead to code of smaller size? At least for in-loop chains of a single link.

And predicated reductions?

These are yet to be handled in-loop, right?

6696

Then better placed above right after defining Phi?

7451

Agreed it would be better to use deterministic ordering. How about letting InloopReductionChains be a MapVector and iterate over
for (auto &Reduction : CM.getInloopReductions())?
The number of reductions is expected to be small, w/o removals.

7651

This is the other potential use of for (auto &Reduction : CM.getInloopReductions()).

dmgreen updated this revision to Diff 275979.Jul 7 2020, 3:41 AM
dmgreen marked 12 inline comments as done.

Thanks for taking another look.

llvm/lib/Analysis/IVDescriptors.cpp
814

Hmm. I don't want to say never. A normal inloop reduction looks like:

p = PHI(0, a)
l = VLDR (..)
a = VADDVA(p, l)

Where the VADDV is an across-vector reduction, and the extra A means also add p. Reducing a sub would need to become:

p = PHI(0, a)
l = VLDR (..)
a = VADDV(l)
p = SUB(p, a)

With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than an out-of-loop reduction.

But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that previously would not be vectorized... that may lead to an overall gain that overcomes the extra cost of adding the sub to the loop. It will require some very careful costing I think. And maybe the ability to create multiple vplans and cost them against one another :)

858

I believe sub is the only issue, unless I am forgetting something.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3847

And predicated reductions?

These are yet to be handled in-loop, right?

Yep. It will need a predicated reduction intrinsic: a vecreduce that takes a mask. That will allow us to tail-fold the reductions with trip counts that are not a multiple of the vector factor, which will make them look a lot better under -Os. And nice in general I think once it all starts being tail predicated.

The backend work I was mentioning was that we need to more efficiently transform

x = min(vecreduce.min(z), y)

into

x = VMINV(y, z)

Where y is (confusingly) accumulated in this case (even though the instruction doesn't have an A suffix). We currently generate

x = min(VMINV(UINT_MAX, z), y)

Once that is sorted out then, yep, using these for Os sounds like a good plan.

6696

I can put it somewhere sensible in this patch and move it in the next :)

7451

MapVector sounds good. I've changed it to use that and tried to use it in a few more places. Let me know what you think.

Ayal added inline comments.Jul 7 2020, 12:33 PM
llvm/lib/Analysis/IVDescriptors.cpp
814

An original sub, say acc -= a[i], can be treated as acc += (-a[i]). This could be reduced in-loop by first negating the a[i]'s at LV's LLVM-IR level, presumably lowered later to something like

p = PHI(0, a)
l = VLDR (..)
s = VSUBV (zero, l)
a = VADDVA(p, s)

, right?

844

rephrase sentence so it parses

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3829

Ahh, this should actually be capitalized IsInLoopReductionPhi

3847

Re: predicated reductions - could they be handled by replacing masked-off elements with Identity using a select prior to reduction? To be potentially folded later by suitable targets into a predicated reduction operation which they may support.
Somewhat akin to "passthru" values of masked loads.

7451

Uses as MapVector look good to me, thanks. Can also retain isInLoopReduction(PHI).

7457

recode >> record

dmgreen updated this revision to Diff 276198.Jul 7 2020, 2:07 PM
dmgreen marked 4 inline comments as done.

Reinstate isInLoopReduction and fixup some wording/typos/capitalisation.

llvm/lib/Analysis/IVDescriptors.cpp
814

Yep. We would have the option of trading a scalar instruction for a vector instruction + an extra register (to hold the 0; we only have 8 registers!)

Unfortunately both would be slower than an out-of-loop reduction unless we were vectorizing at a higher factor, though.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3847

Oh. So select

s = select m, a, 0
v = vecreduce.add s

to a predicated vaddv?

Yeah, sounds interesting. I'll look into that as an alternative to predicated intrinsics. Nice suggestion.
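
In IR terms, the fold being discussed is presumably something along these lines (illustrative only, not code from the patch):

  %s = select <4 x i1> %m, <4 x i32> %a, <4 x i32> zeroinitializer
  %v = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %s)

which a target could then match to a predicated VADDV.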

Ayal added inline comments.Jul 7 2020, 3:05 PM
llvm/lib/Analysis/IVDescriptors.cpp
814

ok, so sub's can be handled in-loop, but doing so is expected to be more costly than out-of-loop, at least if a horizontal add operation is to be used rather than a horizontal subtract; probably worth a comment.
If a reduction chain has only sub's, they could all sink - negating the sum once after the loop, using VADDVA inside. Doing so, however, will retain the middle block, i.e., without decreasing code size.
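
A rough IR sketch of that sink-the-subs idea, assuming the chain only ever subtracts loaded values from the accumulator (%start is the scalar start value; all names here are invented):

  vector.body:
    %acc = phi i32 [ 0, %vector.ph ], [ %acc.next, %vector.body ]
    %wide.load = load <4 x i32>, <4 x i32>* %ptr, align 4
    %part = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %wide.load)
    %acc.next = add i32 %acc, %part        ; adds inside the loop (VADDVA)
    ...
  middle.block:
    %res = sub i32 %start, %acc.next       ; subtract the accumulated sum once after the loop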

dmgreen updated this revision to Diff 276364.Jul 8 2020, 3:52 AM

Update a FIXME with sub info.

gilr added inline comments.Jul 8 2020, 6:42 AM
llvm/lib/Transforms/Vectorize/VPlan.h
240

It seems that TTI is only used later for deciding whether to use a shuffle sequence or an intrinsic based on data available during planning. If so, then it would be best if the Planner calls TTI->useReductionIntrinsic() and records that boolean decision in the Recipe. This is also required in order to estimate in-loop reduction cost. This could be done separately.

dmgreen marked an inline comment as done.Jul 9 2020, 10:15 AM
dmgreen added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.h
240

Do you mean to change the interface to createTargetReduction, to take a bool instead? Yeah I think that sounds good. I'd prefer to do it as a separate review as it does involve changing the interface. I will put a patch together.

I was imagining that we would change the cost to use getArithmeticReductionCost, which hopefully handles the details of how the target lowers reductions. I haven't looked deeply into the details yet though. That is on the list of things to do.

Ayal accepted this revision.Jul 12 2020, 7:35 AM

This looks good to me, thanks! With a last couple of nits.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6700

nit: "... that leads from the loop exit value back.." - chain is now found top-down.

7463

nit: can record the recipe of Phi first, just to follow chain order.

7689

nit: can assert CompareRecipe->getVPRecipeID()

This revision is now accepted and ready to land.Jul 12 2020, 7:35 AM

Thanks. I will update the patch, but I will wait until at least after the branch before I commit it.

D83646 is an attempt at changing the interface to createTargetReduction, so that we don't need to store TTI in the reduction recipe.

dmgreen updated this revision to Diff 278689.Jul 17 2020, 3:00 AM
dmgreen marked 2 inline comments as done.

Updates for those last nitpicks.

This revision was automatically updated to reflect the committed changes.

Thanks. Sorry about that; it appears clang doesn't like this code as much as gcc did, and then my internet went out whilst I was trying to figure out what was wrong.

I will try to run some extra testing using different compilers.