
[LoopVectorizer] Inloop vector reductions
Needs ReviewPublic

Authored by dmgreen on Feb 24 2020, 11:17 AM.



Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128-bit vectors, sign-extend the inputs to i32, multiply them together and sum the result into a 32-bit general purpose register. So taking 16 i8's as inputs, they can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32-bit registers treated together as a single 64-bit integer (even though MVE does not have a plain 64-bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths, and to vectorize things we previously could not.

In order to do that we need a way to represent that the "reduction" operation, specified with an llvm.experimental.vector.reduce intrinsic when vectorizing for Arm, occurs inside the loop, not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe representing the reductions, which replaces the original chain of reduction operations. Cost modelling is currently just done through a "prefersInloopReduction" TTI hook (which follows in a later patch).

Diff Detail

Event Timeline

dmgreen created this revision.Feb 24 2020, 11:17 AM
fhahn added a comment.Feb 28 2020, 1:42 PM

This patch does not work like that yet though, as I am unsure how that should really look, and it did not seem simple yet to create one vplan from another. The def-use relations are usually not in place, for example. Neither is the costing of vplans against one another, which makes sense.

That's correct. Work on modelling the def-use relationships exclusively in VPlan is currently underway, but it probably does not make sense to wait with this patch until the transition is complete.

Some small high-level comments:

  • I think it's best to move the TTI hook into a separate change
  • Do you know how 'experimental' the reduction intrinsics still are? It would be good to move them out of experimental at some point, especially when making use of them in LV.

I'm currently traveling, but I'll try to take a closer look next week or the week after (please ping in case I forget ;))

Thanks for taking a look!

I noticed a comment in VPValue that says:

DESIGN PRINCIPLE: Access to the underlying IR must be strictly limited to
the front-end and back-end of VPlan so that the middle-end is as
independent as possible of the underlying IR. We grant access to the
underlying IR using friendship. In that way, we should be able to use VPlan
for multiple underlying IRs (Polly?) by providing a new VPlan front-end,
back-end and analysis information for the new IR.

That's the part that confused me about optimising vplans in-place. Does that mean that we have to replicate all the opcode/flags data into vplan too, not just the use-defs, so it knows what is an Add and what is a Mul (and whether they are nsw etc)? What would you say counts as "middle-end" for vplan? I would presume the cost modelling and creating vplans from one another, but I didn't follow vplan's early days very closely.

I think it's best to move the TTI hook into a separate change

Can do. I have some other changes that I was trying out too about adding a VPReductionPhi. Not sure it makes sense yet though.

Do you know how 'experimental' the reduction intrinsics still are? It would be good to move them out of experimental at some point, especially when making use of them in LV.

As far as I understand, on architectures that request it (AArch64 and Arm MVE now), we have been using the "experimental" reduction intrinsics since they were added, in the code generated after the vector body. There is a useReductionIntrinsic which the targets can override to prefer them over the IR equivalents.

I would like to add a mask to the reduction intrinsics. That will be important for getting tail-predicated inloop reductions to work. Whether this is done by adjusting the existing ones or adding new ones, I'm not sure. Obviously some architectures do not handle the mask, only handling an all-ones mask. After that is sorted out, I agree it would be nice to drop the "experimental".

dmgreen updated this revision to Diff 247848.Mar 3 2020, 4:06 AM

Taken out the TTI change and simplified the tests, just showing the diffs here now.

dmgreen updated this revision to Diff 263027.Sat, May 9, 10:55 AM

Rebase and a bit of a cleanup. Now uses VPValues for the VPRecipe operands.

Ping. Anyone have any thoughts/comments about this one? Does it at least look like it's going in the right direction, or does anyone have some better suggestions?

Ayal added a comment.Tue, May 19, 4:12 PM

Ping. Anyone have any thoughts/comments about this one? Does it at least look like it's going in the right direction, or does anyone have some better suggestions?

Yes, this direction does look right!
Would be good to clarify that the unrolled Parts are still reduced out of loop, and document the current limitations / future TODO extensions, including min/max, conditional bumps, truncate/extend.
Added various minor comments.


Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?

Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.


"Force" >> "Prefer"? If a reduction cannot be handled in loop, it is vectorized out of loop, rather than bailing out.


collectInLoopReductions()? Perhaps worth holding a map of in loop reduction phi's with their chains.




Simply call InLoopReductions.empty() directly?


hasOutOfLoopReductions()? "Outer loop" stands for a non-innermost loop in a nest.




This is dead code if cmp/select chains are not recognized yet, as noted above.


Checking if !UseInLoopReductions is redundant, as current in loop reductions have phi's of the same type as the recurrence; right?
May leave it as an assert, until in loop reductions handle truncation/extension.


Worth adding a comment that in loop reductions create their target reductions inside the loop using a recipe, rather than here in the middle block.


Move this invariant check out as another early-exit?


Iterate over in loop reductions?


Iterate over in loop reductions?
Comment that they are ordered top-down, starting from the phi.


In loop reductions are currently disabled in fold tail. Checking also if hasOuterloopReductions() is an independent optimization that should be done separately of this patch.


This is currently dead code, no in loop reductions under fold tail.


NewChain, NewAcc >> PrevInChain, NextInChain?


Too bad this requires passing TTI through the State everywhere.
Perhaps storing TTI in the recipe would be somewhat better.


Comment is redundant.


private is redundant.

47 ↗(On Diff #263027)

Why is this friendship needed?

43 ↗(On Diff #263027)

Optimizing this comparison away is independent of this patch.


It is indeed good to examine the diff in this review, but existing non -force-inloop-reductions RUN should be retained; e.g., by cloning the file for a separate -force-inloop-reductions RUN.

Would be good to also test with -force-vector-interleave greater than 1.

Thanks for the review! You rock. I'll update this as per the comments.

fhahn added inline comments.Wed, May 20, 3:55 AM

nit: would be good to document what this function does


nit: would be good to add a message


The ::print methods do not need to emit the " +\n" at the start and the "\\l\" parts after 4c8285c750bc


Would a forward decl of ReductionDescriptor be sufficient or is there another need to include LoopUtils.h?


would be good to document the members.


I think it would be also good to add a few additional simpler test cases, e.g. reductions with only 1 or 2 reduction steps in the loop, some with constant scalar operands. And with some of the (unnecessary trounces stripped). Also the test diff can be slightly reduced by having constant trip counts.

It's good to have the more complex ones, but it would also be good to start with simpler cases. That way it is probably easier to follow what's going on. And it would reduce the diff quite a bit if the tests for the various different opcodes would be a bit more compact.

fhahn added inline comments.Wed, May 20, 3:58 AM

Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only looking at a single use chain?

simoll added a subscriber: simoll.Wed, May 20, 5:18 AM
dmgreen updated this revision to Diff 265371.Wed, May 20, 4:09 PM
dmgreen marked 36 inline comments as done.
dmgreen edited the summary of this revision. (Show Details)

Round one. Lots of renaming, plenty of cleanup. I've tried to add Min/Max handling too, and added/cleaned up some more tests. Hopefully addressed most of the comments.

dmgreen added inline comments.Wed, May 20, 4:45 PM

Yeah, that direction makes it a lot simpler. Thanks.


I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).

MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds, on the other hand, are never slow enough to make the inloop reduction not worthwhile (and it has other advantages, like handling higher type sizes and folding in more instructions).

My point is that min/max, like some of the other fadd/mul/and/etc, might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if they're not used straight away.


Right. We look from the phi down (now), so we can't go through the 'and'; we would need to detect the different bitwidth. That would leave the 'and' reduction, which I think would not go through that codepath to create type-promoted phis?

I've put an explicit check in for it to be sure, and added some testing for those cases.


This does look a little strange here on it's own. The followup patch to add the TTI hook makes it look like:

if (!PreferInloopReductions &&
      !TTI.preferInloopReduction(Opcode, Phi->getType(),

Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?

It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and makes it clearer what is going on.


I've changed it to be stored there. It does mean multiple things are holding TTI. Let me know what you think.

47 ↗(On Diff #263027)

Ah. Old code no longer needed.


This testcase is new for this patch (although that wasn't obvious from how I presented it). I was just trying to show the diffs here and the option didn't exist before this patch, hence it wasn't already there.

Are you suggesting we should still have these same test for without -force-inloop-reductions? I think those should already be tested elsewhere, but I can add it if you think it's useful.


(unnecessary trounces stripped)?

The const case is going to look a little odd until there is a instcombine type fold for constant reductions. I'll see about adding that.

Otherwise it is now hopefully a bit leaner. Even if I have added a few extra tests.