Page MenuHomePhabricator

[LoopVectorize] Add strict in-order reduction support for fixed-width vectorization
ClosedPublic

Authored by kmclaughlin on Mar 11 2021, 9:47 AM.

Details

Summary

Previously we could only vectorize FP reductions if fast math was enabled, as this allows us to
reorder FP operations. However, it may still be beneficial to vectorize the loop by moving
the reduction inside the vectorized loop and making sure that the scalar reduction value
be an input to the horizontal reduction, e.g:

%phi = phi float [ 0.0, %entry ], [ %reduction, %vector_body ]
%load = load <8 x float>
%reduction = call float @llvm.vector.reduce.fadd.v8f32(float %phi, <8 x float> %load)

This patch adds a new flag (IsOrdered) to RecurrenceDescriptor and makes use of the changes added
by D75069 as much as possible, which already teaches the vectorizer about in-loop reductions.
For now in-order reduction support is off by default and controlled with the -enable-strict-reductions flag.

Diff Detail

Event Timeline

kmclaughlin created this revision.Mar 11 2021, 9:47 AM
kmclaughlin requested review of this revision.Mar 11 2021, 9:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2021, 9:47 AM

Some minor nits. Ran out of time for a more thorough review at this moment.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1081

Nit: this variable appears to be unused?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
334

nit: As 'reduction' is spelt out on the adjacent arguments I would prefer consistency over brevity unless there is a reason to differ here.

fhahn added inline comments.Mar 11 2021, 2:02 PM
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
2

+sve is not needed for the test?

Can you also add a negative test?

119

I don't think this test is testing what you intended it to. The condition > 0.5f is in the loop pre-header. I think it should be in the loop body, so it tests generating the correct mask? (The C source here is a bit misleading IMO)

david-arm added inline comments.Mar 12 2021, 12:53 AM
llvm/lib/Analysis/IVDescriptors.cpp
215

HI @kmclaughlin, I think you can combine the two IsOrdered = statements into one by declaring the variable below the if (auto *EIP = ...) { ... } block.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4394–4405

nit: I think you can actually now just kill the { brace here and the other one } at line 4377. That wil avoid the additional indentation.

9235

nit: I think you can just do:

Value *C = Builder.getInt1(1);
9246–9250

nit: You could avoid some of the extra indentation below if you changed the } else { to something like:

} else if (IsOrdered)
  NextInChain = NewRed;
} else {
  ...
}
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
119

Hi @fhahn, I suspect it's because when compiled with clang the condition is hoisted out before we reach the vectoriser, but you're right I think that the condition should be in the loop. At the moment this test is really the same as @fadd_strict.

Hello. Fantastic to see this getting used in more cases. In-order reductions sound like a great use of it.

llvm/include/llvm/Analysis/IVDescriptors.h
258

Am I correct in saying that Ordered reductions are a subset of floating point reductions that

  • don't have AllowReassoc on the fadd
  • have only a single fadd added in each loop iteration (but possibly predicated to be one of several additions). So either fadd(redphi(..)) or phi(fadd(redphi), redphi) or phi(fadd(redphi), fadd(redphi)) ?

And that we can't easily get this info from just the FMF or the ExactFPMathInst?

llvm/lib/Analysis/IVDescriptors.cpp
204

I think this may need to check all branches of the phi, if I'm understanding what this is doing exactly. They could have different instructions down each path with different flags (although that would be quite rare I suspect).

There might also be selects, if this is matching on if-block phis. I'm not sure if that is handled in AddReductionVar though.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4406

Cost->isInLoopReduction(Phi) -> IsInLoopReductionPhi

Does this work in-order for UF > 1? I feel like the order of the adds will be changed, so that it's no-longer in-order.

If not, is this code needed, if it can only be using UF == 1 and ReducedPartRdx is set to State.get(LoopExitInstDef, 0) above.

9237

I'm suspecting that the MaskOfOnes isn't needed, and any masking needed would be handled by the Select created above? Otherwise it can use the Mask from the Cond operand to detect when the instruction needs to be predicated and when not.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
9

Note that the identity value for a fadd (without nsz) should be -0.0, not 0.0. Otherwise 0.0 + n doesn't always equal n. https://alive2.llvm.org/ce/z/LS2RK3

I believe the select would only be needed if the reduction is predicated though, as mentioned above.

spatel added inline comments.Mar 16 2021, 4:49 AM
llvm/include/llvm/Analysis/IVDescriptors.h
258

I have the same question. I was looking at this code recently ( 36a489d19475 , 1bee549737ac ) -- but I'm still not sure if we are behaving correctly or optimally.

The "ExactFPMathInst" seems to provide the same information. I experimented with changing that to a bool. The only reason we appear to save an instruction pointer rather than a bool is so we can provide a prettier optimization remark in the case a loop is not vectorized. Ie, we say something like: "The FP math at line 42 is not associative" rather than "The loop starting at line 39 requires exact FP math".

david-arm added inline comments.Mar 16 2021, 5:02 AM
llvm/include/llvm/Analysis/IVDescriptors.h
258

I think the IsOrdered flag here is more of a convenience so that we can avoid calling the more expensive checkOrderedReduction function every time we may want to use strict, in-order reduction intrinsics. The checkOrderedReduction does cast the instruction to a FPMathOperator and look for the allows-reassoc flag.

david-arm added inline comments.Mar 16 2021, 9:06 AM
llvm/include/llvm/Analysis/IVDescriptors.h
258

Hi @spatel, oh I see now what you mean. I didn't realise we now had a ExactFPMathInst in RecurrenceDescriptor. It looks like you're saying instead of setting the IsOrdered flag we can just set ExactFPMathInst to the instruction in ReduxDesc, which then gets passed on to the RecurrenceDescriptor.

dmgreen added inline comments.Mar 16 2021, 9:26 AM
llvm/include/llvm/Analysis/IVDescriptors.h
258

As far as I understand, we still can not vectorize in-order reductions with multiple adds in the loop. Something like https://godbolt.org/z/c9qd1v. The ordering would change if we tried.

So the flag may still be needed for the second bullet point above, which is a large part of the checkOrderedReduction. Either that or a way to compute that there is only a single fadd (possibly through a select/if block phi).

kmclaughlin marked 8 inline comments as done.
kmclaughlin edited the summary of this revision. (Show Details)
  • Changed the name of the flag to -enable-strict-reductions
  • Removed unnecessary changes to create a mask as this was already handled by VPReductionRecipe::execute
  • Simplified checkOrderedReductions for this patch. For now, if Exit is a Phi node we will not set IsOrdered to true.
  • Removed +sve from the RUN line of the tests
  • Added a negative test (@fadd_multiple)
  • Added more tests

Thanks all for reviewing this!

llvm/include/llvm/Analysis/IVDescriptors.h
258

Hi @dmgreen & @spatel, I also didn't realise that ExactFPMathInst is very similar and already checks the hasAllowReassoc flag. I've left the IsOrdered flag in this patch as I think there is still a need for it in the case mentioned above where there is a chain of fadds (I've added a test for this to strict-fadd.ll). I also changed the conditions in checkOrderedReduction slightly, to check if Exit == ExactFPMathInst.

llvm/lib/Analysis/IVDescriptors.cpp
204

I realised whilst trying to add a better test than @fadd_conditional_rdx for conditional reductions that with the tests I have written this part of the function is redundant.
I've removed it from this patch for now since I don't think I can write any tests for it yet - I believe in addition to looking through Phi nodes here and checking all branches, I would also need to change getReductionOpChain to look through Phis and Selects in order to vectorize with inloop reductions, is this correct?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4406

The changes I made to VPReductionRecipe::execute should ensure that ordering is enforced when the UF > 1, through the way the chain is constructed for each unrolled part. I was missing a change to fixReduction to correctly set the incoming values of Phi for each unrolled part in my first patch, however I believe this is fixed now and the test for unrolling (@fadd_strict_unroll) has been updated accordingly.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
2

Hi @fhahn, I've added a negative test where we have multiple fadds in the loop (@fadd_multiple) & removed +sve

9

I've created a separate patch which changes the identity value to -0.0 (D98963)

119

I tried to write a test where the condition is in the loop, but I don't think this is possible yet with just the changes in this patch (I believe there would also be some changes needed to getReductionOpChain to try and look through Phis before I could try this).
Since this isn't testing anything that @fadd_strict doesn't already, I've removed it and instead added @fadd_predicated (which uses loop.vectorize.predicate.enable) and @fadd_conditional to test the masking in VPReductionRecipe::execute with in-order reductions.

dmgreen added inline comments.Wed, Mar 24, 1:15 AM
llvm/lib/Analysis/IVDescriptors.cpp
204

Oh yeah, that sounds like it might be correct. We handle predicated reductions, but were more interested in tail-folding for MVE. Predication through a select/if phi might well need more work.

207

IsOrdered &= (LHS == Phi) || (RHS == Phi);

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4301

Can this use isScalar(), or is it to handle scalable single items too?

We generate multiple phi's, but only use the first one? The others get DCE'd?

4406

Yeah, looks OK I think. All the back-to-back operations might be slow, forced down a critical path, but the order seems good.

david-arm added inline comments.Wed, Mar 24, 2:32 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4301

Yes, I think in this case we can just use

State.VF.isVector()

since that also covers the case when VF=(1,scalable)

kmclaughlin marked 2 inline comments as done.
  • Addressing review comments & fixing clang-format warnings
kmclaughlin added inline comments.Wed, Mar 24, 8:25 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4301

Yes, we generate multiple phis and the unused phis are removed by InstCombine.

Changed this to use State.VF.isVector() as suggested above.

Thanks. This LGTM, if there are no other comments.

llvm/include/llvm/Analysis/IVDescriptors.h
258

Can we clarify the comment, to specify that it means this reduction can be treated like an inorder reduction, and what that currently means.

david-arm accepted this revision.Fri, Mar 26, 1:34 AM

LGTM as well! Can address Dave's comment before merging I think?

This revision is now accepted and ready to land.Fri, Mar 26, 1:34 AM
This revision was landed with ongoing or failed builds.Tue, Apr 6, 6:46 AM
This revision was automatically updated to reflect the committed changes.
kmclaughlin marked an inline comment as done.
fhahn added inline comments.Thu, Apr 8, 4:36 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
333

What's the plan for this option going forward? Would it be safe to remove it and let the cost model deal with deciding whether it is worth to vectorize with strict reductions only?

4301

I think there's test coverage missing for this code. All tests pass if this change gets removed. @kmclaughlin could you add a test?

kmclaughlin added inline comments.Thu, Apr 15, 8:50 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
333

Hi @fhahn, I think there is more testing needed before we can be confident in removing the -enable-strict-reductions flag, for instance adding more tests for scalable types and running LNT with the flag enabled. After this, the plan is to remove the option (after making any necessary changes to the cost model to decide if using in-order reductions is beneficial).

4301

The @fadd_strict_unroll test should have been testing this part of fixReduction(), but there was a mistake in the CHECK lines I added for the vector.reduce.fadd intrinsics. I've pushed rG93f54fae9dda to address this, and also created D100570 to try and prevent multiple unused Phis from being generated to begin with.