This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Add vector reduction support for fmuladd intrinsic
ClosedPublic

Authored by RosieSumpter on Oct 11 2021, 8:46 AM.

Details

Summary

Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Depends on D112548

Diff Detail

Event Timeline

RosieSumpter created this revision.Oct 11 2021, 8:46 AM
RosieSumpter requested review of this revision.Oct 11 2021, 8:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2021, 8:46 AM

Note: changes to the cost model for ordered reductions will follow in a separate patch.

  • Added FMulAdd as an accepted recurrence kind to AArch64TTIImpl::isLegalToVectorizeReduction so that scalable vectorization is enabled for llvm.fmuladd
  • Updated the scalable-strict-fadd.ll test.

Just a flyby review triggered by commenting on D111630 so my comments are more stylistic in nature rather than digging into the technical details.

llvm/include/llvm/Analysis/IVDescriptors.h
267–268

I feel like isa<IntrinsicInst>(I) && cast<IntrinsicInst>(I)->getIntrinsicID() == Intrinsic::fmuladd would be a cheaper way to get the same result?

llvm/lib/Analysis/IVDescriptors.cpp
205–214

Just a suggestion but given this function no longer has a single instruction to care about perhaps it's worth being more explicit. For example:

if (Kind == RecurKind::FAdd && Exit->getOpcode() != Instruction::FAdd)
  return false;
if (Kind == RecurKind::FMulAdd && RecurrenceDescriptor::isFMulAddIntrinsic(Exit))
  return false
if (Exit != ExactFPMathInst)
  return false;
218–223

Much like my previous suggestion what about:

if (Kind == RecurKind::FAdd && Op0 != Phi && Op1 != Phi)
  return false;
if (Kind == RecurKind::FMulAdd && Exit->getOperand(2) != Phi)
  return false;
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9825

I'm assuming this change will remain live after this point and thus affect nodes created after exiting this function? Which would be bad. In any case I think that instead of calling CreateBinOp you can instead call CreateFMulFMF which will propagate the FMF flags for you.

llvm/lib/Analysis/IVDescriptors.cpp
205–214

Or rather:

if (Kind == RecurKind::FAdd && Exit->getOpcode() != Instruction::FAdd)
  return false;
if (Kind == RecurKind::FMulAdd && !RecurrenceDescriptor::isFMulAddIntrinsic(Exit))
  return false
if (Exit != ExactFPMathInst)
  return false;

:)

dmgreen added inline comments.Oct 17 2021, 4:17 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9641

Would it be possible to create a FMul VPInstruction and a VPReductionRecipe? That way the VPlan better represents the final instructions.

fhahn added inline comments.Oct 18 2021, 2:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9641

+1, that should hopefully help to remove some of the special handling for the FMulAdd from codegen.

9824–9825

This is a nice cleanup and could be split off as simple NFC.

RosieSumpter marked 4 inline comments as done.Oct 26 2021, 8:31 AM
RosieSumpter added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9641

Hi @dmgreen, thanks for the suggestion. Would you (or @fhahn) mind elaborating a bit on what you would expect this to look like? I see the point about wanting the FMul instruction to be present in the VPlan, but having spoken to @david-arm about it it seems this might mean the VPReductionRecipe having two underlying instructions - is this what you would expect? Any pointers you have would be very useful!

9824–9825

I've added an NFC patch for this here D112547

9825

Hi @paulwalker-arm thanks for the comments. It turns out this exposed a problem with fast-math flags not being propagated for ordered reductions, so I've added a patch for that here D112548

RosieSumpter marked an inline comment as done.
RosieSumpter added a reviewer: paulwalker-arm.
  • Rewritten IsFMulAddIntrinsic
  • Made conditions more explicit in checkOrderedReduction
  • Splits out NFC and fast-math flags changes so this patch now builds on D112547 and D112548
dmgreen added inline comments.Oct 28 2021, 12:29 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9641

I think there would be two VPRecipes, a VPReductionRecipe and a VPInstruction representing the fmul. That way the VPReductionRecipe is not trying to represent both, and it should remove some of the extra complexity from VPReductionRecipe because it won't need multiple vector inputs. The fmul VPInstruction creates the final fmul instruction, which will have the two inputs.

  • Create a VPInstruction to represent the FMul
  • Removed changes to VPReductionRecipe
  • In order to propagate fast-math flags, added LoopVectorizationPlanner as a friend class to VPInstruction (so that setUnderlyingInstr() can be used) and added a method hasUnderlyingInstr()
david-arm added inline comments.Nov 1 2021, 2:49 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9645

Hi @RosieSumpter, I realise you were asked to make this change, but it also doesn't feel right to be setting an underlying instruction here because there isn't one really. The underlying instruction is the fmuladd call and is already added to the VPReductionRecipe, so adding to two recipes feels a bit dangerous? Perhaps we should be adding a new interface to VPInstruction instead that allows setting the 'FastMathFlags' for the instruction?

fhahn added inline comments.Nov 1 2021, 2:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9645

Agreed, we shouldn't set the underlying instruction explicitly here, visitbility is intentionally restricted. I don't think the FMF changes should be pulled into this change. Can the setting of FMFs be moved to a follow-up patch?

  • Added a setFastMathFlags method to VPInstruction instead of making LoopVectorizationPlanner a friend
RosieSumpter added inline comments.Nov 2 2021, 2:37 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9645

Hi @fhahn, thanks for the comment. I have instead added a setFastMathFlags method to the VPInstruction class. At the moment I've left it as part of this patch as, after discussion with @david-arm, it seems that it may not be ideal to split out this change given that this would mean submitting code that requires a fix. Also, it doesn't look like VPInstruction has been used for FP operations elsewhere, so currently this change is only used in the case of fmuladd being used. If you do still think it would be better as a follow-up patch though I'm happy to do that instead.

fhahn added inline comments.Nov 3 2021, 1:36 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9644

You should be able to use an initialiser list here {VecOp, Plan->getVPValue(R->getOperand(1)}.

9645

Could you elaborate what you mean by 'requires a fix'? The code would still be correct without FMFs, just be not as optimal as it could be, right?

It's quite common to split out changes not directly related into separate patches as this makes it easier to review them individually rather than further extending an already big patch.

david-arm added inline comments.Nov 3 2021, 2:13 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9645

Hi @fhahn, @RosieSumpter has split out other patches before so I think she is aware why we sometimes split them up. It was actually my suggestion that we might want to keep this in the same patch. I think it depends upon your perspective about whether adding the flags is a "nice to have" or "just broken". Personally to me it feels like the latter because we're not propagating user-specified flags and getting the requested behaviour. If it has to be split out that's fine, but it does feel a bit counter-intuitive.

I think you can remove the "strict" from the title and summary of this patch, if I'm understanding what strict means here. As far as I understand it should enable vectorization for strict (inloop) and non-strict (out of loop/fast) reductions of llvm.fmuladd, which is nice.
https://godbolt.org/z/xTe89GEfM

llvm/lib/Analysis/IVDescriptors.cpp
1059

Is this just returning true because it believes the only instruction found in a fmuladd reduction chain will be a llvm.fmuladd? Could that change in the future if it was able to recognize fadd and fmuladd as single reduction sequence?

Would it be better to check the instruction is isFMulAddIntrinsic, to be on the safe side?

RosieSumpter marked 2 inline comments as done.
RosieSumpter retitled this revision from [LoopVectorize] Add strict reduction support for fmuladd intrinsic to [LoopVectorize] Add vector reduction support for fmuladd intrinsic.
RosieSumpter edited the summary of this revision. (Show Details)
  • Moved setFastMathFlags change to a follow-up patch D113125
  • Used isFMulAddIntrinsic in place of RecurKind::FMulAdd for safety
  • Used an initializer list for the fmul operands

I think you can remove the "strict" from the title and summary of this patch, if I'm understanding what strict means here. As far as I understand it should enable vectorization for strict (inloop) and non-strict (out of loop/fast) reductions of llvm.fmuladd, which is nice.

Hi @dmgreen, thanks for pointing this out, you're right (and the CHECK-UNORDERED lines of the strict tests show the out-of-loop case). I've changed the title and summary.

I think, unlike the other opcodes in a reduction chain, we may need to check that the operand number is correct. The other opcodes are commutative so it doesn't matter which of the operands the reduction passes through, but for fmuladd we need to ensure we are dealing with the last addition parameter.

Something like this test case I think shouldn't be treated like an add reduction, due to the induction passing through the multiply operand of the fmuladd:

define float @fmuladd_strict(float* %a, float* %b, i64 %n) #0 {  
entry:
  br label %for.body

for.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
  %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
  %0 = load float, float* %arrayidx, align 4 
  %arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
  %1 = load float, float* %arrayidx2, align 4
  %muladd = tail call fast float @llvm.fmuladd.f32(float %0, float %sum.07, float %1)
  %iv.next = add nuw nsw i64 %iv, 1  
  %exitcond.not = icmp eq i64 %iv.next, %n
  br i1 %exitcond.not, label %for.end, label %for.body
 
for.end: 
  ret float %muladd
}
 
declare float @llvm.fmuladd.f32(float, float, float)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9614–9615

This may be simpler if it avoids the if: assert((!IsFMulAdd || RecurrenceDescriptor::isFMulAddIntrinsic(R)) && "...");

fhahn added inline comments.Nov 10 2021, 2:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9644

Is it possible this change got missed when uploading the diff? The comment is marked as done, but the code still uses SmallVector<> FMulOps.

RosieSumpter marked 2 inline comments as done.
  • Added a check that the reduction phi isn't one of the multiply operands of fmuladd to RecurrenceDescriptor::isRecurrenceInstr
  • Added a test case to strict-fadd.ll for the above
  • Used initializer list instead of SmallVector<> FMulOps when creating FMul VPInstruction
  • Simplified assert in LoopVectorizationPlanner::adjustRecipesForReductions

I think, unlike the other opcodes in a reduction chain, we may need to check that the operand number is correct. The other opcodes are commutative so it doesn't matter which of the operands the reduction passes through, but for fmuladd we need to ensure we are dealing with the last addition parameter.

Good point. The example you gave was caught for ordered reductions by the Exit->getOperand(2) != Phi check in checkOrderedReduction, but for fast reductions it was being vectorized. I've added a check to RecurrenceDescriptor::isRecurrenceInstr to make sure the reduction phi is only the last operand, and added the example as a test.

david-arm added inline comments.Nov 15 2021, 3:34 AM
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
397

Hi @RosieSumpter, I think this CHECK line should be:

; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]

and then lower down you should have

; CHECK-ORDERED: [[RDX3]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX2]], <vscale x 8 x float> [[FMUL3]])

This ensures that the return value from the final intrinsic call ends up as the incoming value for the PHI. At the moment you have [[RDX3:%.*]] in both cases so in theory they could be different.

477

I think we should be checking for [[RDX3:%.*]] here.

493

Again, you should be able to use [[RDX3]] = call ... here

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
954

Again, you should be able to use [[RDX3]] = call ... here

1013

[[RDX3:%.*]]

1029

[[RDX3]] = fadd ...

1100

Is it worth also having a negative test for the case when the PHI is both a mul operand and the add operand too?

  • Corrected tests
  • Added negative test case where reduction phi appears as 2 operands of llvm.fmuladd
RosieSumpter edited the summary of this revision. (Show Details)Nov 16 2021, 3:07 AM
RosieSumpter marked 7 inline comments as done.Nov 16 2021, 3:10 AM

These changes are to account for when there are multiple calls to fmuladd:

  • Added an fmuladd operand check to RecurrenceDescriptor::AddReductionVar
  • Removed the now redundant operand check from RecurrenceDescriptor::isRecurrenceInstr
  • Added test cases to strict-fadd.ll

Thanks. This LGTM.

(The AddReductionVar code I find to be a bit convoluted - it tries to do too much at once and is difficult to follow everything it does. As far as I can tell, this looks good).

dmgreen accepted this revision.Nov 23 2021, 6:45 AM
This revision is now accepted and ready to land.Nov 23 2021, 6:45 AM
  • Rebase + fixed conflict (added Exit->hasNUsesOrMore(3) check to checkOrderedReduction)
kmclaughlin accepted this revision.Nov 23 2021, 8:48 AM
mdchen added a subscriber: mdchen.Dec 13 2021, 6:58 PM

@RosieSumpter Thanks for the patch! I'm in question that whether llvm.fma.* should also be considered a valid candidate here. LangRef describes that

If a fused multiply-add is required, the corresponding llvm.fma intrinsic function should be used instead.

I'm not sure if this is a restriction that fma should not be expanded into mul+add by other passes which consider the transformation profitable, like LoopVectorize in this case.

craig.topper added a subscriber: craig.topper.EditedDec 13 2021, 7:51 PM

I suppose this will fix PR33338 and PR52266?

I suppose this will fix PR33338 and PR52266?

Hi @craig.topper, this patch does fix PR52266. For PR33338, there is a fast keyword on the fmuladd, which results in the InstCombine pass replacing the fmuladd with separate fmul and fadd operations, so the example gets vectorized with or without this patch. However, if the fast keyword is removed (so that the fmuladd is retained after InstCombine) and ordered reductions are specified using the -force-ordered-reductions flag, this patch allows the example to be vectorized.

@mdchen: This looks like a tricky question depending on your interpretation of the LangRef. There's this line in the semantics for llvm.fma:

When specified with the fast-math-flag ‘afn’, the result may be approximated using a less accurate calculation.

Which suggests, from an accuracy point of view, no other fast-math-flag is considered. Then it comes down to what "approximated using a less accurate" means when afn is specified. My reading is the underlying operation must be maintained (i.e. a fused multiply-add) but perhaps at less precision, for example rounding double operands to float. That's to say the precision between the multiply and add is still infinite even though the operands themselves can be rounded to something of lower precision.

To me this suggests it's never correct for LoopVectorize to split an llvm.fma into separate fmul and fadd operations.

Thanks for your reply!

There's this line in the semantics for llvm.fma:

When specified with the fast-math-flag ‘afn’, the result may be approximated using a less accurate calculation.

My reading is the underlying operation must be maintained (i.e. a fused multiply-add)

This is probably not the case, for example in D71706 pow is allowed to be transformed to sqrt if afn exits. But I agree it depends on the interpretation since the text now is opaque. Maybe we can raise this question up in the dev mailist? Thanks again!