Differential D111555: [LoopVectorize] Add vector reduction support for fmuladd intrinsic
Authored by RosieSumpter on Oct 11 2021, 8:46 AM.

Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Depends on D112548
Note: changes to the cost model for ordered reductions will follow in a separate patch.
Just a flyby review triggered by commenting on D111630, so my comments are more stylistic in nature rather than digging into the technical details.
I think you can remove the "strict" from the title and summary of this patch, if I'm understanding what strict means here. As far as I understand, it should enable vectorization for both strict (in-loop) and non-strict (out-of-loop/fast) reductions of llvm.fmuladd, which is nice.
Hi @dmgreen, thanks for pointing this out, you're right (and the CHECK-UNORDERED lines of the strict tests show the out-of-loop case). I've changed the title and summary.

I think, unlike the other opcodes in a reduction chain, we may need to check that the operand number is correct. The other opcodes are commutative, so it doesn't matter which of the operands the reduction passes through, but for fmuladd we need to ensure we are dealing with the last (addend) parameter. A test case like this, I think, shouldn't be treated as an add reduction, due to the reduction passing through a multiply operand of the fmuladd:

  define float @fmuladd_strict(float* %a, float* %b, i64 %n) #0 {
  entry:
    br label %for.body

  for.body:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
    %sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
    %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
    %0 = load float, float* %arrayidx, align 4
    %arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
    %1 = load float, float* %arrayidx2, align 4
    %muladd = tail call fast float @llvm.fmuladd.f32(float %0, float %sum.07, float %1)
    %iv.next = add nuw nsw i64 %iv, 1
    %exitcond.not = icmp eq i64 %iv.next, %n
    br i1 %exitcond.not, label %for.end, label %for.body

  for.end:
    ret float %muladd
  }

  declare float @llvm.fmuladd.f32(float, float, float)
Good point. The example you gave was caught for ordered reductions by the Exit->getOperand(2) != Phi check in checkOrderedReduction, but for fast reductions it was being vectorized. I've added a check to RecurrenceDescriptor::isRecurrenceInstr to make sure the reduction phi is only used as the last operand, and added the example as a test.
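For illustration, the shape of the operand check being described is roughly as follows; this is a sketch only, and the helper name is not from the patch:

  // Sketch only, not the patch's actual code. For llvm.fmuladd(%a, %b, %c),
  // which computes %a * %b + %c, a reduction phi may only feed the addend
  // operand (operand 2); if it feeds one of the multiply operands, the call
  // must not be treated as part of an add reduction.
  #include <cassert>
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Intrinsics.h"

  using namespace llvm;

  static bool phiOnlyFeedsFMulAddAddend(const IntrinsicInst *II,
                                        const PHINode *Phi) {
    assert(II->getIntrinsicID() == Intrinsic::fmuladd && "expected fmuladd");
    return II->getArgOperand(2) == Phi && II->getArgOperand(0) != Phi &&
           II->getArgOperand(1) != Phi;
  }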
These changes account for the case where there are multiple calls to fmuladd.
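As a hypothetical illustration of that scenario (the helper below is not from the patch), a chain of fmuladd calls is only a valid reduction chain if each call's addend operand is the previous link, ending at the reduction phi:

  // Hypothetical sketch, not the patch's code: walk backwards through a
  // chain of fmuladd calls, requiring each call's addend (operand 2) to be
  // the previous link, until the reduction phi is reached.
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Intrinsics.h"

  using namespace llvm;

  static bool isFMulAddChain(Value *ChainEnd, PHINode *RdxPhi) {
    Value *Cur = ChainEnd;
    while (Cur != RdxPhi) {
      auto *II = dyn_cast<IntrinsicInst>(Cur);
      if (!II || II->getIntrinsicID() != Intrinsic::fmuladd)
        return false;
      Cur = II->getArgOperand(2); // Step to the addend, the previous link.
    }
    return true;
  }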
Thanks. This LGTM. (I find the AddReductionVar code to be a bit convoluted - it tries to do too much at once, and it is difficult to follow everything it does - but as far as I can tell, this looks good.)
@RosieSumpter Thanks for the patch! I'm wondering whether llvm.fma.* should also be considered a valid candidate here, given how the LangRef describes it. I'm not sure if this is a restriction meaning that fma should not be expanded into mul+add by other passes which consider the transformation profitable, like LoopVectorize in this case.

Hi @craig.topper, this patch does fix PR52266. For PR33338, there is a fast keyword on the fmuladd, which results in the InstCombine pass replacing the fmuladd with separate fmul and fadd operations, so the example gets vectorized with or without this patch. However, if the fast keyword is removed (so that the fmuladd is retained after InstCombine) and ordered reductions are specified using the -force-ordered-reductions flag, this patch allows the example to be vectorized.

@mdchen: This looks like a tricky question depending on your interpretation of the LangRef. There's this line in the semantics for llvm.fma: "When specified with the fast-math-flag ‘afn’, the result may be approximated using a less accurate calculation." This suggests that, from an accuracy point of view, no other fast-math flag is considered. Then it comes down to what "approximated using a less accurate calculation" means when afn is specified. My reading is that the underlying operation must be maintained (i.e. a fused multiply-add) but perhaps at less precision, for example rounding double operands to float. That's to say the precision between the multiply and add is still infinite even though the operands themselves can be rounded to something of lower precision. To me this suggests it's never correct for LoopVectorize to split an llvm.fma into separate fmul and fadd operations.

Thanks for your reply! This is probably not the case though; for example, in D71706 pow is allowed to be transformed to sqrt if afn exists. But I agree it depends on the interpretation, since the current text is opaque. Maybe we can raise this question on the dev mailing list? Thanks again!
I feel like isa<IntrinsicInst>(I) && cast<IntrinsicInst>(I)->getIntrinsicID() == Intrinsic::fmuladd would be a cheaper way to get the same result?
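As a concrete sketch of that suggestion (the helper name is illustrative; the patch may structure this differently):

  // Sketch of the cheaper check suggested above: detect a call to the
  // llvm.fmuladd intrinsic with isa/cast rather than pattern matching.
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Intrinsics.h"

  using namespace llvm;

  static bool isFMulAddIntrinsic(const Instruction *I) {
    if (const auto *II = dyn_cast<IntrinsicInst>(I))
      return II->getIntrinsicID() == Intrinsic::fmuladd;
    return false;
  }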