This is an archive of the discontinued LLVM Phabricator instance.

Added FMulAdd as an accepted recurrence kind to AArch64TTIImpl::isLegalToVectorizeReduction so that scalable vectorization is enabled for llvm.fmuladd
Updated the scalable-strict-fadd.ll test.

Harbormaster completed remote builds in B129064: Diff 379998. Oct 15 2021, 8:27 AM

Just a flyby review triggered by commenting on D111630 so my comments are more stylistic in nature rather than digging into the technical details.

llvm/include/llvm/Analysis/IVDescriptors.h
267–268	I feel like `isa<IntrinsicInst>(I) && cast<IntrinsicInst>(I)->getIntrinsicID() == Intrinsic::fmuladd` would be a cheaper way to get the same result?
llvm/lib/Analysis/IVDescriptors.cpp
201–211	Just a suggestion but given this function no longer has a single instruction to care about perhaps it's worth being more explicit. For example: if (Kind == RecurKind::FAdd && Exit->getOpcode() != Instruction::FAdd) return false; if (Kind == RecurKind::FMulAdd && RecurrenceDescriptor::isFMulAddIntrinsic(Exit)) return false if (Exit != ExactFPMathInst) return false;
215–220	Much like my previous suggestion what about: if (Kind == RecurKind::FAdd && Op0 != Phi && Op1 != Phi) return false; if (Kind == RecurKind::FMulAdd && Exit->getOperand(2) != Phi) return false;
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9985	I'm assuming this change will remain live after this point and thus affect nodes created after exiting this function? Which would be bad. In any case I think that instead of calling `CreateBinOp` you can instead call `CreateFMulFMF` which will propagate the FMF flags for you.

paulwalker-arm added inline comments. Oct 15 2021, 10:25 AM

llvm/lib/Analysis/IVDescriptors.cpp

201–211

Or rather:

if (Kind == RecurKind::FAdd && Exit->getOpcode() != Instruction::FAdd)
  return false;
if (Kind == RecurKind::FMulAdd && !RecurrenceDescriptor::isFMulAddIntrinsic(Exit))
  return false;
if (Exit != ExactFPMathInst)
  return false;

dmgreen added inline comments. Oct 17 2021, 4:17 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9805	Would it be possible to create a FMul VPInstruction and a VPReductionRecipe? That way the VPlan better represents the final instructions.

fhahn added inline comments. Oct 18 2021, 2:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9805	+1, that should hopefully help to remove some of the special handling for the `FMulAdd` from codegen.
9984–9985	This is a nice cleanup and could be split off as simple NFC.

RosieSumpter marked 4 inline comments as done. Oct 26 2021, 8:31 AM

RosieSumpter added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9805	Hi @dmgreen, thanks for the suggestion. Would you (or @fhahn) mind elaborating a bit on what you would expect this to look like? I see the point about wanting the FMul instruction to be present in the VPlan, but having spoken to @david-arm about it, it seems this might mean the VPReductionRecipe would have two underlying instructions. Is this what you would expect? Any pointers you have would be very useful!
9984–9985	I've added an NFC patch for this here D112547
9985	Hi @paulwalker-arm thanks for the comments. It turns out this exposed a problem with fast-math flags not being propagated for ordered reductions, so I've added a patch for that here D112548

Rewritten IsFMulAddIntrinsic
Made conditions more explicit in checkOrderedReduction
Splits out NFC and fast-math flags changes so this patch now builds on D112547 and D112548

Harbormaster completed remote builds in B130720: Diff 382338. Oct 26 2021, 8:32 AM

dmgreen added inline comments. Oct 28 2021, 12:29 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9805	I think there would be two VPRecipes, a VPReductionRecipe and a VPInstruction representing the fmul. That way the VPReductionRecipe is not trying to represent both, and it should remove some of the extra complexity from VPReductionRecipe because it won't need multiple vector inputs. The fmul VPInstruction creates the final fmul instruction, which will have the two inputs.
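The split described above can be pictured with a small standalone C++ model. This is only a sketch, not LLVM code: `VF`, `roundedMul`, `scalarMulAddReduction` and `splitMulAddReduction` are hypothetical names, and the model assumes the unfused form of the multiply-add. It illustrates that a separate multiply step followed by an in-order add reduction performs the same sequence of rounded operations as the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for this illustration.
constexpr std::size_t VF = 4;

// Stores the product to a volatile float so that it is rounded before use and
// the compiler cannot contract the multiply and the following add into one
// fused operation.
inline float roundedMul(float A, float B) {
  volatile float P = A * B;
  return P;
}

// Scalar loop: Acc = Acc + A[i] * B[i], unfused and strictly in order.
float scalarMulAddReduction(const std::vector<float> &A,
                            const std::vector<float> &B, float Start) {
  assert(A.size() == B.size() && "inputs must have the same length");
  float Acc = Start;
  for (std::size_t I = 0; I < A.size(); ++I)
    Acc = Acc + roundedMul(A[I], B[I]);
  return Acc;
}

// Model of the vector loop: a separate multiply step produces VF products,
// then an ordered reduction adds them to the accumulator lane by lane.
float splitMulAddReduction(const std::vector<float> &A,
                           const std::vector<float> &B, float Start) {
  assert(A.size() == B.size() && "inputs must have the same length");
  float Acc = Start;
  std::size_t I = 0;
  for (; I + VF <= A.size(); I += VF) {
    float Prod[VF];
    for (std::size_t L = 0; L < VF; ++L) // the "fmul" step
      Prod[L] = roundedMul(A[I + L], B[I + L]);
    for (std::size_t L = 0; L < VF; ++L) // the ordered "fadd" reduction
      Acc = Acc + Prod[L];
  }
  for (; I < A.size(); ++I) // scalar remainder
    Acc = Acc + roundedMul(A[I], B[I]);
  return Acc;
}
```

Because both functions execute the same rounded multiplies and the same additions in the same order, they return identical values for any pair of equally sized inputs.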

RosieSumpter edited the summary of this revision. Oct 28 2021, 1:35 AM

RosieSumpter added a parent revision: D112548: [LoopVectorize] Propagate fast-math flags for inloop reductions.

RosieSumpter added a child revision: D111630: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic.

Create a VPInstruction to represent the FMul
Removed changes to VPReductionRecipe
In order to propagate fast-math flags, added LoopVectorizationPlanner as a friend class to VPInstruction (so that setUnderlyingInstr() can be used) and added a method hasUnderlyingInstr()

david-arm added inline comments. Nov 1 2021, 2:49 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9809	Hi @RosieSumpter, I realise you were asked to make this change, but it also doesn't feel right to be setting an underlying instruction here because there isn't one really. The underlying instruction is the fmuladd call and is already added to the VPReductionRecipe, so adding to two recipes feels a bit dangerous? Perhaps we should be adding a new interface to VPInstruction instead that allows setting the 'FastMathFlags' for the instruction?

fhahn added inline comments. Nov 1 2021, 2:56 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9809	Agreed, we shouldn't set the underlying instruction explicitly here; visibility is intentionally restricted. I don't think the FMF changes should be pulled into this change. Can the setting of FMFs be moved to a follow-up patch?

Harbormaster completed remote builds in B131707: Diff 383740. Nov 1 2021, 3:12 AM

Added a setFastMathFlags method to VPInstruction instead of making LoopVectorizationPlanner a friend

RosieSumpter added inline comments. Nov 2 2021, 2:37 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9809	Hi @fhahn, thanks for the comment. I have instead added a `setFastMathFlags` method to the `VPInstruction` class. At the moment I've left it as part of this patch as, after discussion with @david-arm, it seems that it may not be ideal to split out this change given that this would mean submitting code that requires a fix. Also, it doesn't look like VPInstruction has been used for FP operations elsewhere, so currently this change is only used in the case of fmuladd being used. If you do still think it would be better as a follow-up patch though I'm happy to do that instead.

Harbormaster completed remote builds in B131914: Diff 384014. Nov 2 2021, 3:16 AM

fhahn added inline comments. Nov 3 2021, 1:36 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9808	You should be able to use an initialiser list here `{VecOp, Plan->getVPValue(R->getOperand(1))}`.
9809	Could you elaborate on what you mean by 'requires a fix'? The code would still be correct without FMFs, just not as optimal as it could be, right? It's quite common to split out changes that are not directly related into separate patches, as this makes it easier to review them individually rather than further extending an already big patch.

david-arm added inline comments. Nov 3 2021, 2:13 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9809	Hi @fhahn, @RosieSumpter has split out other patches before so I think she is aware why we sometimes split them up. It was actually my suggestion that we might want to keep this in the same patch. I think it depends upon your perspective about whether adding the flags is a "nice to have" or "just broken". Personally to me it feels like the latter because we're not propagating user-specified flags and getting the requested behaviour. If it has to be split out that's fine, but it does feel a bit counter-intuitive.

I think you can remove the "strict" from the title and summary of this patch, if I'm understanding what strict means here. As far as I understand it should enable vectorization for strict (inloop) and non-strict (out of loop/fast) reductions of llvm.fmuladd, which is nice.
https://godbolt.org/z/xTe89GEfM
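The distinction between strict (in-loop, ordered) and non-strict (out-of-loop) reductions matters because floating-point addition is not associative. The sketch below is a standalone illustration with hypothetical names (`orderedSum`, `twoLaneSum`); it models an out-of-loop reduction with a vector factor of 2 by keeping one partial sum per lane and combining the lanes after the loop. It assumes plain IEEE single-precision evaluation with no fast-math options.

```cpp
#include <cstddef>
#include <vector>

// In-order sum: the association of the additions matches the scalar loop.
float orderedSum(const std::vector<float> &V) {
  float Acc = 0.0f;
  for (float X : V)
    Acc = Acc + X;
  return Acc;
}

// Model of an out-of-loop reduction with two lanes: each lane accumulates
// every second element, and the lanes are combined once after the loop.
// This reassociates the additions, which is only allowed for fast reductions.
float twoLaneSum(const std::vector<float> &V) {
  float Lane[2] = {0.0f, 0.0f};
  for (std::size_t I = 0; I < V.size(); ++I)
    Lane[I % 2] = Lane[I % 2] + V[I];
  return Lane[0] + Lane[1];
}
```

For an input such as `{1.0e8f, 1.0f, -1.0e8f, 1.0f}` the two functions disagree, because in the ordered sum the first `1.0f` is absorbed by the large value before that value is cancelled.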

llvm/lib/Analysis/IVDescriptors.cpp
1061	Is this just returning true because it believes the only instruction found in a fmuladd reduction chain will be a llvm.fmuladd? Could that change in the future if it was able to recognize fadd and fmuladd as single reduction sequence? Would it be better to check the instruction is isFMulAddIntrinsic, to be on the safe side?

RosieSumpter added a child revision: D113125: [LoopVectorize] Propagate fast-math flags for VPInstruction. Nov 3 2021, 10:27 AM

Moved setFastMathFlags change to a follow-up patch D113125
Used isFMulAddIntrinsic in place of RecurKind::FMulAdd for safety
Used an initializer list for the fmul operands

In D111555#3105451, @dmgreen wrote:

I think you can remove the "strict" from the title and summary of this patch, if I'm understanding what strict means here. As far as I understand it should enable vectorization for strict (inloop) and non-strict (out of loop/fast) reductions of llvm.fmuladd, which is nice.

Hi @dmgreen, thanks for pointing this out, you're right (and the CHECK-UNORDERED lines of the strict tests show the out-of-loop case). I've changed the title and summary.

Harbormaster completed remote builds in B132270: Diff 384506. Nov 3 2021, 11:03 AM

I think, unlike the other opcodes in a reduction chain, we may need to check that the operand number is correct. The other opcodes are commutative so it doesn't matter which of the operands the reduction passes through, but for fmuladd we need to ensure we are dealing with the last addition parameter.

Something like this test case, I think, shouldn't be treated like an add reduction, because the reduction phi passes through a multiply operand of the fmuladd:

define float @fmuladd_strict(float* %a, float* %b, i64 %n) #0 {  
entry:
  br label %for.body

for.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
  %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
  %0 = load float, float* %arrayidx, align 4 
  %arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
  %1 = load float, float* %arrayidx2, align 4
  %muladd = tail call fast float @llvm.fmuladd.f32(float %0, float %sum.07, float %1)
  %iv.next = add nuw nsw i64 %iv, 1  
  %exitcond.not = icmp eq i64 %iv.next, %n
  br i1 %exitcond.not, label %for.end, label %for.body
 
for.end: 
  ret float %muladd
}
 
declare float @llvm.fmuladd.f32(float, float, float)
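To see why the operand position matters, here is a small standalone sketch. It is not LLVM code: `accumulatorInAddend` and `accumulatorInMultiplicand` are hypothetical names, and `std::fma` merely stands in for `llvm.fmuladd` (which may or may not be fused). Only the first recurrence computes a sum of products.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The accumulator is the addend (third operand): Acc = A[i] * B[i] + Acc.
// This is a sum of products and can be treated as an add reduction.
float accumulatorInAddend(const std::vector<float> &A,
                          const std::vector<float> &B, float Start) {
  assert(A.size() == B.size() && "inputs must have the same length");
  float Acc = Start;
  for (std::size_t I = 0; I < A.size(); ++I)
    Acc = std::fma(A[I], B[I], Acc);
  return Acc;
}

// The accumulator is a multiplicand (second operand): Acc = A[i] * Acc + B[i],
// as in the IR above. Each step scales the running value, so this is not a sum.
float accumulatorInMultiplicand(const std::vector<float> &A,
                                const std::vector<float> &B, float Start) {
  assert(A.size() == B.size() && "inputs must have the same length");
  float Acc = Start;
  for (std::size_t I = 0; I < A.size(); ++I)
    Acc = std::fma(A[I], Acc, B[I]);
  return Acc;
}
```

With `A = {1, 3}`, `B = {2, 4}` and a start value of 0, the first function computes 1*2 + 3*4 while the second computes 3*(1*0 + 2) + 4, so the two recurrences are not interchangeable.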

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9780–9781	This may be simpler if it avoids the if: `assert((!IsFMulAdd || RecurrenceDescriptor::isFMulAddIntrinsic(R)) && "...");`

fhahn added inline comments. Nov 10 2021, 2:33 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9808	Is it possible this change got missed when uploading the diff? The comment is marked as done, but the code still uses `SmallVector<> FMulOps`.

Added a check that the reduction phi isn't one of the multiply operands of fmuladd to RecurrenceDescriptor::isRecurrenceInstr
Added a test case to strict-fadd.ll for the above
Used initializer list instead of SmallVector<> FMulOps when creating FMul VPInstruction
Simplified assert in LoopVectorizationPlanner::adjustRecipesForReductions

In D111555#3117640, @dmgreen wrote:

I think, unlike the other opcodes in a reduction chain, we may need to check that the operand number is correct. The other opcodes are commutative so it doesn't matter which of the operands the reduction passes through, but for fmuladd we need to ensure we are dealing with the last addition parameter.

Good point. The example you gave was caught for ordered reductions by the Exit->getOperand(2) != Phi check in checkOrderedReduction, but for fast reductions it was being vectorized. I've added a check to RecurrenceDescriptor::isRecurrenceInstr to make sure the reduction phi is only the last operand, and added the example as a test.

Harbormaster completed remote builds in B133892: Diff 386758. Nov 12 2021, 2:27 AM

david-arm added inline comments. Nov 15 2021, 3:34 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
397	Hi @RosieSumpter, I think this CHECK line should be: `; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]` and then lower down you should have `; CHECK-ORDERED: [[RDX3]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX2]], <vscale x 8 x float> [[FMUL3]])`. This ensures that the return value from the final intrinsic call ends up as the incoming value for the PHI. At the moment you have `[[RDX3:%.*]]` in both cases, so in theory they could be different.
477	I think we should be checking for `[[RDX3:%.*]]` here.
493	Again, you should be able to use `[[RDX3]] = call ...` here
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
954	Again, you should be able to use `[[RDX3]] = call ...` here
1013	`[[RDX3:%.*]]`
1029	`[[RDX3]] = fadd ...`
1100	Is it worth also having a negative test for the case when the PHI is both a mul operand and the add operand too?

Corrected tests
Added negative test case where reduction phi appears as 2 operands of llvm.fmuladd

RosieSumpter edited the summary of this revision. Nov 16 2021, 3:07 AM

RosieSumpter marked 7 inline comments as done. Nov 16 2021, 3:10 AM

Harbormaster completed remote builds in B134465: Diff 387548. Nov 16 2021, 3:44 AM

peterwaller-arm added a subscriber: peterwaller-arm. Nov 17 2021, 3:27 AM

These changes are to account for when there are multiple calls to fmuladd:

Added an fmuladd operand check to RecurrenceDescriptor::AddReductionVar
Removed the now redundant operand check from RecurrenceDescriptor::isRecurrenceInstr
Added test cases to strict-fadd.ll

Harbormaster completed remote builds in B135604: Diff 389160. Nov 23 2021, 4:14 AM

Thanks. This LGTM.

(The AddReductionVar code I find to be a bit convoluted - it tries to do too much at once and is difficult to follow everything it does. As far as I can tell, this looks good).

dmgreen accepted this revision. Nov 23 2021, 6:45 AM

This revision is now accepted and ready to land. Nov 23 2021, 6:45 AM

Rebase + fixed conflict (added Exit->hasNUsesOrMore(3) check to checkOrderedReduction)

kmclaughlin accepted this revision. Nov 23 2021, 8:48 AM

Harbormaster completed remote builds in B135645: Diff 389202. Nov 23 2021, 1:40 PM

Closed by commit rGc2441b6b89bf: [LoopVectorize] Add vector reduction support for fmuladd intrinsic (authored by RosieSumpter). Nov 24 2021, 12:59 AM

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rGc2441b6b89bf: [LoopVectorize] Add vector reduction support for fmuladd intrinsic.

@RosieSumpter Thanks for the patch! I wonder whether llvm.fma.* should also be considered a valid candidate here. The LangRef says:

If a fused multiply-add is required, the corresponding llvm.fma intrinsic function should be used instead.

I'm not sure whether this means that llvm.fma must not be expanded into mul+add by passes that consider the transformation profitable, such as LoopVectorize in this case.

I suppose this will fix PR33338 and PR52266?

In D111555#3191141, @craig.topper wrote:

I suppose this will fix PR33338 and PR52266?

Hi @craig.topper, this patch does fix PR52266. For PR33338, there is a fast keyword on the fmuladd, which results in the InstCombine pass replacing the fmuladd with separate fmul and fadd operations, so the example gets vectorized with or without this patch. However, if the fast keyword is removed (so that the fmuladd is retained after InstCombine) and ordered reductions are specified using the -force-ordered-reductions flag, this patch allows the example to be vectorized.

@RosieSumpter gentle ping :)

@mdchen: This looks like a tricky question depending on your interpretation of the LangRef. There's this line in the semantics for llvm.fma:

When specified with the fast-math-flag ‘afn’, the result may be approximated using a less accurate calculation.

Which suggests, from an accuracy point of view, no other fast-math-flag is considered. Then it comes down to what "approximated using a less accurate" means when afn is specified. My reading is the underlying operation must be maintained (i.e. a fused multiply-add) but perhaps at less precision, for example rounding double operands to float. That's to say the precision between the multiply and add is still infinite even though the operands themselves can be rounded to something of lower precision.

To me this suggests it's never correct for LoopVectorize to split an llvm.fma into separate fmul and fadd operations.
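The numerical difference at stake can be shown with a short standalone sketch. The names `unfusedMulAdd` and `fusedMulAdd` are hypothetical; the code relies only on `std::fma` from `<cmath>`, which computes the product and sum with a single rounding, and on a volatile temporary to force the separate multiply to round first.

```cpp
#include <cmath>

// Separate multiply and add: the product is rounded to double before the add.
// The volatile temporary prevents the compiler from contracting the two
// operations into a fused multiply-add.
double unfusedMulAdd(double A, double B, double C) {
  volatile double P = A * B;
  return P + C;
}

// Fused multiply-add: A * B + C is computed exactly and rounded once.
double fusedMulAdd(double A, double B, double C) {
  return std::fma(A, B, C);
}
```

Taking `A = B = 1 + 2^-27` and `C = -(1 + 2^-26)`, the exact product is `1 + 2^-26 + 2^-54`. The separate multiply rounds away the `2^-54` term, so the unfused result is zero, while the fused result keeps it. Splitting an operation that is required to be fused can therefore change the value produced.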

Thanks for your reply!

There's this line in the semantics for llvm.fma:
When specified with the fast-math-flag ‘afn’, the result may be approximated using a less accurate calculation.
My reading is the underlying operation must be maintained (i.e. a fused multiply-add)

This is probably not the case; for example, in D71706 pow is allowed to be transformed to sqrt if afn exists. But I agree it depends on the interpretation, since the current text is ambiguous. Maybe we can raise this question on the dev mailing list? Thanks again!

Revision Contents

Path	Size
llvm/include/llvm/Analysis/IVDescriptors.h	8 lines
llvm/lib/Analysis/IVDescriptors.cpp	46 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	1 line
llvm/lib/Transforms/Utils/LoopUtils.cpp	4 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	22 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	2 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll	162 lines
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll	342 lines
llvm/test/Transforms/LoopVectorize/reduction-inloop.ll	67 lines

Diff 389416

llvm/include/llvm/Analysis/IVDescriptors.h

Show All 14 Lines

#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/IR/InstrTypes.h"		#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instruction.h"		#include "llvm/IR/Instruction.h"
		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Operator.h"		#include "llvm/IR/Operator.h"
#include "llvm/IR/ValueHandle.h"		#include "llvm/IR/ValueHandle.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"

namespace llvm {		namespace llvm {

class DemandedBits;		class DemandedBits;
class AssumptionCache;		class AssumptionCache;
Show All 14 Lines	enum class RecurKind {
SMin, ///< Signed integer min implemented in terms of select(cmp()).		SMin, ///< Signed integer min implemented in terms of select(cmp()).
SMax, ///< Signed integer max implemented in terms of select(cmp()).		SMax, ///< Signed integer max implemented in terms of select(cmp()).
UMin, ///< Unisgned integer min implemented in terms of select(cmp()).		UMin, ///< Unisgned integer min implemented in terms of select(cmp()).
UMax, ///< Unsigned integer max implemented in terms of select(cmp()).		UMax, ///< Unsigned integer max implemented in terms of select(cmp()).
FAdd, ///< Sum of floats.		FAdd, ///< Sum of floats.
FMul, ///< Product of floats.		FMul, ///< Product of floats.
FMin, ///< FP min implemented in terms of select(cmp()).		FMin, ///< FP min implemented in terms of select(cmp()).
FMax, ///< FP max implemented in terms of select(cmp()).		FMax, ///< FP max implemented in terms of select(cmp()).
		FMulAdd, ///< Fused multiply-add of floats (a * b + c).
SelectICmp, ///< Integer select(icmp(),x,y) where one of (x,y) is loop		SelectICmp, ///< Integer select(icmp(),x,y) where one of (x,y) is loop
///< invariant		///< invariant
SelectFCmp ///< Integer select(fcmp(),x,y) where one of (x,y) is loop		SelectFCmp ///< Integer select(fcmp(),x,y) where one of (x,y) is loop
///< invariant		///< invariant
};		};

/// The RecurrenceDescriptor is used to identify recurrences variables in a		/// The RecurrenceDescriptor is used to identify recurrences variables in a
/// loop. Reduction is a special case of recurrence that has uses of the		/// loop. Reduction is a special case of recurrence that has uses of the
▲ Show 20 Lines • Show All 194 Lines • ▼ Show 20 Lines	public:
/// Expose an ordered FP reduction to the instance users.		/// Expose an ordered FP reduction to the instance users.
bool isOrdered() const { return IsOrdered; }		bool isOrdered() const { return IsOrdered; }

/// Attempts to find a chain of operations from Phi to LoopExitInst that can		/// Attempts to find a chain of operations from Phi to LoopExitInst that can
/// be treated as a set of reductions instructions for in-loop reductions.		/// be treated as a set of reductions instructions for in-loop reductions.
SmallVector<Instruction *, 4> getReductionOpChain(PHINode *Phi,		SmallVector<Instruction *, 4> getReductionOpChain(PHINode *Phi,
Loop *L) const;		Loop *L) const;

		/// Returns true if the instruction is a call to the llvm.fmuladd intrinsic.
		static bool isFMulAddIntrinsic(Instruction *I) {
		return isa<IntrinsicInst>(I) &&
		cast<IntrinsicInst>(I)->getIntrinsicID() == Intrinsic::fmuladd;
		}

private:		private:
// The starting value of the recurrence.		// The starting value of the recurrence.
// It does not have to be zero!		// It does not have to be zero!
TrackingVH<Value> StartValue;		TrackingVH<Value> StartValue;
// The instruction who's value is used outside the loop.		// The instruction who's value is used outside the loop.
Instruction *LoopExitInstr = nullptr;		Instruction *LoopExitInstr = nullptr;
// The kind of the recurrence.		// The kind of the recurrence.
RecurKind Kind = RecurKind::None;		RecurKind Kind = RecurKind::None;
▲ Show 20 Lines • Show All 120 Lines • Show Last 20 Lines

llvm/lib/Analysis/IVDescriptors.cpp

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines
bool RecurrenceDescriptor::isArithmeticRecurrenceKind(RecurKind Kind) {		bool RecurrenceDescriptor::isArithmeticRecurrenceKind(RecurKind Kind) {
switch (Kind) {		switch (Kind) {
default:		default:
break;		break;
case RecurKind::Add:		case RecurKind::Add:
case RecurKind::Mul:		case RecurKind::Mul:
case RecurKind::FAdd:		case RecurKind::FAdd:
case RecurKind::FMul:		case RecurKind::FMul:
		case RecurKind::FMulAdd:
return true;		return true;
}		}
return false;		return false;
}		}

/// Determines if Phi may have been type-promoted. If Phi has a single user		/// Determines if Phi may have been type-promoted. If Phi has a single user
/// that ANDs the Phi with a type mask, return the user. RT is updated to		/// that ANDs the Phi with a type mask, return the user. RT is updated to
/// account for the narrower bit width represented by the mask, and the AND		/// account for the narrower bit width represented by the mask, and the AND
▲ Show 20 Lines • Show All 97 Lines • ▼ Show 20 Lines	for (Value *O : cast<User>(Val)->operands())
Worklist.push_back(I);		Worklist.push_back(I);
}		}
}		}

// Check if a given Phi node can be recognized as an ordered reduction for		// Check if a given Phi node can be recognized as an ordered reduction for
// vectorizing floating point operations without unsafe math.		// vectorizing floating point operations without unsafe math.
static bool checkOrderedReduction(RecurKind Kind, Instruction *ExactFPMathInst,		static bool checkOrderedReduction(RecurKind Kind, Instruction *ExactFPMathInst,
Instruction *Exit, PHINode *Phi) {		Instruction *Exit, PHINode *Phi) {
// Currently only FAdd is supported		// Currently only FAdd and FMulAdd are supported.
if (Kind != RecurKind::FAdd)		if (Kind != RecurKind::FAdd && Kind != RecurKind::FMulAdd)
return false;		return false;

// Ensure the exit instruction is an FAdd, and that it only has one user		if (Kind == RecurKind::FAdd && Exit->getOpcode() != Instruction::FAdd)
// other than the reduction PHI		return false;
if (Exit->getOpcode() != Instruction::FAdd || Exit->hasNUsesOrMore(3) ||
Exit != ExactFPMathInst)		if (Kind == RecurKind::FMulAdd &&
		!RecurrenceDescriptor::isFMulAddIntrinsic(Exit))
		return false;

		// Ensure the exit instruction has only one user other than the reduction PHI
		if (Exit != ExactFPMathInst || Exit->hasNUsesOrMore(3))
return false;		return false;

// The only pattern accepted is the one in which the reduction PHI		// The only pattern accepted is the one in which the reduction PHI
// is used as one of the operands of the exit instruction		// is used as one of the operands of the exit instruction
auto *LHS = Exit->getOperand(0);		auto *Op0 = Exit->getOperand(0);
auto *RHS = Exit->getOperand(1);		auto *Op1 = Exit->getOperand(1);
if (LHS != Phi && RHS != Phi)		if (Kind == RecurKind::FAdd && Op0 != Phi && Op1 != Phi)
		return false;
		if (Kind == RecurKind::FMulAdd && Exit->getOperand(2) != Phi)
return false;		return false;

LLVM_DEBUG(dbgs() << "LV: Found an ordered reduction: Phi: " << *Phi		LLVM_DEBUG(dbgs() << "LV: Found an ordered reduction: Phi: " << *Phi
<< ", ExitInst: " << *Exit << "\n");		<< ", ExitInst: " << *Exit << "\n");

return true;		return true;
}		}

bool RecurrenceDescriptor::AddReductionVar(PHINode *Phi, RecurKind Kind,		bool RecurrenceDescriptor::AddReductionVar(PHINode *Phi, RecurKind Kind,
▲ Show 20 Lines • Show All 163 Lines • ▼ Show 20 Lines	while (!Worklist.empty()) {
// Process users of current instruction. Push non-PHI nodes after PHI nodes		// Process users of current instruction. Push non-PHI nodes after PHI nodes
// onto the stack. This way we are going to have seen all inputs to PHI		// onto the stack. This way we are going to have seen all inputs to PHI
// nodes once we get to them.		// nodes once we get to them.
SmallVector<Instruction *, 8> NonPHIs;		SmallVector<Instruction *, 8> NonPHIs;
SmallVector<Instruction *, 8> PHIs;		SmallVector<Instruction *, 8> PHIs;
for (User *U : Cur->users()) {		for (User *U : Cur->users()) {
Instruction *UI = cast<Instruction>(U);		Instruction *UI = cast<Instruction>(U);

		// If the user is a call to llvm.fmuladd then the instruction can only be
		// the final operand.
		if (isFMulAddIntrinsic(UI))
		if (Cur == UI->getOperand(0) || Cur == UI->getOperand(1))
		return false;

// Check if we found the exit user.		// Check if we found the exit user.
BasicBlock *Parent = UI->getParent();		BasicBlock *Parent = UI->getParent();
if (!TheLoop->contains(Parent)) {		if (!TheLoop->contains(Parent)) {
// If we already know this instruction is used externally, move on to		// If we already know this instruction is used externally, move on to
// the next user.		// the next user.
if (ExitInstruction == Cur)		if (ExitInstruction == Cur)
continue;		continue;

▲ Show 20 Lines • Show All 305 Lines • ▼ Show 20 Lines	case Instruction::Call:
if (isSelectCmpRecurrenceKind(Kind))		if (isSelectCmpRecurrenceKind(Kind))
return isSelectCmpPattern(L, OrigPhi, I, Prev);		return isSelectCmpPattern(L, OrigPhi, I, Prev);
if (isIntMinMaxRecurrenceKind(Kind) ||		if (isIntMinMaxRecurrenceKind(Kind) ||
(((FuncFMF.noNaNs() && FuncFMF.noSignedZeros()) ||		(((FuncFMF.noNaNs() && FuncFMF.noSignedZeros()) ||
(isa<FPMathOperator>(I) && I->hasNoNaNs() &&		(isa<FPMathOperator>(I) && I->hasNoNaNs() &&
I->hasNoSignedZeros())) &&		I->hasNoSignedZeros())) &&
isFPMinMaxRecurrenceKind(Kind)))		isFPMinMaxRecurrenceKind(Kind)))
return isMinMaxPattern(I, Kind, Prev);		return isMinMaxPattern(I, Kind, Prev);
		else if (isFMulAddIntrinsic(I))
		return InstDesc(Kind == RecurKind::FMulAdd, I,
		I->hasAllowReassoc() ? nullptr : I);
return InstDesc(false, I);		return InstDesc(false, I);
}		}
}		}

bool RecurrenceDescriptor::hasMultipleUsesOf(		bool RecurrenceDescriptor::hasMultipleUsesOf(
Instruction *I, SmallPtrSetImpl<Instruction *> &Insts,		Instruction *I, SmallPtrSetImpl<Instruction *> &Insts,
unsigned MaxNumUses) {		unsigned MaxNumUses) {
unsigned NumUses = 0;		unsigned NumUses = 0;
▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines	if (AddReductionVar(Phi, RecurKind::FMin, TheLoop, FMF, RedDes, DB, AC, DT)) {
return true;		return true;
}		}
if (AddReductionVar(Phi, RecurKind::SelectFCmp, TheLoop, FMF, RedDes, DB, AC,		if (AddReductionVar(Phi, RecurKind::SelectFCmp, TheLoop, FMF, RedDes, DB, AC,
DT)) {		DT)) {
LLVM_DEBUG(dbgs() << "Found a float conditional select reduction PHI."		LLVM_DEBUG(dbgs() << "Found a float conditional select reduction PHI."
<< " PHI." << *Phi << "\n");		<< " PHI." << *Phi << "\n");
return true;		return true;
}		}
		if (AddReductionVar(Phi, RecurKind::FMulAdd, TheLoop, FMF, RedDes, DB, AC,
		DT)) {
		LLVM_DEBUG(dbgs() << "Found an FMulAdd reduction PHI." << *Phi << "\n");
		return true;
		}
// Not a reduction of known type.		// Not a reduction of known type.
return false;		return false;
}		}

bool RecurrenceDescriptor::isFirstOrderRecurrence(		bool RecurrenceDescriptor::isFirstOrderRecurrence(
PHINode *Phi, Loop *TheLoop,		PHINode *Phi, Loop *TheLoop,
MapVector<Instruction *, Instruction *> &SinkAfter, DominatorTree *DT) {		MapVector<Instruction *, Instruction *> &SinkAfter, DominatorTree *DT) {

▲ Show 20 Lines • Show All 107 Lines • ▼ Show 20 Lines	case RecurKind::Mul:
// Multiplying a number by 1 does not change it.		// Multiplying a number by 1 does not change it.
return ConstantInt::get(Tp, 1);		return ConstantInt::get(Tp, 1);
case RecurKind::And:		case RecurKind::And:
// AND-ing a number with an all-1 value does not change it.		// AND-ing a number with an all-1 value does not change it.
return ConstantInt::get(Tp, -1, true);		return ConstantInt::get(Tp, -1, true);
case RecurKind::FMul:		case RecurKind::FMul:
// Multiplying a number by 1 does not change it.		// Multiplying a number by 1 does not change it.
return ConstantFP::get(Tp, 1.0L);		return ConstantFP::get(Tp, 1.0L);
		case RecurKind::FMulAdd:
case RecurKind::FAdd:		case RecurKind::FAdd:
// Adding zero to a number does not change it.		// Adding zero to a number does not change it.
// FIXME: Ideally we should not need to check FMF for FAdd and should always		// FIXME: Ideally we should not need to check FMF for FAdd and should always
// use -0.0. However, this will currently result in mixed vectors of 0.0/-0.0.		// use -0.0. However, this will currently result in mixed vectors of 0.0/-0.0.
// Instead, we should ensure that 1) the FMF from FAdd are propagated to the PHI		// Instead, we should ensure that 1) the FMF from FAdd are propagated to the PHI
// nodes where possible, and 2) PHIs with the nsz flag + -0.0 use 0.0. This would		// nodes where possible, and 2) PHIs with the nsz flag + -0.0 use 0.0. This would
// mean we can then remove the check for noSignedZeros() below (see D98963).		// mean we can then remove the check for noSignedZeros() below (see D98963).
if (FMF.noSignedZeros())		if (FMF.noSignedZeros())
[... 31 lines not shown ...]	unsigned RecurrenceDescriptor::getOpcode(RecurKind Kind) {
case RecurKind::Or:		case RecurKind::Or:
return Instruction::Or;		return Instruction::Or;
case RecurKind::And:		case RecurKind::And:
return Instruction::And;		return Instruction::And;
case RecurKind::Xor:		case RecurKind::Xor:
return Instruction::Xor;		return Instruction::Xor;
case RecurKind::FMul:		case RecurKind::FMul:
return Instruction::FMul;		return Instruction::FMul;
		case RecurKind::FMulAdd:
case RecurKind::FAdd:		case RecurKind::FAdd:
return Instruction::FAdd;		return Instruction::FAdd;
case RecurKind::SMax:		case RecurKind::SMax:
case RecurKind::SMin:		case RecurKind::SMin:
case RecurKind::UMax:		case RecurKind::UMax:
case RecurKind::UMin:		case RecurKind::UMin:
case RecurKind::SelectICmp:		case RecurKind::SelectICmp:
return Instruction::ICmp;		return Instruction::ICmp;
[... 42 lines not shown ...]	auto getNextInstruction = [&](Instruction *Cur) {
return cast<Instruction>(*Cur->user_begin());		return cast<Instruction>(*Cur->user_begin());
};		};
auto isCorrectOpcode = [&](Instruction *Cur) {		auto isCorrectOpcode = [&](Instruction *Cur) {
if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {		if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
Value *LHS, *RHS;		Value *LHS, *RHS;
return SelectPatternResult::isMinOrMax(		return SelectPatternResult::isMinOrMax(
matchSelectPattern(Cur, LHS, RHS).Flavor);		matchSelectPattern(Cur, LHS, RHS).Flavor);
}		}
		// Recognize a call to the llvm.fmuladd intrinsic.
		if (isFMulAddIntrinsic(Cur))
		return true;
		dmgreen (Done): Is this just returning true because it believes the only instruction found in an fmuladd reduction chain will be an llvm.fmuladd? Could that change in the future if it was able to recognize fadd and fmuladd as a single reduction sequence? Would it be better to check the instruction is isFMulAddIntrinsic, to be on the safe side?

return Cur->getOpcode() == RedOp;		return Cur->getOpcode() == RedOp;
};		};

// The loop exit instruction we check first (as a quick test) but add last. We		// The loop exit instruction we check first (as a quick test) but add last. We
// check the opcode is correct (and dont allow them to be Subs) and that they		// check the opcode is correct (and dont allow them to be Subs) and that they
// have expected to have the expected number of uses. They will have one use		// have expected to have the expected number of uses. They will have one use
// from the phi and one from a LCSSA value, no matter the type.		// from the phi and one from a LCSSA value, no matter the type.
if (!isCorrectOpcode(LoopExitInstr) || !LoopExitInstr->hasNUses(2))		if (!isCorrectOpcode(LoopExitInstr) || !LoopExitInstr->hasNUses(2))
[... 349 lines not shown ...]
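
The identity-value switch in this file lets `RecurKind::FMulAdd` fall through to the `FAdd` case, so both kinds start from the same neutral element. A minimal standalone sketch of that choice, assuming a hypothetical `Kind` enum and `identityFor` helper that stand in for the LLVM types (neither name is from the patch):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the floating-point members of RecurKind.
enum class Kind { FAdd, FMul, FMulAdd };

// Returns the neutral start value for a reduction of the given kind.
// FMulAdd shares the FAdd identity because the accumulator is only ever
// combined by the addition half of the multiply-add.
float identityFor(Kind K, bool NoSignedZeros) {
  switch (K) {
  case Kind::FMul:
    return 1.0f; // x * 1.0 == x
  case Kind::FMulAdd:
  case Kind::FAdd:
    // -0.0 is the exact identity: x + (-0.0) == x even when x is -0.0.
    // +0.0 is only safe when the sign of zero may be ignored (nsz).
    return NoSignedZeros ? 0.0f : -0.0f;
  }
  assert(false && "unhandled kind");
  return 0.0f;
}
```

The sign matters because `-0.0f + 0.0f` yields `+0.0f`, whereas `-0.0f + -0.0f` stays `-0.0f`; starting from `+0.0` could therefore flip the sign of an all-negative-zero sum.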

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[... 2,136 lines not shown ...]	bool AArch64TTIImpl::isLegalToVectorizeReduction(
case RecurKind::SMin:		case RecurKind::SMin:
case RecurKind::SMax:		case RecurKind::SMax:
case RecurKind::UMin:		case RecurKind::UMin:
case RecurKind::UMax:		case RecurKind::UMax:
case RecurKind::FMin:		case RecurKind::FMin:
case RecurKind::FMax:		case RecurKind::FMax:
case RecurKind::SelectICmp:		case RecurKind::SelectICmp:
case RecurKind::SelectFCmp:		case RecurKind::SelectFCmp:
		case RecurKind::FMulAdd:
return true;		return true;
default:		default:
return false;		return false;
}		}
}		}

InstructionCost		InstructionCost
AArch64TTIImpl::getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		AArch64TTIImpl::getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
[... 305 lines not shown ...]

llvm/lib/Transforms/Utils/LoopUtils.cpp

[... 1,043 lines not shown ...]	Value *llvm::createSimpleTargetReduction(IRBuilderBase &Builder,
case RecurKind::Mul:		case RecurKind::Mul:
return Builder.CreateMulReduce(Src);		return Builder.CreateMulReduce(Src);
case RecurKind::And:		case RecurKind::And:
return Builder.CreateAndReduce(Src);		return Builder.CreateAndReduce(Src);
case RecurKind::Or:		case RecurKind::Or:
return Builder.CreateOrReduce(Src);		return Builder.CreateOrReduce(Src);
case RecurKind::Xor:		case RecurKind::Xor:
return Builder.CreateXorReduce(Src);		return Builder.CreateXorReduce(Src);
		case RecurKind::FMulAdd:
case RecurKind::FAdd:		case RecurKind::FAdd:
return Builder.CreateFAddReduce(ConstantFP::getNegativeZero(SrcVecEltTy),		return Builder.CreateFAddReduce(ConstantFP::getNegativeZero(SrcVecEltTy),
Src);		Src);
case RecurKind::FMul:		case RecurKind::FMul:
return Builder.CreateFMulReduce(ConstantFP::get(SrcVecEltTy, 1.0), Src);		return Builder.CreateFMulReduce(ConstantFP::get(SrcVecEltTy, 1.0), Src);
case RecurKind::SMax:		case RecurKind::SMax:
return Builder.CreateIntMaxReduce(Src, true);		return Builder.CreateIntMaxReduce(Src, true);
case RecurKind::SMin:		case RecurKind::SMin:
[... 26 lines not shown ...]	if (RecurrenceDescriptor::isSelectCmpRecurrenceKind(RK))
return createSelectCmpTargetReduction(B, TTI, Src, Desc, OrigPhi);		return createSelectCmpTargetReduction(B, TTI, Src, Desc, OrigPhi);

return createSimpleTargetReduction(B, TTI, Src, RK);		return createSimpleTargetReduction(B, TTI, Src, RK);
}		}

Value *llvm::createOrderedReduction(IRBuilderBase &B,		Value *llvm::createOrderedReduction(IRBuilderBase &B,
const RecurrenceDescriptor &Desc,		const RecurrenceDescriptor &Desc,
Value *Src, Value *Start) {		Value *Src, Value *Start) {
assert(Desc.getRecurrenceKind() == RecurKind::FAdd &&		assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
		Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
"Unexpected reduction kind");		"Unexpected reduction kind");
assert(Src->getType()->isVectorTy() && "Expected a vector type");		assert(Src->getType()->isVectorTy() && "Expected a vector type");
assert(!Start->getType()->isVectorTy() && "Expected a scalar type");		assert(!Start->getType()->isVectorTy() && "Expected a scalar type");

return B.CreateFAddReduce(Start, Src);		return B.CreateFAddReduce(Start, Src);
}		}

void llvm::propagateIRFlags(Value *I, ArrayRef<Value *> VL, Value *OpValue) {		void llvm::propagateIRFlags(Value *I, ArrayRef<Value *> VL, Value *OpValue) {
[... 681 lines not shown ...]
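
`createOrderedReduction` takes a scalar `Start` and a vector `Src` and emits a single `CreateFAddReduce(Start, Src)`, i.e. the lanes are folded into the running scalar strictly left to right. The sketch below models that in plain C++ and contrasts it with a reassociated reduction that keeps independent partial sums; the function names and the `Width` parameter are illustrative assumptions, not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order reduction: fold each lane of Src into the scalar Start, left to
// right. This models the ordered (strict) form of the fadd reduction.
float orderedFAddReduce(float Start, const std::vector<float> &Src) {
  float Acc = Start;
  for (float V : Src)
    Acc = Acc + V;
  return Acc;
}

// Reassociated reduction: keep Width independent partial sums and combine
// them at the end. Only valid when floating-point reassociation is allowed.
float reassociatedFAddReduce(float Start, const std::vector<float> &Src,
                             std::size_t Width) {
  assert(Width > 0 && "need at least one partial sum");
  std::vector<float> Partial(Width, -0.0f); // -0.0 is the fadd identity
  for (std::size_t I = 0; I < Src.size(); ++I)
    Partial[I % Width] += Src[I];
  float Combined = Partial[0];
  for (std::size_t I = 1; I < Width; ++I)
    Combined += Partial[I];
  return Start + Combined;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` with a start of `0.0f`, the ordered version returns `1.0f` while the two-way reassociated version returns `2.0f` in IEEE single precision, which is why a strict reduction has to keep the scalar chain.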

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[... 9,769 lines not shown ...]	for (auto &Reduction : CM.getInLoopReductionChains()) {
for (Instruction *R : ReductionOperations) {		for (Instruction *R : ReductionOperations) {
VPRecipeBase *WidenRecipe = RecipeBuilder.getRecipe(R);		VPRecipeBase *WidenRecipe = RecipeBuilder.getRecipe(R);
RecurKind Kind = RdxDesc.getRecurrenceKind();		RecurKind Kind = RdxDesc.getRecurrenceKind();

VPValue *ChainOp = Plan->getVPValue(Chain);		VPValue *ChainOp = Plan->getVPValue(Chain);
unsigned FirstOpId;		unsigned FirstOpId;
assert(!RecurrenceDescriptor::isSelectCmpRecurrenceKind(Kind) &&		assert(!RecurrenceDescriptor::isSelectCmpRecurrenceKind(Kind) &&
"Only min/max recurrences allowed for inloop reductions");		"Only min/max recurrences allowed for inloop reductions");
		// Recognize a call to the llvm.fmuladd intrinsic.
		bool IsFMulAdd = (Kind == RecurKind::FMulAdd);
		assert((!IsFMulAdd || RecurrenceDescriptor::isFMulAddIntrinsic(R)) &&
		"Expected instruction to be a call to the llvm.fmuladd intrinsic");
		dmgreen (Done): This may be simpler if it avoids the if: `assert((!IsFMulAdd || RecurrenceDescriptor::isFMulAddIntrinsic(R)) && "...");`
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {		if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
assert(isa<VPWidenSelectRecipe>(WidenRecipe) &&		assert(isa<VPWidenSelectRecipe>(WidenRecipe) &&
"Expected to replace a VPWidenSelectSC");		"Expected to replace a VPWidenSelectSC");
FirstOpId = 1;		FirstOpId = 1;
} else {		} else {
assert((MinVF.isScalar() || isa<VPWidenRecipe>(WidenRecipe)) &&		assert((MinVF.isScalar() || isa<VPWidenRecipe>(WidenRecipe) ||
		(IsFMulAdd && isa<VPWidenCallRecipe>(WidenRecipe))) &&
"Expected to replace a VPWidenSC");		"Expected to replace a VPWidenSC");
FirstOpId = 0;		FirstOpId = 0;
}		}
unsigned VecOpId =		unsigned VecOpId =
R->getOperand(FirstOpId) == Chain ? FirstOpId + 1 : FirstOpId;		R->getOperand(FirstOpId) == Chain ? FirstOpId + 1 : FirstOpId;
VPValue *VecOp = Plan->getVPValue(R->getOperand(VecOpId));		VPValue *VecOp = Plan->getVPValue(R->getOperand(VecOpId));

auto *CondOp = CM.foldTailByMasking()		auto *CondOp = CM.foldTailByMasking()
? RecipeBuilder.createBlockInMask(R->getParent(), Plan)		? RecipeBuilder.createBlockInMask(R->getParent(), Plan)
: nullptr;		: nullptr;
VPReductionRecipe *RedRecipe = new VPReductionRecipe(
&RdxDesc, R, ChainOp, VecOp, CondOp, TTI);		if (IsFMulAdd) {
		// If the instruction is a call to the llvm.fmuladd intrinsic then we
		// need to create an fmul recipe to use as the vector operand for the
		// fadd reduction.
		VPInstruction *FMulRecipe = new VPInstruction(
		Instruction::FMul, {VecOp, Plan->getVPValue(R->getOperand(1))});
		dmgreen (Not Done): Would it be possible to create a FMul VPInstruction and a VPReductionRecipe? That way the VPlan better represents the final instructions.
		fhahn (Not Done): +1, that should hopefully help to remove some of the special handling for the `FMulAdd` from codegen.
		RosieSumpter (Author, Not Done): Hi @dmgreen, thanks for the suggestion. Would you (or @fhahn) mind elaborating a bit on what you would expect this to look like? I see the point about wanting the FMul instruction to be present in the VPlan, but having spoken to @david-arm about it, it seems this might mean the VPReductionRecipe having two underlying instructions - is this what you would expect? Any pointers you have would be very useful!
		dmgreen (Not Done): I think there would be two VPRecipes, a VPReductionRecipe and a VPInstruction representing the fmul. That way the VPReductionRecipe is not trying to represent both, and it should remove some of the extra complexity from VPReductionRecipe because it won't need multiple vector inputs. The fmul VPInstruction creates the final fmul instruction, which will have the two inputs.
		WidenRecipe->getParent()->insert(FMulRecipe,
		WidenRecipe->getIterator());
		VecOp = FMulRecipe;
		fhahn (Done): You should be able to use an initialiser list here `{VecOp, Plan->getVPValue(R->getOperand(1))}`.
		fhahn (Done): Is it possible this change got missed when uploading the diff? The comment is marked as done, but the code still uses `SmallVector<> FMulOps`.
		}
		david-arm (Not Done): Hi @RosieSumpter, I realise you were asked to make this change, but it also doesn't feel right to be setting an underlying instruction here because there isn't one really. The underlying instruction is the fmuladd call and is already added to the VPReductionRecipe, so adding it to two recipes feels a bit dangerous? Perhaps we should be adding a new interface to VPInstruction instead that allows setting the 'FastMathFlags' for the instruction?
		fhahn (Not Done): Agreed, we shouldn't set the underlying instruction explicitly here; visibility is intentionally restricted. I don't think the FMF changes should be pulled into this change. Can the setting of FMFs be moved to a follow-up patch?
		RosieSumpter (Author, Not Done): Hi @fhahn, thanks for the comment. I have instead added a `setFastMathFlags` method to the `VPInstruction` class. At the moment I've left it as part of this patch as, after discussion with @david-arm, it seems that it may not be ideal to split out this change given that this would mean submitting code that requires a fix. Also, it doesn't look like VPInstruction has been used for FP operations elsewhere, so currently this change only takes effect when fmuladd is used. If you do still think it would be better as a follow-up patch though I'm happy to do that instead.
		fhahn (Not Done): Could you elaborate what you mean by 'requires a fix'? The code would still be correct without FMFs, just not as optimal as it could be, right? It's quite common to split out changes not directly related into separate patches as this makes it easier to review them individually rather than further extending an already big patch.
		david-arm (Not Done): Hi @fhahn, @RosieSumpter has split out other patches before so I think she is aware why we sometimes split them up. It was actually my suggestion that we might want to keep this in the same patch. I think it depends upon your perspective about whether adding the flags is a "nice to have" or "just broken". Personally to me it feels like the latter because we're not propagating user-specified flags and getting the requested behaviour. If it has to be split out that's fine, but it does feel a bit counter-intuitive.
		VPReductionRecipe *RedRecipe =
		new VPReductionRecipe(&RdxDesc, R, ChainOp, VecOp, CondOp, TTI);
WidenRecipe->getVPSingleValue()->replaceAllUsesWith(RedRecipe);		WidenRecipe->getVPSingleValue()->replaceAllUsesWith(RedRecipe);
Plan->removeVPValueFor(R);		Plan->removeVPValueFor(R);
Plan->addVPValue(R, RedRecipe);		Plan->addVPValue(R, RedRecipe);
WidenRecipe->getParent()->insert(RedRecipe, WidenRecipe->getIterator());		WidenRecipe->getParent()->insert(RedRecipe, WidenRecipe->getIterator());
WidenRecipe->getVPSingleValue()->replaceAllUsesWith(RedRecipe);		WidenRecipe->getVPSingleValue()->replaceAllUsesWith(RedRecipe);
WidenRecipe->eraseFromParent();		WidenRecipe->eraseFromParent();

if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {		if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
[... 156 lines not shown ...]	for (unsigned Part = 0; Part < State.UF; ++Part) {
Value *NewRed;		Value *NewRed;
Value *NextInChain;		Value *NextInChain;
if (IsOrdered) {		if (IsOrdered) {
if (State.VF.isVector())		if (State.VF.isVector())
NewRed = createOrderedReduction(State.Builder, *RdxDesc, NewVecOp,		NewRed = createOrderedReduction(State.Builder, *RdxDesc, NewVecOp,
PrevInChain);		PrevInChain);
else		else
NewRed = State.Builder.CreateBinOp(		NewRed = State.Builder.CreateBinOp(
(Instruction::BinaryOps)RdxDesc->getOpcode(Kind), PrevInChain,		(Instruction::BinaryOps)RdxDesc->getOpcode(Kind), PrevInChain,
NewVecOp);		NewVecOp);
		paulwalker-arm (Not Done): I'm assuming this change will remain live after this point and thus affect nodes created after exiting this function? Which would be bad. In any case I think that instead of calling `CreateBinOp` you can instead call `CreateFMulFMF` which will propagate the FMF flags for you.
		RosieSumpter (Author, Not Done): Hi @paulwalker-arm thanks for the comments. It turns out this exposed a problem with fast-math flags not being propagated for ordered reductions, so I've added a patch for that here D112548
		fhahn (Done): This is a nice cleanup and could be split off as simple NFC.
		RosieSumpter (Author, Not Done): I've added an NFC patch for this here D112547
PrevInChain = NewRed;		PrevInChain = NewRed;
} else {		} else {
PrevInChain = State.get(getChainOp(), Part);		PrevInChain = State.get(getChainOp(), Part);
NewRed = createTargetReduction(State.Builder, TTI, *RdxDesc, NewVecOp);		NewRed = createTargetReduction(State.Builder, TTI, *RdxDesc, NewVecOp);
}		}
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {		if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
NextInChain =		NextInChain =
createMinMaxOp(State.Builder, RdxDesc->getRecurrenceKind(),		createMinMaxOp(State.Builder, RdxDesc->getRecurrenceKind(),
[... 835 lines not shown ...]
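
The approach discussed in the comments above splits each `llvm.fmuladd` into an `fmul` recipe whose result feeds an ordered `fadd` reduction. Since `llvm.fmuladd` permits but does not require fusion, the unfused multiply followed by an in-order add is a valid expansion. A rough standalone model of that shape, with hypothetical function names and a simplified tail (the real vectorizer handles remainders differently):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference loop: Sum = A[I] * B[I] + Sum, one element at a time.
float scalarFMulAdd(const std::vector<float> &A, const std::vector<float> &B,
                    float Start) {
  assert(A.size() == B.size() && "operand arrays must match");
  float Sum = Start;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum = A[I] * B[I] + Sum;
  return Sum;
}

// Model of the in-loop ordered form: per chunk of VF lanes, first an
// elementwise multiply (the fmul recipe), then an in-order add of every
// product into the running scalar (the fadd reduction recipe).
float orderedFMulAdd(const std::vector<float> &A, const std::vector<float> &B,
                     float Start, std::size_t VF) {
  assert(A.size() == B.size() && VF > 0 && "bad operands");
  float Sum = Start;
  std::vector<float> Prod(VF);
  for (std::size_t Base = 0; Base < A.size(); Base += VF) {
    // Simplification: a short final chunk is processed with fewer lanes.
    std::size_t Lanes = std::min(VF, A.size() - Base);
    for (std::size_t L = 0; L < Lanes; ++L)
      Prod[L] = A[Base + L] * B[Base + L];
    for (std::size_t L = 0; L < Lanes; ++L)
      Sum = Sum + Prod[L];
  }
  return Sum;
}
```

Because the additions happen in the same order as in the scalar loop, the two functions agree whenever the scalar multiply-add is itself evaluated unfused; chaining the running sum from one chunk to the next plays the role of `PrevInChain` in the `IsOrdered` branch.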

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[... 8,785 lines not shown ...]	private:
}		}

/// Emit a horizontal reduction of the vectorized value.		/// Emit a horizontal reduction of the vectorized value.
Value *emitReduction(Value *VectorizedValue, IRBuilder<> &Builder,		Value *emitReduction(Value *VectorizedValue, IRBuilder<> &Builder,
unsigned ReduxWidth, const TargetTransformInfo *TTI) {		unsigned ReduxWidth, const TargetTransformInfo *TTI) {
assert(VectorizedValue && "Need to have a vectorized tree node");		assert(VectorizedValue && "Need to have a vectorized tree node");
assert(isPowerOf2_32(ReduxWidth) &&		assert(isPowerOf2_32(ReduxWidth) &&
"We only handle power-of-two reductions for now");		"We only handle power-of-two reductions for now");
		assert(RdxKind != RecurKind::FMulAdd &&
		"A call to the llvm.fmuladd intrinsic is not handled yet");

++NumVectorInstructions;		++NumVectorInstructions;
return createSimpleTargetReduction(Builder, TTI, VectorizedValue, RdxKind,		return createSimpleTargetReduction(Builder, TTI, VectorizedValue, RdxKind,
ReductionOps.back());		ReductionOps.back());
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace
[... 874 lines not shown ...]

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

[... 384 lines not shown ...]	for.body: ; preds = %entry, %for.body
%exitcond.not = icmp eq i64 %iv.next, %n		%exitcond.not = icmp eq i64 %iv.next, %n
br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

for.end: ; preds = %for.body		for.end: ; preds = %for.body
%rdx = phi float [ %add3, %for.body ]		%rdx = phi float [ %add3, %for.body ]
ret float %rdx		ret float %rdx
}		}

		; Test case where loop has a call to the llvm.fmuladd intrinsic.
		define float @fmuladd_strict(float* %a, float* %b, i64 %n) #0 {
		; CHECK-ORDERED-LABEL: @fmuladd_strict
		; CHECK-ORDERED: vector.body:
		; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]
		david-arm (Done): Hi @RosieSumpter, I think this CHECK line should be `; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]` and then lower down you should have `; CHECK-ORDERED: [[RDX3]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX2]], <vscale x 8 x float> [[FMUL3]])`. This ensures that the return value from the final intrinsic call ends up as the incoming value for the PHI. At the moment you have `[[RDX3:%.*]]` in both cases so in theory they could be different.
		; CHECK-ORDERED: [[WIDE_LOAD:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD2:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD3:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD5:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD6:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD7:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[FMUL:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD]], [[WIDE_LOAD4]]
		; CHECK-ORDERED: [[FMUL1:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD1]], [[WIDE_LOAD5]]
		; CHECK-ORDERED: [[FMUL2:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
		; CHECK-ORDERED: [[FMUL3:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
		; CHECK-ORDERED: [[RDX:%.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[VEC_PHI]], <vscale x 8 x float> [[FMUL]])
		; CHECK-ORDERED: [[RDX1:%.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX]], <vscale x 8 x float> [[FMUL1]])
		; CHECK-ORDERED: [[RDX2:%.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX1]], <vscale x 8 x float> [[FMUL2]])
		; CHECK-ORDERED: [[RDX3]] = call float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX2]], <vscale x 8 x float> [[FMUL3]])
		; CHECK-ORDERED: for.end
		; CHECK-ORDERED: [[RES:%.*]] = phi float [ [[SCALAR:%.*]], %for.body ], [ [[RDX3]], %middle.block ]
		; CHECK-ORDERED: ret float [[RES]]

		; CHECK-UNORDERED-LABEL: @fmuladd_strict
		; CHECK-UNORDERED: vector.body
		; CHECK-UNORDERED: [[VEC_PHI:%.*]] = phi <vscale x 8 x float> [ insertelement (<vscale x 8 x float> shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), float 0.000000e+00, i32 0), %vector.ph ], [ [[FMULADD:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI1:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD1:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI2:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD2:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI3:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD3:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[WIDE_LOAD:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD2:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD3:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD5:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD6:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD7:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[FMULADD]] = call <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD]], <vscale x 8 x float> [[WIDE_LOAD4]], <vscale x 8 x float> [[VEC_PHI]])
		; CHECK-UNORDERED: [[FMULADD1]] = call <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD1]], <vscale x 8 x float> [[WIDE_LOAD5]], <vscale x 8 x float> [[VEC_PHI1]])
		; CHECK-UNORDERED: [[FMULADD2]] = call <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD2]], <vscale x 8 x float> [[WIDE_LOAD6]], <vscale x 8 x float> [[VEC_PHI2]])
		; CHECK-UNORDERED: [[FMULADD3]] = call <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD3]], <vscale x 8 x float> [[WIDE_LOAD7]], <vscale x 8 x float> [[VEC_PHI3]])
		; CHECK-UNORDERED-NOT: llvm.vector.reduce.fadd
		; CHECK-UNORDERED: middle.block
		; CHECK-UNORDERED: [[BIN_RDX:%.*]] = fadd <vscale x 8 x float> [[FMULADD1]], [[FMULADD]]
		; CHECK-UNORDERED: [[BIN_RDX1:%.*]] = fadd <vscale x 8 x float> [[FMULADD2]], [[BIN_RDX]]
		; CHECK-UNORDERED: [[BIN_RDX2:%.*]] = fadd <vscale x 8 x float> [[FMULADD3]], [[BIN_RDX1]]
		; CHECK-UNORDERED: [[RDX:%.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float -0.000000e+00, <vscale x 8 x float> [[BIN_RDX2]]
		; CHECK-UNORDERED: for.body
		; CHECK-UNORDERED: [[SUM_07:%.*]] = phi float [ [[SCALAR:%.*]], %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
		; CHECK-UNORDERED: [[LOAD:%.*]] = load float, float*
		; CHECK-UNORDERED: [[LOAD1:%.*]] = load float, float*
		; CHECK-UNORDERED: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD1]], float [[SUM_07]])
		; CHECK-UNORDERED: for.end
		; CHECK-UNORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[RDX]], %middle.block ]
		; CHECK-UNORDERED: ret float [[RES]]

		; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_strict
		; CHECK-NOT-VECTORIZED-NOT: vector.body

		entry:
		br label %for.body

		for.body:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
		%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
		%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
		%0 = load float, float* %arrayidx, align 4
		%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
		%1 = load float, float* %arrayidx2, align 4
		%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
		%iv.next = add nuw nsw i64 %iv, 1
		%exitcond.not = icmp eq i64 %iv.next, %n
		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

		for.end:
		ret float %muladd
		}

		; Same as above but where the call to the llvm.fmuladd intrinsic has a fast-math flag.
		define float @fmuladd_strict_fmf(float* %a, float* %b, i64 %n) #0 {
		; CHECK-ORDERED-LABEL: @fmuladd_strict_fmf
		; CHECK-ORDERED: vector.body:
		; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]
		david-arm (Done): I think we should be checking for `[[RDX3:%.*]]` here.
		; CHECK-ORDERED: [[WIDE_LOAD:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD2:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD3:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD5:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD6:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[WIDE_LOAD7:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-ORDERED: [[FMUL:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD]], [[WIDE_LOAD4]]
		; CHECK-ORDERED: [[FMUL1:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD1]], [[WIDE_LOAD5]]
		; CHECK-ORDERED: [[FMUL2:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
		; CHECK-ORDERED: [[FMUL3:%.*]] = fmul <vscale x 8 x float> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
		; CHECK-ORDERED: [[RDX:%.*]] = call nnan float @llvm.vector.reduce.fadd.nxv8f32(float [[VEC_PHI]], <vscale x 8 x float> [[FMUL]])
		; CHECK-ORDERED: [[RDX1:%.*]] = call nnan float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX]], <vscale x 8 x float> [[FMUL1]])
		; CHECK-ORDERED: [[RDX2:%.*]] = call nnan float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX1]], <vscale x 8 x float> [[FMUL2]])
		; CHECK-ORDERED: [[RDX3]] = call nnan float @llvm.vector.reduce.fadd.nxv8f32(float [[RDX2]], <vscale x 8 x float> [[FMUL3]])
		david-arm (Done): Again, you should be able to use `[[RDX3]] = call ...` here.
		; CHECK-ORDERED: for.end
		; CHECK-ORDERED: [[RES:%.*]] = phi float [ [[SCALAR:%.*]], %for.body ], [ [[RDX3]], %middle.block ]
		; CHECK-ORDERED: ret float [[RES]]

		; CHECK-UNORDERED-LABEL: @fmuladd_strict_fmf
		; CHECK-UNORDERED: vector.body
		; CHECK-UNORDERED: [[VEC_PHI:%.*]] = phi <vscale x 8 x float> [ insertelement (<vscale x 8 x float> shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), float 0.000000e+00, i32 0), %vector.ph ], [ [[FMULADD:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI1:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD1:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI2:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD2:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[VEC_PHI3:%.*]] = phi <vscale x 8 x float> [ shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> poison, float -0.000000e+00, i32 0), <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer), %vector.ph ], [ [[FMULADD3:%.*]], %vector.body ]
		; CHECK-UNORDERED: [[WIDE_LOAD:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD2:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD3:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD5:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD6:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[WIDE_LOAD7:%.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
		; CHECK-UNORDERED: [[FMULADD]] = call nnan <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD]], <vscale x 8 x float> [[WIDE_LOAD4]], <vscale x 8 x float> [[VEC_PHI]])
		; CHECK-UNORDERED: [[FMULADD1]] = call nnan <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD1]], <vscale x 8 x float> [[WIDE_LOAD5]], <vscale x 8 x float> [[VEC_PHI1]])
		; CHECK-UNORDERED: [[FMULADD2]] = call nnan <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD2]], <vscale x 8 x float> [[WIDE_LOAD6]], <vscale x 8 x float> [[VEC_PHI2]])
		; CHECK-UNORDERED: [[FMULADD3]] = call nnan <vscale x 8 x float> @llvm.fmuladd.nxv8f32(<vscale x 8 x float> [[WIDE_LOAD3]], <vscale x 8 x float> [[WIDE_LOAD7]], <vscale x 8 x float> [[VEC_PHI3]])
		; CHECK-UNORDERED-NOT: llvm.vector.reduce.fadd
		; CHECK-UNORDERED: middle.block
		; CHECK-UNORDERED: [[BIN_RDX:%.*]] = fadd nnan <vscale x 8 x float> [[FMULADD1]], [[FMULADD]]
		; CHECK-UNORDERED: [[BIN_RDX1:%.*]] = fadd nnan <vscale x 8 x float> [[FMULADD2]], [[BIN_RDX]]
		; CHECK-UNORDERED: [[BIN_RDX2:%.*]] = fadd nnan <vscale x 8 x float> [[FMULADD3]], [[BIN_RDX1]]
		; CHECK-UNORDERED: [[RDX:%.*]] = call nnan float @llvm.vector.reduce.fadd.nxv8f32(float -0.000000e+00, <vscale x 8 x float> [[BIN_RDX2]]
		; CHECK-UNORDERED: for.body
		; CHECK-UNORDERED: [[SUM_07:%.*]] = phi float [ [[SCALAR:%.*]], %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
		; CHECK-UNORDERED: [[LOAD:%.*]] = load float, float*
		; CHECK-UNORDERED: [[LOAD1:%.*]] = load float, float*
		; CHECK-UNORDERED: [[MULADD]] = tail call nnan float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD1]], float [[SUM_07]])
		; CHECK-UNORDERED: for.end
		; CHECK-UNORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[RDX]], %middle.block ]
		; CHECK-UNORDERED: ret float [[RES]]

		; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_strict_fmf
		; CHECK-NOT-VECTORIZED-NOT: vector.body

		entry:
		br label %for.body

		for.body:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
		%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
		%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
		%0 = load float, float* %arrayidx, align 4
		%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
		%1 = load float, float* %arrayidx2, align 4
		%muladd = tail call nnan float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
		%iv.next = add nuw nsw i64 %iv, 1
		%exitcond.not = icmp eq i64 %iv.next, %n
		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

		for.end:
		ret float %muladd
		}

		declare float @llvm.fmuladd.f32(float, float, float)

attributes #0 = { vscale_range(0, 16) }
!0 = distinct !{!0, !3, !6, !8}
!1 = distinct !{!1, !3, !7, !8}
!2 = distinct !{!2, !4, !6, !8}
!3 = !{!"llvm.loop.vectorize.width", i32 8}
!4 = !{!"llvm.loop.vectorize.width", i32 4}
!5 = !{!"llvm.loop.vectorize.width", i32 2}
!6 = !{!"llvm.loop.interleave.count", i32 1}
!7 = !{!"llvm.loop.interleave.count", i32 4}
!8 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll

	; CHECK-ORDERED-LABEL: @fadd_multiple_use
	; CHECK-ORDERED-LABEL-NOT: vector.body

	; CHECK-UNORDERED-LABEL: @fadd_multiple_use
	; CHECK-UNORDERED-LABEL-NOT: vector.body

	; CHECK-NOT-VECTORIZED-LABEL: @fadd_multiple_use
	; CHECK-NOT-VECTORIZED-NOT: vector.body

	entry:
	br label %for.body

	for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next2, %bb2 ]
	%red = phi float [ 0.0, %entry ], [ %fadd, %bb2 ]
	%phi1 = phi i64 [ 0, %entry ], [ %iv.next, %bb2 ]
	%fadd = fadd float %red, 1.000000e+00
	%iv.next = add nsw i64 %phi1, 1
				david-arm: Again, you should be able to use `[[RDX3]] = call ...` here
	%cmp = icmp ult i64 %iv, %n
	br i1 %cmp, label %bb2, label %bb1

	bb1:
	%phi2 = phi float [ %fadd, %for.body ]
	ret float %phi2

	bb2:
	%iv.next2 = add nuw nsw i64 %iv, 1
	br i1 false, label %for.end, label %for.body

	for.end:
	%phi3 = phi float [ %fadd, %bb2 ]
	ret float %phi3
	}

				; Test case where the loop has a call to the llvm.fmuladd intrinsic.
				define float @fmuladd_strict(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @fmuladd_strict
				; CHECK-ORDERED: vector.body:
				; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[RDX3:%.*]], %vector.body ]
				; CHECK-ORDERED: [[WIDE_LOAD:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD2:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD3:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD4:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD5:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD6:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[WIDE_LOAD7:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-ORDERED: [[FMUL:%.*]] = fmul <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD4]]
				; CHECK-ORDERED: [[FMUL1:%.*]] = fmul <8 x float> [[WIDE_LOAD1]], [[WIDE_LOAD5]]
				; CHECK-ORDERED: [[FMUL2:%.*]] = fmul <8 x float> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
				; CHECK-ORDERED: [[FMUL3:%.*]] = fmul <8 x float> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
				; CHECK-ORDERED: [[RDX:%.*]] = call float @llvm.vector.reduce.fadd.v8f32(float [[VEC_PHI]], <8 x float> [[FMUL]])
				; CHECK-ORDERED: [[RDX1:%.*]] = call float @llvm.vector.reduce.fadd.v8f32(float [[RDX]], <8 x float> [[FMUL1]])
				; CHECK-ORDERED: [[RDX2:%.*]] = call float @llvm.vector.reduce.fadd.v8f32(float [[RDX1]], <8 x float> [[FMUL2]])
				; CHECK-ORDERED: [[RDX3]] = call float @llvm.vector.reduce.fadd.v8f32(float [[RDX2]], <8 x float> [[FMUL3]])
				; CHECK-ORDERED: for.body:
				; CHECK-ORDERED: [[SUM_07:%.*]] = phi float [ {{.*}}, %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
				; CHECK-ORDERED: [[LOAD:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD1:%.*]] = load float, float*
				; CHECK-ORDERED: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD1]], float [[SUM_07]])
				; CHECK-ORDERED: for.end
				; CHECK-ORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[RDX3]], %middle.block ]

				; CHECK-UNORDERED-LABEL: @fmuladd_strict
				; CHECK-UNORDERED: vector.body:
				; CHECK-UNORDERED: [[VEC_PHI:%.*]] = phi <8 x float> [ <float 0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %vector.ph ], [ [[FMULADD:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[WIDE_LOAD:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD2:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD3:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD4:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[FMULADD]] = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> [[WIDE_LOAD]], <8 x float> [[WIDE_LOAD4]], <8 x float> [[VEC_PHI]])
				; CHECK-UNORDERED-NOT: llvm.vector.reduce.fadd
				; CHECK-UNORDERED: middle.block:
				; CHECK-UNORDERED: [[BIN_RDX1:%.*]] = fadd <8 x float>
				; CHECK-UNORDERED: [[BIN_RDX2:%.*]] = fadd <8 x float>
				; CHECK-UNORDERED: [[BIN_RDX3:%.*]] = fadd <8 x float>
				david-arm: `[[RDX3:%.*]]`
				; CHECK-UNORDERED: [[RDX:%.*]] = call float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> [[BIN_RDX3]])
				; CHECK-UNORDERED: for.body:
				; CHECK-UNORDERED: [[SUM_07:%.*]] = phi float [ {{.*}}, %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
				; CHECK-UNORDERED: [[LOAD:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD2:%.*]] = load float, float*
				; CHECK-UNORDERED: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD2]], float [[SUM_07]])
				; CHECK-UNORDERED: for.end:
				; CHECK-UNORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[RDX]], %middle.block ]
				; CHECK-UNORDERED: ret float [[RES]]

				; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_strict
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				david-arm: `[[RDX3]] = fadd ...`
				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd
				}
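The CHECK lines for @fmuladd_strict expect two different loop shapes. A rough C++ sketch of the difference follows; it is not taken from the patch, the function names are made up for illustration, it models a single vector part rather than the four interleaved parts in the test, and it ignores whether the multiply and add are fused.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Scalar reference loop: sum = a[i] * b[i] + sum, one element at a time.
// Both inputs are assumed to have the same length.
float scalarMulAdd(const std::vector<float> &a, const std::vector<float> &b,
                   float start) {
  float sum = start;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum = a[i] * b[i] + sum;
  return sum;
}

// Ordered (strict) shape: form VF products with a vector fmul, then fold them
// into the single scalar accumulator in the original element order, as an
// ordered reduce.fadd seeded with the running sum would.
template <std::size_t VF>
float orderedMulAdd(const std::vector<float> &a, const std::vector<float> &b,
                    float start) {
  float sum = start;
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF) {
    std::array<float, VF> prod;
    for (std::size_t lane = 0; lane < VF; ++lane)
      prod[lane] = a[i + lane] * b[i + lane];
    for (std::size_t lane = 0; lane < VF; ++lane)
      sum += prod[lane];
  }
  for (; i < a.size(); ++i) // scalar remainder loop
    sum = a[i] * b[i] + sum;
  return sum;
}

// Unordered shape: VF independent lane accumulators. Lane 0 is seeded with the
// start value and the other lanes with -0.0; the lanes are only combined after
// the loop, which changes the order of the additions.
template <std::size_t VF>
float unorderedMulAdd(const std::vector<float> &a, const std::vector<float> &b,
                      float start) {
  std::array<float, VF> acc;
  acc.fill(-0.0f);
  acc[0] = start;
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane)
      acc[lane] = a[i + lane] * b[i + lane] + acc[lane];
  float sum = -0.0f; // reduce.fadd(-0.0, acc) in the middle block
  for (std::size_t lane = 0; lane < VF; ++lane)
    sum += acc[lane];
  for (; i < a.size(); ++i) // scalar remainder loop
    sum = a[i] * b[i] + sum;
  return sum;
}
```

With inputs whose partial sums are all exactly representable the three functions agree; with general floating-point data only the ordered shape is expected to keep the scalar loop's rounding behaviour.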

				; Test reductions for a VF of 1 and a UF > 1 where the loop has a call to the llvm.fmuladd intrinsic.
				define float @fmuladd_scalar_vf(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @fmuladd_scalar_vf
				; CHECK-ORDERED: vector.body:
				; CHECK-ORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[FADD3:%.*]], %vector.body ]
				; CHECK-ORDERED: [[LOAD:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD1:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD2:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD3:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD4:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD5:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD6:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD7:%.*]] = load float, float*
				; CHECK-ORDERED: [[FMUL:%.*]] = fmul float [[LOAD]], [[LOAD4]]
				; CHECK-ORDERED: [[FMUL1:%.*]] = fmul float [[LOAD1]], [[LOAD5]]
				; CHECK-ORDERED: [[FMUL2:%.*]] = fmul float [[LOAD2]], [[LOAD6]]
				; CHECK-ORDERED: [[FMUL3:%.*]] = fmul float [[LOAD3]], [[LOAD7]]
				; CHECK-ORDERED: [[FADD:%.*]] = fadd float [[VEC_PHI]], [[FMUL]]
				; CHECK-ORDERED: [[FADD1:%.*]] = fadd float [[FADD]], [[FMUL1]]
				; CHECK-ORDERED: [[FADD2:%.*]] = fadd float [[FADD1]], [[FMUL2]]
				; CHECK-ORDERED: [[FADD3]] = fadd float [[FADD2]], [[FMUL3]]
				; CHECK-ORDERED-NOT: llvm.vector.reduce.fadd
				; CHECK-ORDERED: scalar.ph
				; CHECK-ORDERED: [[MERGE_RDX:%.*]] = phi float [ 0.000000e+00, %entry ], [ [[FADD3]], %middle.block ]
				; CHECK-ORDERED: for.body
				; CHECK-ORDERED: [[SUM_07:%.*]] = phi float [ [[MERGE_RDX]], %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
				; CHECK-ORDERED: [[LOAD8:%.*]] = load float, float*
				; CHECK-ORDERED: [[LOAD9:%.*]] = load float, float*
				; CHECK-ORDERED: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[LOAD8]], float [[LOAD9]], float [[SUM_07]])
				; CHECK-ORDERED: for.end
				; CHECK-ORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[FADD3]], %middle.block ]
				; CHECK-ORDERED: ret float [[RES]]

				; CHECK-UNORDERED-LABEL: @fmuladd_scalar_vf
				; CHECK-UNORDERED: vector.body:
				; CHECK-UNORDERED: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[FMULADD:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[VEC_PHI1:%.*]] = phi float [ -0.000000e+00, %vector.ph ], [ [[FMULADD1:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[VEC_PHI2:%.*]] = phi float [ -0.000000e+00, %vector.ph ], [ [[FMULADD2:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[VEC_PHI3:%.*]] = phi float [ -0.000000e+00, %vector.ph ], [ [[FMULADD3:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[LOAD:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD1:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD2:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD3:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD4:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD5:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD6:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD7:%.*]] = load float, float*
				; CHECK-UNORDERED: [[FMULADD]] = call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD4]], float [[VEC_PHI]])
				; CHECK-UNORDERED: [[FMULADD1]] = call float @llvm.fmuladd.f32(float [[LOAD1]], float [[LOAD5]], float [[VEC_PHI1]])
				; CHECK-UNORDERED: [[FMULADD2]] = call float @llvm.fmuladd.f32(float [[LOAD2]], float [[LOAD6]], float [[VEC_PHI2]])
				; CHECK-UNORDERED: [[FMULADD3]] = call float @llvm.fmuladd.f32(float [[LOAD3]], float [[LOAD7]], float [[VEC_PHI3]])
				; CHECK-UNORDERED-NOT: llvm.vector.reduce.fadd
				; CHECK-UNORDERED: middle.block:
				; CHECK-UNORDERED: [[BIN_RDX:%.*]] = fadd float [[FMULADD1]], [[FMULADD]]
				; CHECK-UNORDERED: [[BIN_RDX1:%.*]] = fadd float [[FMULADD2]], [[BIN_RDX]]
				david-arm: Is it worth also having a negative test for the case when the PHI is both a mul operand and the add operand too?
				; CHECK-UNORDERED: [[BIN_RDX2:%.*]] = fadd float [[FMULADD3]], [[BIN_RDX1]]
				; CHECK-UNORDERED: scalar.ph:
				; CHECK-UNORDERED: [[MERGE_RDX:%.*]] = phi float [ 0.000000e+00, %entry ], [ [[BIN_RDX2]], %middle.block ]
				; CHECK-UNORDERED: for.body:
				; CHECK-UNORDERED: [[SUM_07:%.*]] = phi float [ [[MERGE_RDX]], %scalar.ph ], [ [[MULADD:%.*]], %for.body ]
				; CHECK-UNORDERED: [[LOAD8:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD9:%.*]] = load float, float*
				; CHECK-UNORDERED: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[LOAD8]], float [[LOAD9]], float [[SUM_07]])
				; CHECK-UNORDERED: for.end:
				; CHECK-UNORDERED: [[RES:%.*]] = phi float [ [[MULADD]], %for.body ], [ [[BIN_RDX2]], %middle.block ]
				; CHECK-UNORDERED: ret float [[RES]]

				; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_scalar_vf
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !4

				for.end:
				ret float %muladd
				}
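In the unordered checks for @fmuladd_scalar_vf, the first partial accumulator starts at the reduction's start value and the remaining ones start at -0.0. The reason is that -0.0, not +0.0, is the identity for floating-point addition: under the default round-to-nearest mode, x + (-0.0) gives x for every x, whereas (-0.0) + (+0.0) gives +0.0. The small C++ sketch below illustrates this; `combinePartials` is a hypothetical helper that models a loop where every partial accumulator still holds its seed when the parts are added together.

```cpp
#include <cmath>

// Combine `uf` partial accumulators the way the middle block chains its fadd
// instructions. Part 0 holds `start`; parts 1..uf-1 hold `identity`.
// (Hypothetical helper for illustration only.)
float combinePartials(float start, float identity, int uf) {
  float result = start;
  for (int part = 1; part < uf; ++part)
    result = identity + result;
  return result;
}

// True if combining with the given identity preserves the sign of a -0.0 start.
bool preservesNegativeZero(float identity) {
  return std::signbit(combinePartials(-0.0f, identity, 4));
}
```

Seeding the extra parts with +0.0 would therefore turn a -0.0 result into +0.0, which is observable without fast-math flags.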

				; Test case where the reduction phi is one of the mul operands of the fmuladd.
				define float @fmuladd_phi_is_mul_operand(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @fmuladd_phi_is_mul_operand
				; CHECK-ORDERED-NOT: vector.body

				; CHECK-UNORDERED-LABEL: @fmuladd_phi_is_mul_operand
				; CHECK-UNORDERED-NOT: vector.body

				; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_phi_is_mul_operand
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %sum.07, float %0, float %1)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd
				}

				; Test case where the reduction phi is two operands of the fmuladd.
				define float @fmuladd_phi_is_two_operands(float* %a, i64 %n) {
				; CHECK-ORDERED-LABEL: @fmuladd_phi_is_two_operands
				; CHECK-ORDERED-NOT: vector.body

				; CHECK-UNORDERED-LABEL: @fmuladd_phi_is_two_operands
				; CHECK-UNORDERED-NOT: vector.body

				; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_phi_is_two_operands
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %sum.07, float %0, float %sum.07)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd
				}
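The two negative tests above expect no vectorization when the reduction phi feeds the multiply, either alone or together with the addend. The operand rule they exercise can be sketched with a toy model in C++. This is not the code from IVDescriptors.cpp: `FMulAddCall` and `isFMulAddReductionUse` are hypothetical names, and values are represented by plain integer ids.

```cpp
#include <array>

// Toy stand-in for a call to llvm.fmuladd(mul0, mul1, addend); each operand is
// identified by an integer id. (Hypothetical type for illustration only.)
struct FMulAddCall {
  std::array<int, 3> ops;
};

// Returns true if `phi` is used in a way that keeps the call an add reduction:
// it must be the addend (operand 2) and must not also be a multiply operand.
bool isFMulAddReductionUse(const FMulAddCall &call, int phi) {
  if (call.ops[2] != phi)
    return false; // e.g. fmuladd(phi, x, y): the phi is scaled, not accumulated
  return call.ops[0] != phi && call.ops[1] != phi; // e.g. fmuladd(phi, x, phi)
}
```

With the phi as id 0 and the two loads as ids 1 and 2, `{1, 2, 0}` is accepted, while `{0, 1, 2}` and `{0, 1, 0}` are rejected, matching the shapes of @fmuladd_strict, @fmuladd_phi_is_mul_operand and @fmuladd_phi_is_two_operands respectively.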

				; Test case with multiple calls to llvm.fmuladd, which is not safe to reorder
				; so is only vectorized in the unordered (fast) case.
				define float @fmuladd_multiple(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @fmuladd_multiple
				; CHECK-ORDERED-NOT: vector.body:

				; CHECK-UNORDERED-LABEL: @fmuladd_multiple
				; CHECK-UNORDERED: vector.body:
				; CHECK-UNORDERED: [[VEC_PHI:%.*]] = phi <8 x float> [ <float 0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %vector.ph ], [ [[FMULADD2:%.*]], %vector.body ]
				; CHECK-UNORDERED: [[WIDE_LOAD:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD2:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD3:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[WIDE_LOAD4:%.*]] = load <8 x float>, <8 x float>*
				; CHECK-UNORDERED: [[FMULADD:%.*]] = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> [[WIDE_LOAD]], <8 x float> [[WIDE_LOAD4]], <8 x float> [[VEC_PHI]])
				; CHECK-UNORDERED: [[FMULADD2]] = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> [[WIDE_LOAD]], <8 x float> [[WIDE_LOAD4]], <8 x float> [[FMULADD]])
				; CHECK-UNORDERED-NOT: llvm.vector.reduce.fadd
				; CHECK-UNORDERED: middle.block:
				; CHECK-UNORDERED: [[BIN_RDX1:%.*]] = fadd <8 x float>
				; CHECK-UNORDERED: [[BIN_RDX2:%.*]] = fadd <8 x float>
				; CHECK-UNORDERED: [[BIN_RDX3:%.*]] = fadd <8 x float>
				; CHECK-UNORDERED: [[RDX:%.*]] = call float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> [[BIN_RDX3]])
				; CHECK-UNORDERED: for.body:
				; CHECK-UNORDERED: [[SUM_07:%.*]] = phi float [ {{.*}}, %scalar.ph ], [ [[MULADD2:%.*]], %for.body ]
				; CHECK-UNORDERED: [[LOAD:%.*]] = load float, float*
				; CHECK-UNORDERED: [[LOAD2:%.*]] = load float, float*
				; CHECK-UNORDERED: [[MULADD:%.*]] = tail call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD2]], float [[SUM_07]])
				; CHECK-UNORDERED: [[MULADD2]] = tail call float @llvm.fmuladd.f32(float [[LOAD]], float [[LOAD2]], float [[MULADD]])
				; CHECK-UNORDERED: for.end:
				; CHECK-UNORDERED: [[RES:%.*]] = phi float [ [[MULADD2]], %for.body ], [ [[RDX]], %middle.block ]
				; CHECK-UNORDERED: ret float [[RES]]

				; CHECK-NOT-VECTORIZED-LABEL: @fmuladd_multiple
				; CHECK-NOT-VECTORIZED-NOT: vector.body:

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
				%muladd2 = tail call float @llvm.fmuladd.f32(float %0, float %1, float %muladd)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd2
				}

				; Same as above but the first fmuladd is one of the mul operands of the second fmuladd.
				define float @multiple_fmuladds_mul_operand(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @multiple_fmuladds_mul_operand
				; CHECK-ORDERED-NOT: vector.body

				; CHECK-UNORDERED-LABEL: @multiple_fmuladds_mul_operand
				; CHECK-UNORDERED-NOT: vector.body

				; CHECK-NOT-VECTORIZED-LABEL: @multiple_fmuladds_mul_operand
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
				%muladd2 = tail call float @llvm.fmuladd.f32(float %0, float %muladd, float %1)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd2
				}

				; Same as above but the first fmuladd is two of the operands of the second fmuladd.
				define float @multiple_fmuladds_two_operands(float* %a, float* %b, i64 %n) {
				; CHECK-ORDERED-LABEL: @multiple_fmuladds_two_operands
				; CHECK-ORDERED-NOT: vector.body

				; CHECK-UNORDERED-LABEL: @multiple_fmuladds_two_operands
				; CHECK-UNORDERED-NOT: vector.body

				; CHECK-NOT-VECTORIZED-LABEL: @multiple_fmuladds_two_operands
				; CHECK-NOT-VECTORIZED-NOT: vector.body

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
				%1 = load float, float* %arrayidx2, align 4
				%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
				%muladd2 = tail call float @llvm.fmuladd.f32(float %0, float %muladd, float %muladd)
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1

				for.end:
				ret float %muladd2
				}

				declare float @llvm.fmuladd.f32(float, float, float)

	!0 = distinct !{!0, !5, !9, !11}
	!1 = distinct !{!1, !5, !10, !11}
	!2 = distinct !{!2, !6, !9, !11}
	!3 = distinct !{!3, !7, !9, !11, !12}
	!4 = distinct !{!4, !8, !10, !11}
	!5 = !{!"llvm.loop.vectorize.width", i32 8}
	!6 = !{!"llvm.loop.vectorize.width", i32 4}
	!7 = !{!"llvm.loop.vectorize.width", i32 2}
	!8 = !{!"llvm.loop.vectorize.width", i32 1}
	!9 = !{!"llvm.loop.interleave.count", i32 1}
	!10 = !{!"llvm.loop.interleave.count", i32 4}
	!11 = !{!"llvm.loop.vectorize.enable", i1 true}
	!12 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
	!13 = distinct !{!13, !6, !9, !11}

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll

.lr.ph: ; preds = %entry, %.lr.ph
br i1 %exitcond, label %._crit_edge, label %.lr.ph

._crit_edge: ; preds = %.lr.ph
%sum.0.lcssa = phi i32 [ %l9, %.lr.ph ]
%ret = trunc i32 %sum.0.lcssa to i8
ret i8 %ret
}

		; Test case when loop has a call to the llvm.fmuladd intrinsic.
		define float @reduction_fmuladd(float* %a, float* %b, i64 %n) {
		; CHECK-LABEL: @reduction_fmuladd(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N:%.*]], 4
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i64 [[N]], -4
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP6:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[TMP0]] to <4 x float>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
		; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds float, float* [[B:%.*]], i64 [[INDEX]]
		; CHECK-NEXT: [[TMP3:%.*]] = bitcast float* [[TMP2]] to <4 x float>*
		; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x float>, <4 x float>* [[TMP3]], align 4
		; CHECK-NEXT: [[TMP4:%.*]] = fmul <4 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
		; CHECK-NEXT: [[TMP5:%.*]] = call float @llvm.vector.reduce.fadd.v4f32(float -0.000000e+00, <4 x float> [[TMP4]])
		; CHECK-NEXT: [[TMP6]] = fadd float [[TMP5]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
		; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP40:![0-9]+]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi float [ [[TMP6]], [[MIDDLE_BLOCK]] ], [ 0.000000e+00, [[ENTRY]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
		; CHECK: for.body:
		; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
		; CHECK-NEXT: [[SUM_07:%.*]] = phi float [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[MULADD:%.*]], [[FOR_BODY]] ]
		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[A]], i64 [[IV]]
		; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[ARRAYIDX]], align 4
		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[B]], i64 [[IV]]
		; CHECK-NEXT: [[TMP9:%.*]] = load float, float* [[ARRAYIDX2]], align 4
		; CHECK-NEXT: [[MULADD]] = tail call float @llvm.fmuladd.f32(float [[TMP8]], float [[TMP9]], float [[SUM_07]])
		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
		; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
		; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP41:![0-9]+]]
		; CHECK: for.end:
		; CHECK-NEXT: [[MULADD_LCSSA:%.*]] = phi float [ [[MULADD]], [[FOR_BODY]] ], [ [[TMP6]], [[MIDDLE_BLOCK]] ]
		; CHECK-NEXT: ret float [[MULADD_LCSSA]]

		entry:
		br label %for.body

		for.body:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
		%sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
		%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
		%0 = load float, float* %arrayidx, align 4
		%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
		%1 = load float, float* %arrayidx2, align 4
		%muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
		%iv.next = add nuw nsw i64 %iv, 1
		%exitcond.not = icmp eq i64 %iv.next, %n
		br i1 %exitcond.not, label %for.end, label %for.body

		for.end:
		ret float %muladd
		}

		declare float @llvm.fmuladd.f32(float, float, float)

!6 = distinct !{!6, !7, !8}
!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
!8 = !{!"llvm.loop.vectorize.enable", i1 true}

[LoopVectorize] Add vector reduction support for fmuladd intrinsic
Closed, Public

Revision Contents

Diff 389416

llvm/include/llvm/Analysis/IVDescriptors.h

llvm/lib/Analysis/IVDescriptors.cpp

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

llvm/lib/Transforms/Utils/LoopUtils.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
