This patch extends the cost model for FMUL operations to account for
scalar FMA opportunities. If a scalar FMUL in the bundle has a single
FADD/FSUB user, it is very likely those instructions can be fused into a
single fmuladd/fmulsub operation, which makes the multiply effectively
free.
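The fusion check described above can be sketched as a standalone predicate. This is a hypothetical model, not the patch's actual code: `Op`, `Opcode`, and `isFusableFMul` are invented names, and real LLVM code would inspect `Instruction` users instead.

```cpp
#include <vector>

// Minimal stand-in for an IR operation: an opcode plus its users.
enum class Opcode { FMul, FAdd, FSub, Other };

struct Op {
  Opcode Kind;
  std::vector<const Op *> Users;
};

// An FMUL is considered fusable when it has exactly one user and that
// user is an FADD or FSUB, so the pair can likely become one fmuladd.
bool isFusableFMul(const Op &O) {
  if (O.Kind != Opcode::FMul || O.Users.size() != 1)
    return false;
  Opcode U = O.Users.front()->Kind;
  return U == Opcode::FAdd || U == Opcode::FSub;
}
```

The single-user requirement matters: if the FMUL result is also consumed elsewhere, the multiply must be materialized anyway, so it cannot be treated as free.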
The patch counts the number of fusable operations in a bundle. If all
entries in the bundle can be fused, it is likely that the resulting
vector instructions can also be fused. In this case, consider both
versions free and return the common cost, with a tiny bias towards
vectorizing.
Otherwise, only the fusable scalar FMULs are treated as free.
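The bundle-level cost adjustment can be sketched as follows. This is a minimal model under stated assumptions: `adjustForFMA`, the cost fields, and the exact bias value are hypothetical, not the patch's real interface.

```cpp
#include <vector>

// Scalar vs. vector cost of the FMULs in one SLP bundle (model only).
struct FMulCosts {
  int ScalarCost; // summed cost of the scalar FMULs
  int VectorCost; // cost of the vectorized FMUL
};

// Fusable[i] is true when scalar lane i has a single FADD/FSUB user.
// PerLaneCost is the assumed cost of one scalar FMUL.
FMulCosts adjustForFMA(FMulCosts C, const std::vector<bool> &Fusable,
                       int PerLaneCost) {
  int NumFusable = 0;
  for (bool F : Fusable)
    NumFusable += F;
  if (NumFusable == (int)Fusable.size()) {
    // All lanes fuse, so the vector FMUL will likely fuse as well.
    // Treat both versions as free, with a tiny bias towards
    // vectorizing (here modeled as scalar cost 1 vs. vector cost 0).
    C.ScalarCost = 1;
    C.VectorCost = 0;
  } else {
    // Only the fusable scalar FMULs are free; the vector cost stands.
    C.ScalarCost -= NumFusable * PerLaneCost;
  }
  return C;
}
```

For example, with a 4-wide bundle where only two lanes fuse, the scalar side keeps the cost of the two non-fusable multiplies while the vector cost is unchanged, so vectorization is no longer artificially favored.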
This fixes a regression in an application very sensitive to SLP changes
after 65c7cecb13f8d2132a54103903501474083afe8f, and overall improves its
performance by 10%. There is no other measurable impact on the other
applications in a large proprietary benchmark suite on ARM64.
Excessive SLP vectorization in the presence of scalar FMA opportunities
has also been discussed in D131028 and mentioned by @dmgreen.
D125987 also tries to address a similar issue, but with a focus on
horizontal reductions.
I'd appreciate it if someone could give this a test on the X86 side.
The cost-based analysis may lead to the wrong final decision here; it may be necessary to return something like a flag instead, or to implement this analysis in TTI. What if the cost of Intrinsic::fmuladd != TCC_Basic?