This is an archive of the discontinued LLVM Phabricator instance.

We had similar codegen matching problems for x86's horizontal ops and fixed them with DAGCombiner/ISel pattern matching.
If you don't do that, then I think you're still going to miss faddp opportunities if the source/IR is already in the form with a shuffle.
Example:

typedef float float2 __attribute__((ext_vector_type(2)));

float faddp(float2 x) {
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}

$ clang -O1 faddp.c -S -o - -target aarch64 -mllvm -disable-vector-combine 
faddp:                                  // @faddp
	dup	v1.2s, v0.s[1]
	fadd	v0.2s, v1.2s, v0.2s
	ret
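
For illustration, the shuffle form in the example above computes the same value as a plain two-lane sum, which is the shape a pairwise add produces directly. The following is a minimal, self-contained C++ sketch of that equivalence, not LLVM code; `Float2` and all helper names are hypothetical and exist only for this example.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a <2 x float> vector.
using Float2 = std::array<float, 2>;

// Models __builtin_shufflevector(x, x, 1, 1): copy lane 1 into both lanes.
Float2 splatLane1(const Float2 &X) { return {X[1], X[1]}; }

// Lane-wise vector add.
Float2 addLanes(const Float2 &A, const Float2 &B) {
  return {A[0] + B[0], A[1] + B[1]};
}

// Shuffle form from the example: (splat(x[1]) + x)[0].
float sumViaShuffle(const Float2 &X) { return addLanes(splatLane1(X), X)[0]; }

// Extract form: x[0] + x[1].
float sumViaExtracts(const Float2 &X) { return X[0] + X[1]; }

// Small self-check: both forms agree on a sample input.
void selfCheckShuffleForm() {
  const Float2 X{1.5f, 2.25f};
  assert(sumViaShuffle(X) == sumViaExtracts(X));
}
```

Because lane 0 of the shuffle form is `x[1] + x[0]`, a backend that only recognizes the extract form misses the shuffle form unless the matching is done on the DAG as well.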

Thanks @spatel. You're right that we miss that pattern, but it seems x86 currently misses it too (I don't read x86 assembly very well, so I might be wrong). Using your faddp example:

$  ./bin/clang -O1 ~/tmp/faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine
...
faddp:                                  # @faddp
        .cfi_startproc
# %bb.0:                                # %entry
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movaps  %xmm0, %xmm1
        shufps  $85, %xmm0, %xmm1               # xmm1 = xmm1[1,1],xmm0[1,1]
        addss   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

I did find scalarizeBinOpOfSplats in DAGCombiner but that doesn't seem to work here, nor do any of the other patterns in SimplifyVBinOp.

That said, it does seem to make sense to do this in DAGCombiner, thanks for the suggestion. I'll try that.

In D87231#2260558, @sanwou01 wrote:

Thanks @spatel. You're right that we miss that pattern, but it seems x86 currently misses it too (I don't read x86 assembly very well, so I might be wrong).

Horizontal math ops are a special case for x86 (not all targets support them and even fewer prefer them for performance), so we need to make a CPU subtarget adjustment to see if that example is working:

$ clang -O1 faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine -march=btver2
  vhaddps	%xmm0, %xmm0, %xmm0

I did find scalarizeBinOpOfSplats in DAGCombiner but that doesn't seem to work here, nor do any of the other patterns in SimplifyVBinOp.

The x86 horizontal transforms are specialized because the HW instructions themselves are weird - no sane target would ever create that functionality from scratch. :)
See "LowerToHorizontalOp" and "lowerAddSubToHorizontalOp" in X86ISelLowering.cpp.

That said, there may still be room to improve the cost models and/or usage here, but I'm not sure exactly how to adjust it. For example, we might match this pattern as a 2-way pairwise reduction?
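
As a rough picture of what a 2-way pairwise reduction means, here is a self-contained C++ sketch, not LLVM code. It assumes the usual lane layout of a vector pairwise/horizontal add (adjacent lanes of the first operand fill the low half of the result, adjacent lanes of the second operand fill the high half); `Float4`, `pairwiseAdd`, and `reducePairwise` are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a <4 x float> vector.
using Float4 = std::array<float, 4>;

// Adds adjacent lanes of A, then adjacent lanes of B.
Float4 pairwiseAdd(const Float4 &A, const Float4 &B) {
  return {A[0] + A[1], A[2] + A[3], B[0] + B[1], B[2] + B[3]};
}

// Reduces four lanes in two pairwise steps: (v0 + v1) + (v2 + v3).
float reducePairwise(const Float4 &V) {
  const Float4 Step1 = pairwiseAdd(V, V);         // {v0+v1, v2+v3, v0+v1, v2+v3}
  const Float4 Step2 = pairwiseAdd(Step1, Step1); // lane 0 holds the full sum
  return Step2[0];
}

// Small self-check on values that are exact in float.
void selfCheckPairwise() {
  const Float4 V{1.0f, 2.0f, 3.0f, 4.0f};
  assert(reducePairwise(V) == 10.0f);
}
```

Note that this association order, `(v0 + v1) + (v2 + v3)`, differs from a left-to-right sum. Floating-point addition is not associative, so the two orders are only interchangeable when reassociation is allowed.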

I think the best option would be to start generating reduction intrinsics in IR, ensure the cost models are accurate for them, and do all we can to coax the vectorizers into recognizing them (including partial reduction patterns). I've been playing whack-a-mole with improving HorizOp patterns in the backend for years now and it's not fun any more :-(

The plan to drop the experimental tags from the reduction intrinsics keeps getting delayed, I think due to a couple of minor issues: efficient non-power-of-2 type handling and inf/NaN handling for FP types are the ones that @spatel reminded me about recently.

RKSimon added inline comments. Sep 9 2020, 3:35 AM

llvm/test/CodeGen/AArch64/combine-vectors-faddp.ll
2	VectorCombine tests should be put in llvm\test\Transforms\VectorCombine

Thanks for the feedback. I agree that ideally we'd be generating reduction intrinsics in IR and matching that in the backends. I don't think the pairwise add can be represented with the current intrinsics though: we'd need a <2 x float> variant, or a predicated version of the <4 x float> intrinsic to do this for strict FP math, I believe.

So at least for the moment I'll continue playing whack-a-mole and match the pattern in AArch64 ISel lowering.

Rework to match faddp in AArch64 ISel lowering

sanwou01 retitled this revision from "[AArch64] ExtractElement is free when combined with pairwise add" to "[AArch64] Match pairwise fadd pattern". Sep 16 2020, 2:41 AM

sanwou01 edited the summary of this revision.

Harbormaster completed remote builds in B71851: Diff 292158. Sep 16 2020, 2:59 AM

dmgreen added inline comments. Sep 16 2020, 3:06 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11598 ↗	(On Diff #292158)	Could this apply equally for f16/f64 as well?

sanwou01 added inline comments. Sep 16 2020, 3:29 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11598 ↗	(On Diff #292158)	I think so. Looks like we're missing the f16 FADDP pattern in ISel so might as well add that too. Similar for i64 ADDP actually.

Extend to f16, f32, f64 and i64

Harbormaster completed remote builds in B71869: Diff 292195. Sep 16 2020, 5:51 AM

sanwou01 retitled this revision from "[AArch64] Match pairwise fadd pattern" to "[AArch64] Match pairwise add/fadd pattern". Sep 16 2020, 5:52 AM

sanwou01 edited the summary of this revision.

Thanks for making the extra fp16 patterns too. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11627 ↗	(On Diff #292195)	SDLoc(N) -> DL

This revision is now accepted and ready to land. Sep 16 2020, 9:56 AM

Fix for when there is no fp16 faddp + testing

Harbormaster completed remote builds in B72006: Diff 292484. Sep 17 2020, 6:56 AM

sanwou01 marked an inline comment as done. Sep 17 2020, 7:00 AM

sanwou01 added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11627 ↗	(On Diff #292195)	Huh, I missed those, thanks! This'll be fixed when I land this change.

sanwou01 marked 2 inline comments as done. Sep 17 2020, 7:06 AM

Committed as d5fd3d9b903e

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	32 lines
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	26 lines
llvm/test/CodeGen/AArch64/combine-vectors-faddp.ll	16 lines

Diff 290249

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

...
 unsigned AArch64TTIImpl::getCFInstrCost(unsigned Opcode,
                                         TTI::TargetCostKind CostKind) {
   if (CostKind != TTI::TCK_RecipThroughput)
     return Opcode == Instruction::PHI ? 0 : 1;
   assert(CostKind == TTI::TCK_RecipThroughput && "unexpected CostKind");
   // Branches are assumed to be predicted.
   return 0;
 }

+bool isPairwiseAdd(const Instruction *I) {
+  if (I->getOpcode() != Instruction::FAdd)
+    return false;
+
+  assert(I->getNumOperands() == 2);
+
+  unsigned SumIndices = 0;
+
+  for (int i = 0; i < 2; i++) {
+    const auto *Ext = dyn_cast<ExtractElementInst>(I->getOperand(i));
+
+    if (!Ext || !isa<ConstantInt>(Ext->getOperand(1)))
+      return false;
+
+    unsigned Index = cast<ConstantInt>(Ext->getOperand(1))->getZExtValue();
+
+    if (Index != 0 && Index != 1)
+      return false;
+
+    SumIndices += Index;
+  }
+
+  return SumIndices == 1;
+}
+
 int AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
                                        unsigned Index, const Instruction *I) {
   assert(Val->isVectorTy() && "This must be a vector type");

+  // The Extract is free if this is part of a pairwise add.
+  if (I && I->hasOneUse()) {
+    auto *SingleUser = cast<Instruction>(*I->user_begin());
+    if (isPairwiseAdd(SingleUser))
+      return 0;
+  }
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Val);

     // This type is legalized to a scalar type.
     if (!LT.second.isVector())
       return 0;

...
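
The index check in `isPairwiseAdd` above can be exercised in isolation. The following self-contained C++ sketch models only that check, without any LLVM types; `Operand`, `lane`, `notExtract`, and `isPairwiseAddModel` are hypothetical names, and the sketch assumes the opcode test has already passed.

```cpp
#include <cassert>

// Hypothetical model of one add operand: whether it is an extractelement
// with a constant lane index, and if so which lane.
struct Operand {
  bool IsConstExtract;
  unsigned Index;
};

Operand lane(unsigned Index) { return {true, Index}; }
Operand notExtract() { return {false, 0}; }

// Mirrors the index logic of the diff: both operands must be constant-index
// extracts of lane 0 or lane 1, and the indices must sum to 1, meaning one
// operand reads lane 0 and the other reads lane 1 (in either order).
bool isPairwiseAddModel(const Operand &A, const Operand &B) {
  const Operand Ops[2] = {A, B};
  unsigned SumIndices = 0;
  for (const Operand &Op : Ops) {
    if (!Op.IsConstExtract)
      return false;
    if (Op.Index != 0 && Op.Index != 1)
      return false;
    SumIndices += Op.Index;
  }
  return SumIndices == 1;
}

// Small self-check covering both operand orders.
void selfCheckPairwiseAddModel() {
  assert(isPairwiseAddModel(lane(0), lane(1)));
  assert(isPairwiseAddModel(lane(1), lane(0)));
}
```

Summing the indices is a compact way to accept `{0, 1}` and `{1, 0}` while rejecting `{0, 0}` and `{1, 1}`, since each index has already been restricted to 0 or 1.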

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

...
     VectorOpCost = TTI.getCmpSelInstrCost(Opcode, VecTy,
                                           CmpInst::makeCmpResultType(VecTy));
   }

   // Get cost estimates for the extract elements. These costs will factor into
   // both sequences.
   unsigned Ext0Index = cast<ConstantInt>(Ext0->getOperand(1))->getZExtValue();
   unsigned Ext1Index = cast<ConstantInt>(Ext1->getOperand(1))->getZExtValue();

-  int Extract0Cost =
+  // Use instruction context to calculate costs for the current pattern
+  int OldExtract0Cost = TTI.getVectorInstrCost(Instruction::ExtractElement,
+                                               VecTy, Ext0Index, Ext0);
+  int OldExtract1Cost = TTI.getVectorInstrCost(Instruction::ExtractElement,
+                                               VecTy, Ext1Index, Ext1);
+
+  // Get context-less costs for the ExtractElements in the replacement pattern
+  int NewExtract0Cost =
       TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy, Ext0Index);
-  int Extract1Cost =
+  int NewExtract1Cost =
       TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy, Ext1Index);

   // A more expensive extract will always be replaced by a splat shuffle.
   // For example, if Ext0 is more expensive:
   // opcode (extelt V0, Ext0), (ext V1, Ext1) -->
   // extelt (opcode (splat V0, Ext0), V1), Ext1
   // TODO: Evaluate whether that always results in lowest cost. Alternatively,
   // check the cost of creating a broadcast shuffle and shuffling both
   // operands to element 0.
-  int CheapExtractCost = std::min(Extract0Cost, Extract1Cost);
+  int MinNewExtractCost = std::min(NewExtract0Cost, NewExtract1Cost);
+  int MinOldExtractCost = std::min(OldExtract0Cost, OldExtract1Cost);

   // Extra uses of the extracts mean that we include those costs in the
   // vector total because those instructions will not be eliminated.
   int OldCost, NewCost;
   if (Ext0->getOperand(0) == Ext1->getOperand(0) && Ext0Index == Ext1Index) {
     // Handle a special case. If the 2 extracts are identical, adjust the
     // formulas to account for that. The extra use charge allows for either the
     // CSE'd pattern or an unoptimized form with identical values:
     // opcode (extelt V, C), (extelt V, C) --> extelt (opcode V, V), C
     bool HasUseTax = Ext0 == Ext1 ? !Ext0->hasNUses(2)
                                   : !Ext0->hasOneUse() || !Ext1->hasOneUse();
-    OldCost = CheapExtractCost + ScalarOpCost;
-    NewCost = VectorOpCost + CheapExtractCost + HasUseTax * CheapExtractCost;
+    OldCost = MinOldExtractCost + ScalarOpCost;
+    NewCost = VectorOpCost + MinNewExtractCost + HasUseTax * MinNewExtractCost;
   } else {
     // Handle the general case. Each extract is actually a different value:
     // opcode (extelt V0, C0), (extelt V1, C1) --> extelt (opcode V0, V1), C
-    OldCost = Extract0Cost + Extract1Cost + ScalarOpCost;
-    NewCost = VectorOpCost + CheapExtractCost +
-              !Ext0->hasOneUse() * Extract0Cost +
-              !Ext1->hasOneUse() * Extract1Cost;
+    OldCost = OldExtract0Cost + OldExtract1Cost + ScalarOpCost;
+    NewCost = VectorOpCost + MinNewExtractCost +
+              !Ext0->hasOneUse() * NewExtract0Cost +
+              !Ext1->hasOneUse() * NewExtract1Cost;
   }

   ConvertToShuffle = getShuffleExtract(Ext0, Ext1, PreferredExtractIndex);
   if (ConvertToShuffle) {
     if (IsBinOp && DisableBinopExtractShuffle)
       return true;

     // If we are extracting from 2 different indexes, then one operand must be
...
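
To see how the split between context-aware "old" costs and context-less "new" costs changes the outcome, here is a self-contained C++ sketch of the general-case formulas from the diff above. It is not LLVM code: the `Costs` struct and function names are hypothetical, the cost numbers used with it are made up rather than taken from any real target, both extracts are assumed to have a single use, and the strict `OldCost < NewCost` comparison is an assumption about how the result is consumed.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical bundle of the cost inputs used by the general case.
struct Costs {
  int OldExtract0, OldExtract1; // context-aware costs of the existing extracts
  int NewExtract0, NewExtract1; // context-less costs for the replacement
  int ScalarOp, VectorOp;
};

// OldCost for the general case: two extracts feeding a scalar op.
int oldCost(const Costs &C) {
  return C.OldExtract0 + C.OldExtract1 + C.ScalarOp;
}

// NewCost for the general case with single-use extracts: one vector op plus
// the cheaper extract (the extra-use terms are zero under that assumption).
int newCost(const Costs &C) {
  return C.VectorOp + std::min(C.NewExtract0, C.NewExtract1);
}

// Assumption: the scalar form is kept when it is strictly cheaper.
bool keepScalarForm(const Costs &C) { return oldCost(C) < newCost(C); }

// Small self-check with illustrative numbers only.
void selfCheckCosts() {
  // Extracts reported as free in context (the pairwise-add case).
  assert(keepScalarForm(Costs{0, 0, 2, 2, 1, 1}));
  // Same extracts costed without context on both sides.
  assert(!keepScalarForm(Costs{2, 2, 2, 2, 1, 1}));
}
```

With the illustrative numbers, free in-context extracts give an old cost of 1 against a new cost of 3, so the extract/extract form survives; costing the extracts at 2 on both sides gives 5 against 3, and the vector form would be chosen.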

llvm/test/CodeGen/AArch64/combine-vectors-faddp.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -vector-combine -mtriple aarch64-arm-none-eabi < %s | FileCheck %s

 define float @test_no_combine_for_faddp(<2 x float> %a) {
 ; CHECK-LABEL: @test_no_combine_for_faddp(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <2 x float> [[A:%.*]], <2 x float> undef, <2 x i32> <i32 1, i32 undef>
-; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x float> [[A]], [[SHIFT]]
-; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[TMP0]], i32 0
-; CHECK-NEXT: ret float [[TMP1]]
+; CHECK-NEXT: [[TMP0:%.*]] = extractelement <2 x float> [[A:%.*]], i32 0
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[A]], i32 1
+; CHECK-NEXT: [[TMP2:%.*]] = fadd float [[TMP0]], [[TMP1]]
+; CHECK-NEXT: ret float [[TMP2]]
 ;
 entry:
   %0 = extractelement <2 x float> %a, i32 0
   %1 = extractelement <2 x float> %a, i32 1
   %2 = fadd float %0, %1
   ret float %2
 }

 define float @test_no_combine_for_faddp_swapped(<2 x float> %a) {
 ; CHECK-LABEL: @test_no_combine_for_faddp_swapped(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <2 x float> [[A:%.*]], <2 x float> undef, <2 x i32> <i32 1, i32 undef>
-; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x float> [[SHIFT]], [[A]]
-; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[TMP0]], i64 0
-; CHECK-NEXT: ret float [[TMP1]]
+; CHECK-NEXT: [[TMP0:%.*]] = extractelement <2 x float> [[A:%.*]], i32 1
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[A]], i32 0
+; CHECK-NEXT: [[TMP2:%.*]] = fadd float [[TMP0]], [[TMP1]]
+; CHECK-NEXT: ret float [[TMP2]]
 ;
 entry:
   %0 = extractelement <2 x float> %a, i32 1
   %1 = extractelement <2 x float> %a, i32 0
   %2 = fadd float %0, %1
   ret float %2
 }

...


[AArch64] Match pairwise add/fadd pattern (Closed, Public)
