This is an archive of the discontinued LLVM Phabricator instance.

[Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
ClosedPublic

Authored by david-arm on Aug 18 2021, 5:00 AM.


Details

Reviewers

sdesmalen
kmclaughlin
dmgreen

Commits

rG219d4518fce9: [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive

Summary

For tight loops like this:

float r = 0;
for (int i = 0; i < n; i++) {
  r += a[i];
}

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the number of instructions in the generated
code ends up being comparable to not vectorising at all, there may be
additional costs on some CPUs, for example because the scheduling may be
worse. It makes sense to deter vectorisation in tight loops.
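
As an illustration of why the reduction in this loop has to stay ordered when reassociation is not allowed, here is a minimal sketch comparing a strict in-order float sum with a sum that accumulates into independent lanes. The function names (orderedSum, reassociatedSum) and the input values are hypothetical and chosen only for this example; they are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strict in-order sum: same evaluation order as the scalar loop above.
float orderedSum(const std::vector<float> &a) {
  float r = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    r += a[i];
  return r;
}

// Reassociated sum: element i is accumulated into lane (i % lanes), and the
// lanes are combined at the end. This mimics what an unordered vector
// reduction would be free to do.
float reassociatedSum(const std::vector<float> &a, std::size_t lanes) {
  assert(lanes > 0 && "need at least one lane");
  std::vector<float> partial(lanes, 0.0f);
  for (std::size_t i = 0; i < a.size(); ++i)
    partial[i % lanes] += a[i];
  float r = 0.0f;
  for (float p : partial)
    r += p;
  return r;
}
```

With the input {1e30f, 1.0f, -1e30f, 1.0f}, the in-order sum loses the first 1.0f to rounding, while the two-lane sum cancels the large values first and keeps both ones, so the two results differ. An ordered reduction must reproduce the in-order result, which is why it still performs the additions one after another even when the loads are vectorised.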

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Aug 18 2021, 5:00 AM

Herald added subscribers: hiraditya, kristof.beyls. Aug 18 2021, 5:00 AM

david-arm requested review of this revision. Aug 18 2021, 5:00 AM

Herald added a project: Restricted Project. Aug 18 2021, 5:00 AM

Herald added a subscriber: llvm-commits.

david-arm added a child revision: D106653: [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64. Aug 18 2021, 5:01 AM

Yeah, this sounds sensible to me. We still vectorize when there starts to be a clear advantage of using other vector operations.
Looks good to me. Thanks.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2005–2007	I don't know if we need to talk about this in terms of scheduling exactly - that will be very dependent on the cpu used. Perhaps just describe it in terms of "extra overheads on some cpus"

This revision is now accepted and ready to land. Aug 18 2021, 5:41 AM

Harbormaster completed remote builds in B120105: Diff 367176. Aug 18 2021, 5:56 AM

david-arm edited the summary of this revision. Aug 18 2021, 8:25 AM

Closed by commit rG219d4518fce9: [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive (authored by david-arm). Aug 18 2021, 9:02 AM

This revision was automatically updated to reflect the committed changes.

david-arm marked an inline comment as done.

david-arm added a commit: rG219d4518fce9: [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive.

david-arm added inline comments. Aug 18 2021, 9:02 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2005–2007	I've updated this comment in the commit!

dmgreen mentioned this in D106653: [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64. Aug 18 2021, 9:27 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	9 lines
llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll	8 lines
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll	8 lines

Diff 367230

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

(1,993 preceding lines not shown; the hunk follows AArch64TTIImpl::getArithmeticReductionCostSVE)

   }
 }

 InstructionCost
 AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
                                            Optional<FastMathFlags> FMF,
                                            TTI::TargetCostKind CostKind) {
   if (TTI::requiresOrderedReduction(FMF)) {
-    if (!isa<ScalableVectorType>(ValTy))
-      return BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
+    if (auto *FixedVTy = dyn_cast<FixedVectorType>(ValTy)) {
+      InstructionCost BaseCost =
+          BaseT::getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
+      // Add on extra cost to reflect the extra overhead on some CPUs. We still
+      // end up vectorizing for more computationally intensive loops.
+      return BaseCost + FixedVTy->getNumElements();
+    }

     if (Opcode != Instruction::FAdd)
       return InstructionCost::getInvalid();

     auto *VTy = cast<ScalableVectorType>(ValTy);
     InstructionCost Cost =
         getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
     Cost *= getMaxNumElements(VTy->getElementCount());

(236 following lines not shown)

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll

 ; RUN: opt -cost-model -analyze -mtriple=aarch64--linux-gnu < %s | FileCheck %s

 define void @strict_fp_reductions() {
 ; CHECK-LABEL: strict_fp_reductions
-; CHECK-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
   %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)
   %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)
   %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)
   %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)

   ret void
 }

 declare float @llvm.vector.reduce.fadd.v4f32(float, <4 x float>)
 declare float @llvm.vector.reduce.fadd.v8f32(float, <8 x float>)
 declare double @llvm.vector.reduce.fadd.v2f64(double, <2 x double>)
 declare double @llvm.vector.reduce.fadd.v4f64(double, <4 x double>)
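
The updated expectations in reduce-fadd.ll follow directly from the new return value, BaseCost + FixedVTy->getNumElements(). The sketch below models only that arithmetic, outside of LLVM. The function name adjustedOrderedReductionCost is hypothetical, and plain unsigned integers stand in for InstructionCost; the base costs are the old values from the CHECK lines.

```cpp
#include <cassert>

// Hypothetical stand-in for the adjusted cost of a fixed-width ordered
// reduction: the previous (base) cost plus one extra unit per vector element.
unsigned adjustedOrderedReductionCost(unsigned baseCost, unsigned numElements) {
  assert(numElements > 0 && "a fixed-width vector has at least one element");
  return baseCost + numElements;
}
```

Applied to the four reductions in the test: v4f32 goes from 17 to 17 + 4 = 21, v8f32 from 34 to 34 + 8 = 42, v2f64 from 7 to 7 + 2 = 9, and v4f64 from 14 to 14 + 4 = 18. Because the surcharge scales with the element count, it matters most in tight loops where the reduction dominates the loop cost, and matters less once other vector operations contribute a clear saving.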

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll

 ; REQUIRES: asserts
 ; RUN: opt < %s -loop-vectorize -debug -disable-output -force-ordered-reductions=true -hints-allow-reordering=false \
 ; RUN:   -force-vector-width=4 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF4
 ; RUN: opt < %s -loop-vectorize -debug -disable-output -force-ordered-reductions=true -hints-allow-reordering=false \
 ; RUN:   -force-vector-width=8 -force-vector-interleave=1 -S 2>&1 | FileCheck %s --check-prefix=CHECK-VF8

 target triple="aarch64-unknown-linux-gnu"

-; CHECK-VF4: Found an estimated cost of 17 for VF 4 For instruction: %add = fadd float %0, %sum.07
-; CHECK-VF8: Found an estimated cost of 34 for VF 8 For instruction: %add = fadd float %0, %sum.07
+; CHECK-VF4: Found an estimated cost of 21 for VF 4 For instruction: %add = fadd float %0, %sum.07
+; CHECK-VF8: Found an estimated cost of 42 for VF 8 For instruction: %add = fadd float %0, %sum.07

 define float @fadd_strict32(float* noalias nocapture readonly %a, i64 %n) {
 entry:
   br label %for.body

 for.body:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
   %sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
   %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
   %0 = load float, float* %arrayidx, align 4
   %add = fadd float %0, %sum.07
   %iv.next = add nuw nsw i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, %n
   br i1 %exitcond.not, label %for.end, label %for.body

 for.end:
   ret float %add
 }


-; CHECK-VF4: Found an estimated cost of 14 for VF 4 For instruction: %add = fadd double %0, %sum.07
-; CHECK-VF8: Found an estimated cost of 28 for VF 8 For instruction: %add = fadd double %0, %sum.07
+; CHECK-VF4: Found an estimated cost of 18 for VF 4 For instruction: %add = fadd double %0, %sum.07
+; CHECK-VF8: Found an estimated cost of 36 for VF 8 For instruction: %add = fadd double %0, %sum.07

 define double @fadd_strict64(double* noalias nocapture readonly %a, i64 %n) {
 entry:
   br label %for.body

 for.body:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
   %sum.07 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]

(10 following lines not shown)
