This is an archive of the discontinued LLVM Phabricator instance.

Differential D147720

[LV] Use the known trip count when costing non-tail folded VFs
Closed, Public

Authored by dmgreen on Apr 6 2023, 9:29 AM.

Details

Reviewers

fhahn
Ayal
SjoerdMeijer
sdesmalen
david-arm
bmahjour

Commits

rG1869a9c225c7: [LV] Use the known trip count when costing non-tail folded VFs

Summary

Now that we store the ScalarCost in the VectorizationFactor it is possible to use it to get a slightly more accurate cost in isMoreProfitable between two vector factors. This extends the logic added in D101726 to non-tail-folded cases, using the costs of VecCost * (TripCount / VF) + ScalarCost * (TripCount % VF) to compare VFs where the TripCount is known.

This shouldn't change much, as loops with small trip counts are usually not vectorized, but it does help in the test case, where 4 * VF4 is now chosen as more profitable than 2 * VF8 + 4 * scalar.
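As a rough illustration of the comparison described in the summary, the sketch below evaluates both formulas for a known trip count. `totalCost` is a hypothetical helper, not the code in `LoopVectorize.cpp`, and the per-iteration costs in the usage note are made-up placeholders rather than values from the X86 cost model.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the LLVM implementation: total loop-body cost for
// a known trip count TC at vectorization factor VF.
uint64_t totalCost(uint64_t TC, uint64_t VF, uint64_t VecCost,
                   uint64_t ScalarCost, bool FoldTail) {
  assert(VF > 0 && "VF must be non-zero");
  if (FoldTail)
    return VecCost * ((TC + VF - 1) / VF); // ceil(TC / VF) masked iterations
  // floor(TC / VF) vector iterations plus TC % VF scalar remainder iterations.
  return VecCost * (TC / VF) + ScalarCost * (TC % VF);
}
```

With a trip count of 20 and assumed costs of 4 per VF4 iteration, 7 per VF8 iteration and 2 per scalar iteration, VF8 looks cheaper per lane (7/8 vs 4/4), yet `totalCost(20, 4, 4, 2, false)` is 20 while `totalCost(20, 8, 7, 2, false)` is 22, so VF4 wins once the scalar remainder is counted.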

Diff Detail

Event Timeline

dmgreen created this revision. Apr 6 2023, 9:29 AM

Herald added a project: Restricted Project. Apr 6 2023, 9:29 AM

Herald added subscribers: StephenFan, hiraditya.

dmgreen requested review of this revision. Apr 6 2023, 9:29 AM

Herald added a subscriber: pcwang-thead.

dmgreen mentioned this in D142015: [LV] Plan with and without FoldTailByMasking. Apr 6 2023, 9:30 AM

Harbormaster completed remote builds in B224039: Diff 511429. Apr 6 2023, 10:35 AM

ping

david-arm added inline comments. Apr 19 2023, 6:46 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5368	Hi @dmgreen, perhaps I've missed something here but it looks like there is a divide-by-zero in the code? If `A.Width.getFixedValue()` is non-zero then we calculate `(CostA * divideCeil(MaxTripCount, A.Width.getFixedValue()))`, and if it is zero then we do `(CostA * (MaxTripCount / A.Width.getFixedValue()) + ...`

dmgreen added inline comments. Apr 21 2023, 5:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5368	Oh yeah, that was wrong. It was meant to be based on foldTailByMasking. That must have been confusing. I've rerun the testing and it still looks OK; this patch doesn't usually alter a lot.
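
To make the mix-up concrete, here is a small sketch contrasting the selector from the first diff with the intended one. `costKeyedOnWidth` and `costKeyedOnFoldTail` are hypothetical names, `divideCeil` is a local stand-in for the LLVM helper, and the costs used in the usage note are made up. The point is only that a fixed width is never zero, so the first version always takes the `divideCeil` branch, and its other branch could only be reached by dividing by zero.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for llvm::divideCeil so the sketch compiles on its own.
uint64_t divideCeil(uint64_t N, uint64_t D) { return (N + D - 1) / D; }

// Selector as written in the first diff: keyed on the width itself.
uint64_t costKeyedOnWidth(uint64_t TC, uint64_t VF, uint64_t VecCost,
                          uint64_t ScalarCost) {
  assert(VF > 0 && "a zero width would divide by zero in the else branch");
  return VF ? VecCost * divideCeil(TC, VF)
            : VecCost * (TC / VF) + ScalarCost * (TC % VF);
}

// Intended selector: keyed on whether the tail is folded by masking.
uint64_t costKeyedOnFoldTail(uint64_t TC, uint64_t VF, uint64_t VecCost,
                             uint64_t ScalarCost, bool FoldTail) {
  assert(VF > 0 && "VF must be non-zero");
  return FoldTail ? VecCost * divideCeil(TC, VF)
                  : VecCost * (TC / VF) + ScalarCost * (TC % VF);
}
```

For a trip count of 20 at VF8 with assumed costs of 7 (vector) and 2 (scalar), the width-keyed version returns 21 regardless of tail folding, whereas the intended version returns 22 when the tail is not folded.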

Update this to what it should have been. One of the tests has changed because 12345 as an i8 is really 57, which was too many scalar iterations to pick v16 over v8. I've changed it to 241 so that it keeps testing the same thing.

dmgreen added a child revision: D142015: [LV] Plan with and without FoldTailByMasking. Apr 24 2023, 2:31 AM

Harbormaster completed remote builds in B227665: Diff 516316. Apr 24 2023, 4:18 AM

sdesmalen added inline comments. Apr 24 2023, 7:35 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

5372–5376

nit: Is it worth using a lambda for this, e.g.

auto GetCostForTC = [MaxTripCount, this](unsigned VF, InstructionCost VectorCost,
                                         InstructionCost ScalarCost) {
  return foldTailByMasking()
             ? VectorCost * divideCeil(MaxTripCount, VF)
             : VectorCost * (MaxTripCount / VF) + ScalarCost * (MaxTripCount % VF);
};

auto RTCostA = GetCostForTC(A.Width.getFixedValue(), CostA, A.ScalarCost);
auto RTCostB = GetCostForTC(B.Width.getFixedValue(), CostB, B.ScalarCost);

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll

98 ↗

(On Diff #516316)

I'm curious why this test needed changing. What VF does it pick with 12345?

dmgreen added inline comments. Apr 24 2023, 7:38 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5372–5376	Sounds good I'll do that now.
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
98 ↗	(On Diff #516316)	12345 as an i8 is really 57, which left too many scalar iterations to pick v16 over v8. It is `3vf16 + 9vf1` vs `7vf8 + 1vf1`. I've changed it to 241, which is a valid i8 value, so that it keeps testing the same thing.
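
The arithmetic behind this explanation can be checked with a short sketch. `Split`, `splitTripCount` and `asI8` are hypothetical names used only for illustration; the only facts taken from the discussion are that the constant `12345` is stored in an `i8` and that VF16 is being compared against VF8.

```cpp
#include <cassert>
#include <cstdint>

// Number of vector iterations and leftover scalar iterations for a trip count.
struct Split {
  unsigned VectorIters;
  unsigned ScalarIters;
};

// Split a trip count TC into full vector iterations and a scalar remainder.
Split splitTripCount(unsigned TC, unsigned VF) {
  assert(VF > 0 && "VF must be non-zero");
  return {TC / VF, TC % VF};
}

// An i8 keeps only the low 8 bits of the constant (read here as unsigned).
unsigned asI8(unsigned Value) { return static_cast<uint8_t>(Value); }
```

Under these definitions `asI8(12345)` is 57, which splits into 3 vector plus 9 scalar iterations at VF16 and 7 plus 1 at VF8. The replacement value 241 fits in 8 bits and leaves a single scalar iteration at either width (15 + 1 at VF16, 30 + 1 at VF8).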

Update as suggested

LGTM!

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
98 ↗	(On Diff #516316)	Ha, I didn't spot it said `i8 12345`, that's silly. Your explanation makes sense, thanks for clarifying!

This revision is now accepted and ready to land. Apr 24 2023, 7:58 AM

This revision was landed with ongoing or failed builds. Apr 24 2023, 2:02 PM

Closed by commit rG1869a9c225c7: [LV] Use the known trip count when costing non-tail folded VFs (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG1869a9c225c7: [LV] Use the known trip count when costing non-tail folded VFs.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	31 lines
llvm/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll	22 lines

Diff 511429

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

 bool LoopVectorizationCostModel::isMoreProfitable(
     const VectorizationFactor &A, const VectorizationFactor &B) const {
   InstructionCost CostA = A.Cost;
   InstructionCost CostB = B.Cost;

   unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);

-  if (!A.Width.isScalable() && !B.Width.isScalable() && foldTailByMasking() &&
-      MaxTripCount) {
-    // If we are folding the tail and the trip count is a known (possibly small)
-    // constant, the trip count will be rounded up to an integer number of
-    // iterations. The total cost will be PerIterationCost*ceil(TripCount/VF),
-    // which we compare directly. When not folding the tail, the total cost will
-    // be PerIterationCost*floor(TC/VF) + Scalar remainder cost, and so is
-    // approximated with the per-lane cost below instead of using the tripcount
-    // as here.
-    auto RTCostA = CostA * divideCeil(MaxTripCount, A.Width.getFixedValue());
-    auto RTCostB = CostB * divideCeil(MaxTripCount, B.Width.getFixedValue());
+  if (!A.Width.isScalable() && !B.Width.isScalable() && MaxTripCount) {
+    // If the trip count is a known (possibly small) constant, the trip count
+    // will be rounded up to an integer number of iterations under
+    // FoldTailByMasking. The total cost in that case will be
+    // VecCost*ceil(TripCount/VF). When not folding the tail, the total
+    // cost will be VecCost*floor(TC/VF) + ScalarCost*(TC%VF). There will be
+    // some extra overheads, but for the purpose of comparing the costs of
+    // different VFs we can use this to compare the total loop-body cost
+    // expected after vectorization.
+    auto RTCostA =
+        A.Width.getFixedValue()
+            ? (CostA * divideCeil(MaxTripCount, A.Width.getFixedValue()))
+            : (CostA * (MaxTripCount / A.Width.getFixedValue()) +
+               A.ScalarCost * (MaxTripCount % A.Width.getFixedValue()));
+    auto RTCostB =
+        B.Width.getFixedValue()
+            ? (CostB * divideCeil(MaxTripCount, B.Width.getFixedValue()))
+            : (CostB * (MaxTripCount / B.Width.getFixedValue()) +
+               B.ScalarCost * (MaxTripCount % B.Width.getFixedValue()));

     return RTCostA < RTCostB;
   }

   // Improve estimate for the vector width if it is scalable.
   unsigned EstimatedWidthA = A.Width.getKnownMinValue();
   unsigned EstimatedWidthB = B.Width.getKnownMinValue();
   if (std::optional<unsigned> VScale = getVScaleForTuning(TheFunction, TTI)) {
     if (A.Width.isScalable())

llvm/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt < %s -passes=loop-vectorize -mcpu=corei7-avx -S -vectorizer-min-trip-count=21 | FileCheck %s

 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux"

 ;
 ; The source code for the test:
 ;
 ; void foo(ptr restrict A, ptr restrict B)
 ; {
 ; for (int i = 0; i < 20; ++i) A[i] += B[i];
 ; }
 ;

 ;
-; This loop will be vectorized, although the trip count is below the threshold, but vectorization is explicitly forced in metadata.
+; This loop will be vectorized, although the trip count is below the threshold, but
+; vectorization is explicitly forced in metadata. The trip count of 4 is chosen as
+; it more nicely divides the loop count of 20, produce a lower total cost.
 ;
 define void @vectorized(ptr noalias nocapture %A, ptr noalias nocapture readonly %B) {
 ; CHECK-LABEL: @vectorized(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds float, ptr [[TMP1]], i32 0
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x float>, ptr [[TMP2]], align 4, !llvm.access.group [[ACC_GRP0:![0-9]+]]
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x float>, ptr [[TMP2]], align 4, !llvm.access.group [[ACC_GRP0:![0-9]+]]
 ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, ptr [[TMP3]], i32 0
-; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, ptr [[TMP4]], align 4, !llvm.access.group [[ACC_GRP0]]
-; CHECK-NEXT: [[TMP5:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
-; CHECK-NEXT: store <8 x float> [[TMP5]], ptr [[TMP4]], align 4, !llvm.access.group [[ACC_GRP0]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
-; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x float>, ptr [[TMP4]], align 4, !llvm.access.group [[ACC_GRP0]]
+; CHECK-NEXT: [[TMP5:%.*]] = fadd fast <4 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
+; CHECK-NEXT: store <4 x float> [[TMP5]], ptr [[TMP4]], align 4, !llvm.access.group [[ACC_GRP0]]
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 20
 ; CHECK-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP1:![0-9]+]]
 ; CHECK: middle.block:
-; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 20, 16
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 20, 20
 ; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
 ; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 20, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; CHECK-NEXT: br label [[FOR_BODY:%.*]]
 ; CHECK: for.body:
 ; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
 ; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT: [[TMP7:%.*]] = load float, ptr [[ARRAYIDX]], align 4, !llvm.access.group [[ACC_GRP0]]
 ; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT: [[TMP8:%.*]] = load float, ptr [[ARRAYIDX2]], align 4, !llvm.access.group [[ACC_GRP0]]
 ; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[TMP7]], [[TMP8]]
 ; CHECK-NEXT: store float [[ADD]], ptr [[ARRAYIDX2]], align 4, !llvm.access.group [[ACC_GRP0]]
 ; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 20
-; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK: for.end:
 ; CHECK-NEXT: ret void
 ;
 entry:
 br label %for.body

 for.body:
 %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]