This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][LoopVectorize] Improve tail-folding heuristic on neoverse-v1
AbandonedPublic

Authored by igor.kirillov on Jul 24 2023, 5:28 AM.

Details

Summary

Increase the instruction-count threshold for predicated tail-folding
when the loop has more than one comparison. The reason for this is that
the "whileXX" and vector comparison instructions have a throughput of
only one on that CPU, so if there is not enough computation between
the comparisons, the code can slow down.

Diff Detail

Event Timeline

igor.kirillov created this revision. Jul 24 2023, 5:28 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 24 2023, 5:28 AM
igor.kirillov requested review of this revision. Jul 24 2023, 5:28 AM
david-arm added inline comments. Jul 25 2023, 1:34 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
3789

At first glance this feels a little brutal - doubling (or even tripling!) the threshold when there is an extra compare in the loop. Also, I would have thought that after adding one or two more compares we should really hit a plateau for the threshold, because at that point the volume of compares is more likely to be the bottleneck after filling up all the pipelines?

@igor.kirillov What loops have you tested this on and have you established what the minimum thresholds required to prevent tail-folding are? I just wonder if instead of multiplying the threshold you can actually do something like this:

unsigned AdditionalInsns = NumComparisons > 1 ? 5 : 0;
return NumInsns >= (SVETailFoldInsnThreshold + AdditionalInsns);

If you haven't done so already, then I think it's worth collecting some data on what thresholds are really needed.

If for this function we cannot produce a better heuristic based solely on VF and the number of instructions, then we're really talking about needing to extend the cost model itself and ensure that such a model correctly costs all the predicated logic, which includes the while-based control flow.

igor.kirillov added inline comments. Jul 25 2023, 4:20 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
3789

Brutal but effective :)
It's hard to make a simple yet precise heuristic. I tried to see how many computational instructions we need after a comparison to make this problem disappear, and it is around ten fmul/fadd instructions. If there are memory access instructions among them, we need fewer.

If we have a loop like this:

for (Index_type j = 0; j < N; ++j) {
  pout[j] = pin1[j] < pin2[j] ? pin1[j] : pin2[j];
}

Then if N is big and this loop is executed just a few times, the throughput problem is completely overshadowed by slow memory accesses. If N is small and the loop is executed around N times, then the throughput problem comes to the fore, and this loop can be up to 2 times slower.

The heuristic I've just added ignores memory operations and the number of instructions BETWEEN comparisons, but I doubt accounting for them is worth it, at least for now.

To minimise the disruption to benchmarks, I would require 5 extra instructions for each comparison before allowing predicated tail-folding. (One regressed benchmark has 16 instructions and one extra comparison, and the other has 24 instructions and 2 extra comparisons.) The question is how this code should behave when a user passes a different sve-tail-folding-insn-threshold value.

Matt added a subscriber: Matt. Jul 25 2023, 2:27 PM
igor.kirillov abandoned this revision. Jul 31 2023, 2:51 AM