This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][LoopVectorize] Improve tail-folding heuristic on neoverse-v1
AbandonedPublic

Authored by igor.kirillov on Jul 24 2023, 5:28 AM.

Details

Summary

Increase the instruction-count threshold for predicated tail-folding
when the loop has more than one comparison. The reason for this is that
the "whileXX" and vector comparison instructions have a throughput of
only one on that CPU, so if there is not enough computation between
the comparisons, the code can slow down.

Diff Detail

Event Timeline

igor.kirillov created this revision. Jul 24 2023, 5:28 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 24 2023, 5:28 AM
igor.kirillov requested review of this revision. Jul 24 2023, 5:28 AM
david-arm added inline comments. Jul 25 2023, 1:34 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
3789

At first glance this feels a little brutal - doubling (or even tripling!) the threshold when there is an extra compare in the loop. Also, I would have thought that after adding one or two more compares we should really hit a plateau for the threshold, because at that point the volume of compares is more likely to be the bottleneck after filling up all the pipelines?

@igor.kirillov What loops have you tested this on and have you established what the minimum thresholds required to prevent tail-folding are? I just wonder if instead of multiplying the threshold you can actually do something like this:

unsigned AdditionalInsns = NumComparisons > 1 ? 5 : 0;
return NumInsns >= (SVETailFoldInsnThreshold + AdditionalInsns);

If you haven't done so already, then I think it's worth collecting some data on what thresholds are really needed.

If for this function we cannot produce a better heuristic based solely on VF and the number of instructions, then we're really talking about needing to extend the cost model itself and ensure that such a model correctly costs all the predicated logic, which includes the while-based control flow.

igor.kirillov added inline comments. Jul 25 2023, 4:20 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
3789

Brutal but effective :)
It's hard to make a simple yet precise heuristic. I tried to see how many computational instructions we need after a comparison to make this problem disappear, and it is around ten fmul/fadd instructions. If there are memory access instructions among them, we need fewer.

If we have a loop like this:

for (Index_type j = 0; j < N; ++j) {
  pout[j] = pin1[j] < pin2[j] ? pin1[j] : pin2[j];
}

Then if N is big and this loop is executed just a few times, the throughput problem is completely overshadowed by slow memory accesses. If N is small and the loop is executed around N times, then the throughput problem comes to the fore, and this loop can be up to 2 times slower.

The heuristic I've just added ignores memory operations and the number of instructions BETWEEN comparisons, but I doubt accounting for them is worth it, at least for now.

To minimise the disruption to benchmarks, I would require 5 extra instructions for each comparison before allowing predicated tail-folding. (One regressed benchmark has 16 instructions and one extra comparison, and the other has 24 instructions and 2 extra comparisons.) The question is how this code should behave when a user passes a different sve-tail-folding-insn-threshold value.

Matt added a subscriber: Matt. Jul 25 2023, 2:27 PM
igor.kirillov abandoned this revision. Jul 31 2023, 2:51 AM