Increases the instruction-count threshold for predicated tail-folding when the loop contains more than one comparison. The reason is that the "whileXX" and vector comparison instructions have a throughput of only one on the target CPU, so if there is not enough computation between the comparisons, the predicated code can be slower.
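As a rough illustration of the heuristic described above, the threshold can be scaled by the number of comparisons in the loop body. The sketch below uses hypothetical names and an assumed default of 15 for the threshold; it is not the committed code.

```cpp
// Illustrative sketch only: the names and the default of 15 are assumptions,
// not the actual AArch64TargetTransformInfo.cpp code.
#include <algorithm>

// Stand-in for the default of the -sve-tail-folding-insn-threshold option.
static constexpr unsigned SVETailFoldInsnThreshold = 15;

// Require proportionally more instructions when the loop body contains
// several compares, since the whileXX and vector compare instructions
// all contend for the same single-throughput pipeline.
bool hasEnoughWorkForPredicatedTailFolding(unsigned NumInsns,
                                           unsigned NumComparisons) {
  unsigned Threshold = SVETailFoldInsnThreshold * std::max(1u, NumComparisons);
  return NumInsns >= Threshold;
}
```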
Details
Reviewers: paulwalker-arm, david-arm
Diff Detail
Repository: rG LLVM Github Monorepo
Unit Tests
Time | Test
---|---
270 ms | x64 debian > Clang.Modules::stress1.cpp
1,630 ms | x64 windows > Clang.Modules::stress1.cpp
Event Timeline
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:3789

At first glance this feels a little brutal - doubling (or even tripling!) the threshold when there is an extra compare in the loop. Also, I would have thought that after adding one or two more compares we should hit a plateau for the threshold, because at that point the volume of compares is more likely to be the bottleneck once all the pipelines are filled up.

@igor.kirillov What loops have you tested this on, and have you established the minimum thresholds required to prevent tail-folding? I just wonder if, instead of multiplying the threshold, you could do something like this:

```cpp
unsigned AdditionalInsns = NumComparisons > 1 ? 5 : 0;
return NumInsns >= (SVETailFoldInsnThreshold + AdditionalInsns);
```

If you haven't done so already, then I think it's worth collecting some data on what thresholds are really needed.
If for this function we cannot produce a better heuristic based solely on the VF and the number of instructions, then we're really talking about needing to extend the cost model itself and ensure that such a model correctly costs all of the predicated logic, including the while-based control flow.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:3789

Brutal but effective :) If we have a loop like this:

```cpp
for (Index_type j = 0; j < N; ++j) {
  pout[j] = pin1[j] < pin2[j] ? pin1[j] : pin2[j];
}
```

then if N is big and the loop is executed only a few times, the throughput problem is completely overshadowed by slow memory accesses. If N is small and the loop is executed around N times, the throughput problem comes to the fore, and this loop can be up to 2 times slower. The heuristic I've just added ignores memory operations and the number of instructions BETWEEN comparisons, but I doubt that is worth modelling, at least for now.

To keep the disruption to benchmarks to a minimum, I would require 5 extra instructions for each comparison before allowing predicated tail-folding. (One regressed benchmark has 16 instructions and one extra comparison; the other has 24 instructions and 2 extra comparisons.) The question is how this code should behave when a user passes a different sve-tail-folding-insn-threshold value.
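Read as a concrete sketch, the per-comparison variant described in the reply above might look roughly like this; the default of 15 and all names are assumptions for illustration, not the actual patch.

```cpp
// Sketch of the "5 extra instructions per extra comparison" idea;
// the constant 15 and all names are assumptions for illustration.
static constexpr unsigned SVETailFoldInsnThreshold = 15;

bool hasEnoughWorkForPredicatedTailFolding(unsigned NumInsns,
                                           unsigned NumComparisons) {
  // With 16 insns / 1 extra compare or 24 insns / 2 extra compares (the
  // regressed benchmarks mentioned above) the adjusted threshold is not
  // met, so predicated tail-folding would be rejected for those loops.
  unsigned Extra = NumComparisons > 1 ? 5 * (NumComparisons - 1) : 0;
  return NumInsns >= SVETailFoldInsnThreshold + Extra;
}
```

Whatever form it takes, a user-supplied -sve-tail-folding-insn-threshold value would replace the assumed default here, which is exactly the open question raised at the end of the comment above.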