Right now, LoopVectorizer has a hard limit on the number of runtime memory checks.
The limit is currently 8, and while it generally works reasonably well,
it is, like all such limits, arbitrary.
There are several problems with it:
- It puts a hard cap on the complexity of the loop it will vectorize. Naturally, the more pointer arithmetic/"objects" you have, the more checks are generally needed
- The number of runtime memory checks doesn't actually correlate with the overhead incurred by them. I've checked locally, and a single check can have a cost from 4 to 25...
- Why do we have this hard limit anyways? I guess because we want to avoid generating too many checks?
- How do we come up with the current limit?
Therefore, i would like to propose to completely change the approach here,
and to instead specify the budget for said checks as a multiple of the cost
of a single iteration of the original scalar loop.
That is, if the cost of a single iteration of the original scalar loop is 10,
and the Multiple is 2, then the budget for the runtime checks is 10*2 = 20.
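The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: `runtimeCheckBudget` and `checksFitBudget` are hypothetical helper names, not LLVM API, and costs are modelled as plain unsigned integers rather than LLVM's cost type.

```cpp
#include <cstdint>

// Hypothetical helper (not LLVM API): the budget for runtime checks,
// expressed as a multiple of the cost of one iteration of the original
// scalar loop.
constexpr std::uint64_t runtimeCheckBudget(std::uint64_t ScalarIterCost,
                                           std::uint64_t Multiple) {
  return ScalarIterCost * Multiple;
}

// Hypothetical helper: the checks are acceptable when their total cost
// does not exceed the budget, i.e. vectorization is rejected only when
// ChecksCost > Multiple * ScalarIterCost.
constexpr bool checksFitBudget(std::uint64_t ChecksCost,
                               std::uint64_t ScalarIterCost,
                               std::uint64_t Multiple) {
  return ChecksCost <= runtimeCheckBudget(ScalarIterCost, Multiple);
}

// The example from the text: scalar iteration cost 10, Multiple 2.
static_assert(runtimeCheckBudget(10, 2) == 20, "budget is 10*2 = 20");
```

With these assumptions, checks costing 20 would still be accepted for that loop, while checks costing 21 would not, regardless of how many individual checks there are.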
So far, i have looked for the optimal value for this threshold on RawSpeed and darktable,
and the results may be interesting:
https://docs.google.com/spreadsheets/d/1b3VPU1tPYGq0AO3XH3kBv3zdpKMby8aJzFl2cLSZ5AQ/edit?usp=sharing
Just to preserve all the existing vectorizations, we'd need to allow the cost of run-time checks
to be not greater than the cost of 6 iterations of scalar loop.
I know pretty much all of the code there should vectorize, because i (re)wrote most of it.
Originally, it was just manually vectorized with SSE2, but i've added plain fallbacks.
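One way to derive such a threshold from logged (RTCost, ScalarLoopCost) pairs is sketched below. The `LoopSample` struct and `smallestPreservingMultiple` function are hypothetical names for illustration, the multiple is assumed to be an integer, and no numbers from the spreadsheet are reproduced here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record of one vectorized loop: the total cost of its
// runtime checks and the cost of one iteration of the scalar loop.
struct LoopSample {
  std::uint64_t RTCost;
  std::uint64_t ScalarLoopCost;
};

// Returns the smallest integer M such that RTCost <= M * ScalarLoopCost
// holds for every sample, i.e. the smallest multiple that would keep all
// of the given loops vectorized. Returns 0 for an empty input.
std::uint64_t
smallestPreservingMultiple(const std::vector<LoopSample> &Samples) {
  std::uint64_t Multiple = 0;
  for (const LoopSample &S : Samples) {
    if (S.ScalarLoopCost == 0)
      continue; // Degenerate sample; no finite multiple can be derived.
    // Ceiling division: round RTCost / ScalarLoopCost up.
    std::uint64_t Needed =
        (S.RTCost + S.ScalarLoopCost - 1) / S.ScalarLoopCost;
    Multiple = std::max(Multiple, Needed);
  }
  return Multiple;
}
```

For instance, with made-up samples whose fractions are 2.0, 5.5 and 0.5, the function returns 6, since the loop with fraction 5.5 needs its budget rounded up to 6 scalar iterations.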
This is motivated by the bug report https://bugs.llvm.org/show_bug.cgi?id=44662 that i filed
almost two years ago now. The code is inspired by/based on the code by @fhahn in D75981,
but unfortunately that patch is rather stuck, and the vectorization area of llvm appears to be
a walled garden without many outside-of-the-club contributions, with the latter being busy,
so i don't have much hope here :S
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 8a0999ddb98c..f4495cba57f5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8186,6 +8186,12 @@ LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC,
   // Check if it is profitable to vectorize with runtime checks.
   if (SelectedVF.Width.getKnownMinValue() > 1 &&
       Requirements.getNumRuntimePointerChecks()) {
+    errs() << "LV LAA num " << Requirements.getNumRuntimePointerChecks()
+           << " RTCost " << Checks.getCost(CM) << " ScalarLoopCost "
+           << SelectedVF.ScalarCost.getValue().getValue() << " fraction "
+           << (double)Checks.getCost(CM) /
+                  SelectedVF.ScalarCost.getValue().getValue()
+           << "\n";
     if (Checks.getCost(CM) >
         VectorizeMemoryCheckFactor * (*SelectedVF.ScalarCost.getValue())) {
       ORE->emit([&]() {
clang-format: please reformat the code