Currently the vectorizer rejects short trip count loops in the presence of run-time checks. That is caused by the cost model's inability to take the overhead of run-time checks into account. While the ideal solution would be to calculate the entire cost of the VPlan, there is no infrastructure in place for that. I see two possibilities to mitigate this:
- Make a dry run of SCEV expansion during the main phase of cost modeling: instead of performing real SCEV expansion, calculate the cost such an expansion would have.
- Defer the overhead calculation for run-time checks, and the final decision, until after the run-time checks have been expanded. If the overhead turns out to make vectorization unprofitable, emit an unconditional bypass of the vectorized loop and rely on subsequent optimizations to remove the trivially dead code.
While 1) may look like the better approach, it is very problematic in practice, which is why I decided to go with 2). A sketch of the kind of loop this targets is given below.
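
For illustration only, here is a hypothetical C-style example (not taken from the patch) of the kind of loop affected: a short, compile-time-unknown trip count over pointer arguments, where vectorization requires run-time checks proving the pointers do not alias.

  // Hypothetical example, not from the patch. To vectorize this loop the
  // compiler must emit run-time aliasing checks for 'a' and 'b'. Today the
  // vectorizer gives up on such short trip count loops because the cost
  // model cannot weigh the checks' overhead.
  void axpy(double *a, const double *b, double alpha, int n) {
    // n is small (e.g. 9, as in the DSPR benchmark below)
    for (int i = 0; i < n; ++i)
      a[i] += alpha * b[i];
  }

Conceptually, with approach 2) the checks are expanded first, their cost is measured, and only then is the final decision made: the expanded code is guarded as "if (checks pass) run vector loop, else run scalar loop", and if the measured check overhead makes vectorization unprofitable, the guard is turned into an unconditional branch to the scalar loop, leaving the dead vector body for later passes to clean up.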
Please note that loops vectorized before the change remain vectorized in the same way after it. Only short trip count loops that were not vectorized before may now get vectorized (if the cost model proves it profitable). Thus the change has a fairly narrow scope and is not expected to have a broad performance impact, either positive or negative.
The main motivation for the change is a performance gain on a benchmark based on a publicly available linear algebra library (https://github.com/fommil/netlib-java). The following simple loop over the DSPR method from Netlib shows a +55% gain on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz:
for (int i = 0; i < 1000; i++) {
  blas.dspr("l", 9, 1.0, vectorD, 1, result);
}
I measured the performance impact on the LLVM test-suite, but the results are hard to interpret. I ran the original and patched compilers 3 times each, and the results are unstable: even between consecutive runs of the original compiler I observe up to +/-100% variation, despite running on a machine dedicated to performance testing with no other applications active.
In addition, I evaluated performance on SpecJVM98, DaCapo 9.12, and Netlib. All measurements were performed on Linux with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. No performance changes were detected; I can share the numbers if anyone wants to see them.
Note that the feature is disabled by default for now. It will be enabled downstream first to get extensive testing on a wider range of real-world applications.
I think that "forced" check should be done before this, and if it is forced, no further checks needed. With that, conrol flow within if branches should simplify a lot.