We were not able to determine the number of elements processed by the loop (the scalar loop iteration count) for loops with dependent iterators, so tail-predication was not triggering. The scalar loop iteration count is found by pattern matching the masked load/store instructions in the vector body that use this value, and the result is then checked against SCEV information to make sure it is correct. For this type of loop, the SCEV expression looks different and finding the actual trip count requires more work; most changes here are related to this.
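To make the role of the element count concrete, here is a minimal scalar emulation of a tail-predicated vector loop. It is only a sketch: the vector width `VF` and the function name `masked_sum` are hypothetical and are not taken from the pass itself.

```c
#include <assert.h>

#define VF 4 /* assumed vector width in lanes; illustrative only */

/* Emulates a tail-predicated vector loop. Each vector iteration covers up
   to VF lanes, and a lane is active only while its element index is below
   n, the element count (the scalar loop iteration count). */
static int masked_sum(const int *a, int n, int *vector_iters) {
    assert(n >= 0);
    int sum = 0;
    *vector_iters = 0;
    for (int base = 0; base < n; base += VF) {
        for (int lane = 0; lane < VF; lane++) {
            int active = (base + lane) < n; /* per-lane predicate */
            if (active)
                sum += a[base + lane];      /* masked load and accumulate */
        }
        (*vector_iters)++;
    }
    return sum;
}
```

With 10 elements and 4 lanes, the emulation runs 3 vector iterations and the last one has only 2 active lanes. Computing that predicate requires knowing `n`, which is why the element count has to be recovered.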

Cases where only the inner loop iterator receives values from its outer loop are now supported, as in this nested loop example:

  for (i = 0; i < N; i++)
    M = Size - i;
    for (j = 0; j < M; j++)

A 3D example like this is also supported, because the SCEV expression is the same:

  for (k = 0; k < N; k++)
    for (i = 0; i < N; i++)
      M = Size - i;
      for (j = 0; j < M; j++)

This covers most of the reduction kernels that we currently have.
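As an illustration of the supported shape, a reduction kernel might look like the sketch below. The function name `triangular_sum`, the `elements` output parameter, and the precondition `0 <= N <= Size` are assumptions made for this example only.

```c
#include <assert.h>

/* Reduction kernel of the supported shape: the inner bound M is derived
   from the outer iterator i. Hypothetical example; assumes 0 <= N <= Size
   and that a holds at least Size elements. */
static long triangular_sum(const int *a, int N, int Size, long *elements) {
    assert(N >= 0 && N <= Size);
    long sum = 0;
    *elements = 0;
    for (int i = 0; i < N; i++) {
        int M = Size - i;   /* inner element count for this value of i */
        for (int j = 0; j < M; j++)
            sum += a[j];
        *elements += M;     /* elements processed by the inner loop */
    }
    return sum;
}
```

For each outer iteration the inner loop processes `M = Size - i` elements, and `M` is the value that tail-predication needs for the vectorised inner loop.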

TODO:

The general case where any inner loop iterator can depend on its outer loop is not yet supported. For example, here i is initialised with k, and j is initialised with the value from its parent loop i:

  for (k = 0; k < N; k++)
    for (i = k; i < N; i++)
      for (j = i; j < M; j++)
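One way to see why this case is harder: when the innermost loop starts at `i` rather than 0, its iteration count depends on both `i` and `M`, and it must not go negative when `i >= M`. The sketch below compares a counted loop with the closed form `smax(M, i) - i`, which contains the kind of term SCEV represents as scSMaxExpr. The helper names are hypothetical, and the exact expression SCEV produces may differ from this closed form.

```c
#include <assert.h>

/* Counts the iterations of: for (j = i; j < M; j++) by running the loop. */
static int count_by_running(int i, int M) {
    int n = 0;
    for (int j = i; j < M; j++)
        n++;
    return n;
}

/* Closed form of the same count: smax(M, i) - i.
   The signed max clamps the result to 0 when the loop does not execute. */
static int count_closed_form(int i, int M) {
    int smax = (M > i) ? M : i;
    return smax - i;
}
```

For example, `i = 3, M = 10` gives 7 iterations, while `i = 7, M = 4` gives 0.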

The reason this is not yet supported is that pattern matching this SCEV is unwieldy: it almost requires a general SCEV visitor, because it involves the scAddExpr, scAddRecExpr, scUMaxExpr, and scSMaxExpr SCEV types, and it would still not be very general. Instead, as a follow-up, we would like to emit the scalar iteration count with an intrinsic, similar to how this is done for the hardware-loop instruction. We can then simply pick that up here and no longer need all this pattern matching.