A set of microbenchmarks in llvm-test-suite (https://github.com/llvm/llvm-test-suite/pull/26), when run on an AArch64 platform, demonstrates that loop interleaving is beneficial once the interleaved part of the loop runs at least twice. Performance improves as the trip count increases and is best for trip counts that run only the vectorized and/or interleaved loop. For example, if VW is the vectorization width and IC is the interleaving count, loops with trip count TC > VW*IC*2 perform better with interleaving, and performance is best when TC is a multiple of VW*IC.
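To make the TC, VW and IC relationship concrete, the following is a minimal sketch of how a trip count divides between the vectorized/interleaved loop and the scalar remainder. `LoopSplit` and `splitTripCount` are hypothetical names for illustration, not LLVM APIs, and the sketch assumes no tail folding and no vectorized epilogue. The values VW = 4 and IC = 2 are assumed examples, not taken from the benchmarks.

```cpp
#include <cstdint>

// Hypothetical helper, not LLVM code: shows how a trip count TC is divided
// between the vectorized/interleaved loop and the scalar remainder loop.
struct LoopSplit {
  std::uint64_t MainIters;   // iterations of the vectorized/interleaved loop
  std::uint64_t ScalarIters; // elements left for the scalar remainder loop
};

// Each iteration of the main loop consumes VW * IC elements.
// Assumes VW >= 1 and IC >= 1 (a zero product would divide by zero).
constexpr LoopSplit splitTripCount(std::uint64_t TC, std::uint64_t VW,
                                   std::uint64_t IC) {
  return {TC / (VW * IC), TC % (VW * IC)};
}

// Assumed example values: VW = 4, IC = 2, so the main loop steps by 8.
// TC = 24 is a multiple of VW*IC: three main iterations, no scalar remainder.
static_assert(splitTripCount(24, 4, 2).MainIters == 3, "three main iterations");
static_assert(splitTripCount(24, 4, 2).ScalarIters == 0, "no remainder");
// TC = 20 still runs the main loop twice but leaves four scalar elements.
static_assert(splitTripCount(20, 4, 2).MainIters == 2, "two main iterations");
static_assert(splitTripCount(20, 4, 2).ScalarIters == 4, "four scalar elements");
```

In these terms, the observation above is that interleaving pays off once `MainIters` reaches 2, and that the best case is `ScalarIters == 0`.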
The current trip count threshold for allowing loop interleaving is 128, which seems arbitrarily high and uncorrelated with factors like VW, IC, or register pressure.
We have also found an example in an application benchmark compiled with PGO where a hot loop with a trip count of less than 24 shows a 40% regression because it was not interleaved: the profile-driven trip count was below the threshold of 128. Therefore, it seems reasonable to remove this threshold and instead use the trip count when computing the interleaving count.
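As a rough illustration of what deriving the interleaving count from the trip count could look like, the sketch below picks the largest power-of-two IC (up to a cap) for which the interleaved loop still runs at least twice. This is an assumed heuristic written for illustration only and is not the logic used in LLVM. `pickInterleaveCount` and `MaxIC` are hypothetical names, and the sketch ignores register pressure and other cost-model inputs. It needs C++14 or later because of the loop inside a `constexpr` function.

```cpp
#include <cstdint>

// Hypothetical heuristic, not the actual LLVM implementation: choose the
// largest power-of-two interleave count IC <= MaxIC such that the
// interleaved loop (step VW * IC) executes at least twice for trip count TC.
// Assumes VW >= 1 and MaxIC >= 1.
constexpr unsigned pickInterleaveCount(std::uint64_t TC, unsigned VW,
                                       unsigned MaxIC) {
  unsigned IC = 1;
  // Double IC while the doubled count still allows two full main iterations.
  while (IC * 2 <= MaxIC &&
         TC / (static_cast<std::uint64_t>(VW) * IC * 2) >= 2)
    IC *= 2;
  return IC;
}

// Assumed example values: VW = 4, MaxIC = 8.
// A short loop (TC = 24) is still interleaved, just with a smaller IC.
static_assert(pickInterleaveCount(24, 4, 8) == 2, "24 / (4*2) = 3 iterations");
// A long loop (TC = 128) reaches the cap.
static_assert(pickInterleaveCount(128, 4, 8) == 8, "capped at MaxIC");
// A very short loop (TC = 7) is not interleaved at all.
static_assert(pickInterleaveCount(7, 4, 8) == 1, "too short to interleave");
```

Under a fixed threshold of 128, the TC = 24 case would get no interleaving at all. Scaling IC with the trip count keeps short hot loops eligible.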