Previously, the LowOverheadLoops pass couldn't handle VPT blocks that used the vpt instruction, or loops containing multiple identical VCTPs.
This patch improves the LowOverheadLoops pass so it can handle those cases.
I'm still unsure about the changes in this patch, so comments/suggestions are welcome.
This patch will also need a follow-up ARMTargetTransformInfo change to take effect. In its current state, the TTI won't allow the vectorizer to tail-predicate loops that span more than one basic block or that contain compare instructions. Because VPT blocks are generated from comparisons (which create the predicate), such loops never reach this pass at the moment.
However, with the right changes to the TTI and the right compiler options, this patch makes it possible to generate code like this:
// C++
void test(int* A, int n, int x) {
  for(int i = 0; i < n; i++)
    if (A[i] < x && A[i] > -x)
      A[i] = 0;
}

// assembly
        dlstp.32 lr, r1
.LBB0_1:                        @ %vector.body
                                @ =>This Inner Loop Header: Depth=1
        vldrw.u32 q1, [r0]
        vptt.s32 lt, q1, r2
        vcmpt.s32 gt, q1, r3
        vstrwt.32 q0, [r0], #16
        letp lr, .LBB0_1
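For reviewers less familiar with tail predication, here is a rough scalar model of what the loop above is expected to compute. It is only a sketch and not part of the patch: `test_model` and `kLanes` are hypothetical names, and it assumes that r3 holds -x and q0 holds zeros, since the listing does not show how those registers are set up.

```cpp
#include <algorithm>
#include <array>

// Hypothetical scalar model of the tail-predicated loop; not part of the patch.
constexpr int kLanes = 4; // 32-bit lanes in one 128-bit MVE vector

void test_model(int* A, int n, int x) {
  const int neg_x = -x; // assumption: this is the value held in r3
  for (int base = 0; base < n; base += kLanes) {
    // dlstp/letp: only the first min(remaining, 4) lanes are live.
    const int live = std::min(n - base, kLanes);
    std::array<bool, kLanes> pred{};
    for (int l = 0; l < kLanes; ++l)
      pred[l] = l < live;
    // vptt.s32 lt, q1, r2: narrow the predicate to lanes where A[i] < x.
    for (int l = 0; l < kLanes; ++l)
      pred[l] = pred[l] && A[base + l] < x;
    // vcmpt.s32 gt, q1, r3: narrow it further to lanes where A[i] > -x.
    for (int l = 0; l < kLanes; ++l)
      pred[l] = pred[l] && A[base + l] > neg_x;
    // vstrwt.32 q0, [r0], #16: store zero to the surviving lanes only
    // (assumption: q0 is all zeros).
    for (int l = 0; l < kLanes; ++l)
      if (pred[l])
        A[base + l] = 0;
  }
}
```

Calling `test_model` on an array whose length is not a multiple of four exercises the tail case: the final iteration has fewer live lanes, and the short-circuit on `pred[l]` keeps the model from reading past the end of the array.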
Something's amiss here! This should have been caught by a test.