Blocks can be laid out such that a t2WhileLoopStart branches backwards. The architecture forbids this, so the loop cannot be converted into a low-overhead loop. This new pass checks for these cases and moves the target block, fixing up any fall-through that the move would otherwise break.
This change improves iterations per megahertz on some widely used industrial benchmarks, with no regressions.
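As a rough illustration of the block move described above, here is a minimal, self-contained sketch on a toy layout model. It does not use LLVM types and is not the code in this patch: `Block`, `Layout`, `indexOf`, `fixFallThroughs` and `fixBackwardWls` are hypothetical names, and placing the target directly after the block that holds the WLS is an assumption made for the example. The sketch makes a single sweep over the layout and does not check whether a move turns a WLS inside the moved block into a backward branch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a machine basic block (hypothetical, not an LLVM type).
struct Block {
  std::string Name;
  std::string WlsTarget;   // target of a while-loop-start branch, or empty
  std::string FallThrough; // block reached by falling off the end, or empty
  std::string Branch;      // explicit unconditional branch target, or empty
};

using Layout = std::vector<Block>;

// Returns the layout position of the named block, or L.size() if absent.
static std::size_t indexOf(const Layout &L, const std::string &Name) {
  for (std::size_t I = 0; I < L.size(); ++I)
    if (L[I].Name == Name)
      return I;
  return L.size();
}

// Turns every fall-through that no longer matches the layout into an
// explicit branch to the same destination.
static void fixFallThroughs(Layout &L) {
  for (std::size_t I = 0; I < L.size(); ++I) {
    Block &B = L[I];
    if (B.FallThrough.empty())
      continue;
    bool StillAdjacent = I + 1 < L.size() && L[I + 1].Name == B.FallThrough;
    if (!StillAdjacent) {
      B.Branch = B.FallThrough;
      B.FallThrough.clear();
    }
  }
}

// Moves each backward WLS target to just after the block holding the WLS.
// Returns true if the layout changed.
static bool fixBackwardWls(Layout &L) {
  bool Changed = false;
  for (std::size_t I = 0; I < L.size(); ++I) {
    if (L[I].WlsTarget.empty())
      continue;
    std::size_t T = indexOf(L, L[I].WlsTarget);
    if (T >= I)
      continue; // forward, missing, or self target: not handled here
    Block Moved = L[T];
    L.erase(L.begin() + T);         // the WLS block shifts to index I - 1
    L.insert(L.begin() + I, Moved); // so the target now sits at index I
    assert(indexOf(L, L[I - 1].WlsTarget) == I && "target must follow WLS");
    fixFallThroughs(L);
    Changed = true;
  }
  return Changed;
}
```

For a layout of `entry, exit, tail, preheader, body` in which `preheader` holds a WLS targeting `exit`, `exit` falls through to `tail`, and `preheader` falls through to `body`, the sketch produces `entry, tail, preheader, exit, body`. Both broken fall-throughs become explicit branches: `exit` to `tail` and `preheader` to `body`.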
Add one of these for the new pass, and call it in LLVMInitializeARMTarget.