When vectorising loops using tail-folding and interleaving, we end up
with two back-to-back active.lane.mask intrinsic calls. Unfortunately,
this leads to poor codegen like this:
.LBB0_1:
  ...
  whilelo p1.b, x11, x1
  cset    ..., mi
  whilelo p0.b, x12, x1
  tbnz    ..., #0, .LBB0_1
This is because in AArch64InstrInfo::optimizeCondBranch we bail out if
we find a flag-setting operation between a CSINC and a TBNZW machine
node. However, in these cases nothing depends upon the flags set by
the second whilelo, so it is safe to move it above the first whilelo.
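Once the second whilelo is hoisted, the CSINC/TBNZW pair can fold into
a direct conditional branch on the first whilelo's flags. A sketch of
the expected shape (the exact instruction order and operands are my
assumption, not copied from actual output):

.LBB0_1:
  ...
  whilelo p0.b, x12, x1
  whilelo p1.b, x11, x1
  b.mi    .LBB0_1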
I've changed AArch64InstrInfo::optimizeCondBranch to allow a single
flag-setting operation between the CSINC and the TBNZW, provided we
can prove it is safe to move that operation above the first
flag-setting op.
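For illustration, here is a minimal sketch of the kind of legality
check this implies, written against the generic MachineInstr query
API; canHoistAboveFlagSetter is a hypothetical helper name and
FlagsReg an assumed parameter (AArch64::NZCV in practice), not code
taken from the patch:

  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/TargetRegisterInfo.h"

  using namespace llvm;

  // Hypothetical helper (not the patch itself): can the single
  // flag-setting instruction found between the CSINC and the TBNZW
  // be moved above the first flag-setting op?
  static bool canHoistAboveFlagSetter(const MachineInstr &Between,
                                      const MachineInstr &FlagSetter,
                                      const TargetRegisterInfo *TRI,
                                      unsigned FlagsReg) {
    // The hoisted instruction must not read the flags, since hoisting
    // would change which flags definition it sees.
    if (Between.readsRegister(FlagsReg, TRI))
      return false;

    // It must not define a register the first flag-setting op reads,
    // or the hoist would change that op's inputs.
    for (const MachineOperand &MO : FlagSetter.uses())
      if (MO.isReg() && Between.modifiesRegister(MO.getReg(), TRI))
        return false;

    // Nor may it read a register (other than the flags) that the
    // first op defines, since that value is only produced afterwards.
    for (const MachineOperand &MO : FlagSetter.defs())
      if (MO.isReg() && MO.getReg() != FlagsReg &&
          Between.readsRegister(MO.getReg(), TRI))
        return false;

    return true;
  }

In the whilelo case above, the hoisted whilelo also clobbers NZCV,
but that is fine: the first whilelo immediately redefines the flags
before the branch reads them.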