Perform loop unrolling differently for cores with low-overhead branches:
- Don't unroll the remainder loop, with the expectation that both the unrolled loop and the remainder loop will each be converted into a low-overhead loop.
- Don't force-unroll small loops, as we now try to use the non-decrement form of LE for uncountable loops. That optimisation needs CBZ/CBNZ, so reducing code size is important due to their limited branch range.
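
For illustration, the first point can be pictured in C as below. This is a hand-written sketch, not output from the pass: the function names, the element type, and the unroll factor of 4 are all assumptions made for the example. The main loop is unrolled and the remainder is left rolled; both have a trip count known on entry, which is the shape one would expect to be a candidate for a low-overhead loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source loop: sum n values. */
uint32_t sum_rolled(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Illustrative shape after runtime unrolling by an assumed factor of 4.
 * The remainder loop is deliberately kept rolled rather than unrolled. */
uint32_t sum_unrolled(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    size_t i = 0;
    size_t main_trips = n / 4; /* trip count known before the loop starts */

    /* Unrolled main loop: four elements per iteration. */
    for (size_t t = 0; t < main_trips; ++t, i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop: at most three iterations, left rolled. */
    for (; i < n; ++i)
        s += a[i];

    return s;
}
```

Both functions return the same result for any n; only the loop structure differs.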
nit: --check-prefixes=CHECK-UNROLL-A,CHECK is shorter