In a low-overhead loop, LR should ideally be used exclusively for the loop count, and not be spilled and reloaded inside the loop. This attempts to enforce that more directly by adjusting the register classes of registers used or def'd in the loop so that they do not include LR. In particular, this helps prevent the live range of LR from being split between t2LoopDec and t2LoopEnd, meaning we revert the loop less often (and don't end up with movs at the same time!)
It does mean that we have one register fewer, which can mean we end up spilling other registers more. On average this should be an improvement though.
I think adding GPRwithZR and GPRwithZRnosp makes sense as well, especially since your test uses csinc.