Currently we happen to split a chain of 12xi8 accesses into 6xi8 + 6xi8, which
produces rather suboptimal code.
This change attempts to split-off non-multiples of 4bytes at the end and if that
does not work, splits on the smaller power-of-2 boundary.
nit: add {}?