Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.
Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.