Instead of splitting the vector into power-of-two sized chunks,
process the large vector in a streaming fashion,
decreasing the operational vector size
once it no longer fits the elements left to process.
Notably, this improves costs for overaligned loads, since loading padding is fine.
This also tracks more directly when we need to insert/extract the YMM/XMM subvector,
so some costs fluctuate as a result.
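
As a rough illustration of the streaming idea described above (not the actual cost-model code), the sketch below walks a vector from the widest operation downwards. The function name `stream_chunks`, the 256-bit default width, and the halving step are assumptions made for this example only.

```python
def stream_chunks(num_elts, elt_bits, max_vec_bits=256):
    """Hypothetical sketch: list the (op_bits, elts_covered) operations used
    to walk num_elts elements, starting at max_vec_bits and shrinking the
    operation width only when it no longer fits the elements left."""
    # Assumption: the element width evenly divides the widest operation.
    if elt_bits <= 0 or max_vec_bits % elt_bits != 0:
        raise ValueError("elt_bits must evenly divide max_vec_bits")
    ops = []
    op_bits = max_vec_bits
    left = num_elts
    while left > 0:
        # Halve the operation width while it covers more elements than
        # remain, but never go below a single element.
        while op_bits > elt_bits and op_bits // elt_bits > left:
            op_bits //= 2
        covered = op_bits // elt_bits
        ops.append((op_bits, covered))
        left -= covered
    return ops


# 20 x i32: two 256-bit (YMM-sized) ops, then one 128-bit (XMM-sized) op.
print(stream_chunks(20, 32))
```

Under these assumptions, the width never grows back once it has been reduced, so each point where the width changes corresponds to a place where a subvector insert/extract would have to be accounted for.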
This assert triggers if a vector of i1 (e.g. <16 x i1>) is passed, because EltTyBits will be 1.
The assert can be triggered by running opt -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu on
See https://llvm.godbolt.org/z/jxPvdGEW4 for a runnable version.