This models lowering to vpermd/vpermq/vpermps/vpermpd,
that take a single input vector and a single index vector,
and are cross-lane. So far i haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.
This results in *shockingly* lesser costs :)
Is this really useful?