Let's take a look at:
https://godbolt.org/z/4f6bv69hc
At first glance it looks like we need 4 shuffles there,
but we only need two: the replication factor is 2x the vector size,
so half of the result vectors duplicate the other half
and can be materialized via a move.
Effectively, this means that the replication cost has a hard upper limit
along the replication factor axis: past a certain point,
a larger replication factor does not require more shuffles.
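
To illustrate the counting argument, here is a minimal sketch. It is
not LLVM's cost model and not necessarily the exact case behind the link;
it assumes a 2-element source, a replication factor of 4, and legal vectors
that hold 2 elements each. The helper names replicationMask and
countDistinctChunks are hypothetical. The sketch builds the replication
mask, splits it into legal-width chunks, and counts how many chunks are
actually distinct, since only those would need a shuffle of their own.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: build a replication mask in which each of the
// `vf` source elements is repeated `rf` times in a row.
// For vf = 2, rf = 4 this yields {0, 0, 0, 0, 1, 1, 1, 1}.
std::vector<int> replicationMask(std::size_t vf, std::size_t rf) {
  std::vector<int> mask;
  mask.reserve(vf * rf);
  for (std::size_t i = 0; i < vf * rf; ++i)
    mask.push_back(static_cast<int>(i / rf));
  return mask;
}

// Hypothetical helper: split the mask into chunks of `width` elements
// (one chunk per legal result vector) and count the distinct chunks.
// Assumption: each distinct chunk needs one shuffle, and every repeated
// chunk can be produced by copying an already-computed vector.
std::size_t countDistinctChunks(const std::vector<int>& mask,
                                std::size_t width) {
  assert(width > 0 && mask.size() % width == 0);
  std::set<std::vector<int>> distinct;
  for (std::size_t i = 0; i < mask.size(); i += width)
    distinct.emplace(mask.begin() + i, mask.begin() + i + width);
  return distinct.size();
}
```

Under these assumptions, countDistinctChunks(replicationMask(2, 4), 2)
sees four result vectors but only two distinct ones, and doubling the
replication factor to 8 still leaves two distinct chunks, which is the
saturation described above.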