By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements [0, VF)*stride + 0,
and 1'th vector containing elements [0, VF)*stride + 1.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)
Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)
Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least in general,
that is obviously overly pessimistic, because in general,
not all the shuffles needed to perform the full interleaving
will end up being live.
So what i propose, is to naively scale the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate.
Add a comment explaining this is just an approximation