Modify the cost calculation function for interleaved accesses
to use the target-specific costs for insert/extract element and
memory operations.
This better models the case where the backend can't match
the interleaved group, and we are forced to use a wide load
and shuffle vectors.
Interleaved accesses are not enabled by default, so this shouldn't
cause a performance change.
nitpick: can't you cache the T * to avoid re-using the ugly static_cast?