All of our insert/extract ops work on 128-bit lanes.
For Insert, we need to extract the affected 128-bit lane,
unless it's being fully overwritten (FIXME: do we need to be
careful about legalization-induced padding that we obviously don't demand?),
perform insertions, and then insert the 128-bit lane back.
But hold on. If we are operating on a 256-bit legal vector,
and thus have two 128-bit subvectors, and are fully overwriting them both,
we don't actually need to insert *both* subvectors,
only the second one, into the implicitly-widened first one.
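The Insert accounting described above can be sketched roughly as follows.
This is a standalone illustration, not the actual X86 TTI code: the
`LaneCosts` struct, the `insertOverhead` helper and the unit costs are all
hypothetical, and the demanded-element set is modelled as a plain
`std::vector<bool>` rather than an APInt.

```cpp
#include <cassert>
#include <vector>

// Hypothetical unit costs. A real cost model would query these per target
// instead of assuming them to be 1.
struct LaneCosts {
  int ExtractSubvector = 1; // pull one 128-bit lane out of a wider vector
  int InsertSubvector = 1;  // put one 128-bit lane back into a wider vector
  int InsertElement = 1;    // insert one scalar into a 128-bit lane
};

// Demanded[i] is true if element i is being inserted.
// EltBits is the element width; LegalBits is the width of one legal vector
// (128 or 256 in this sketch).
int insertOverhead(const std::vector<bool> &Demanded, unsigned EltBits,
                   unsigned LegalBits, LaneCosts C = LaneCosts()) {
  const unsigned EltsPerLane = 128 / EltBits;
  const unsigned LanesPerLegal = LegalBits / 128;
  // Assumption: the input covers a whole number of legal vectors.
  assert(Demanded.size() % (EltsPerLane * LanesPerLegal) == 0);
  const unsigned NumLanes = Demanded.size() / EltsPerLane;

  int Cost = 0;
  for (unsigned Base = 0; Base < NumLanes; Base += LanesPerLegal) {
    // Count the demanded elements in each 128-bit lane of this legal vector.
    std::vector<unsigned> Count(LanesPerLegal, 0);
    bool AllLanesFull = true;
    for (unsigned L = 0; L < LanesPerLegal; ++L) {
      for (unsigned E = 0; E < EltsPerLane; ++E)
        Count[L] += Demanded[(Base + L) * EltsPerLane + E];
      AllLanesFull &= (Count[L] == EltsPerLane);
    }

    for (unsigned L = 0; L < LanesPerLegal; ++L) {
      if (Count[L] == 0)
        continue; // untouched lane, nothing to pay
      Cost += Count[L] * C.InsertElement;
      if (LanesPerLegal == 1)
        continue; // 128-bit legal vector: no subvector traffic at all
      // A partially overwritten lane must be extracted first.
      if (Count[L] != EltsPerLane)
        Cost += C.ExtractSubvector;
      // When every lane is fully overwritten, lane 0 is implicitly widened
      // and only the remaining lanes need to be inserted into it.
      if (!(AllLanesFull && L == 0))
        Cost += C.InsertSubvector;
    }
  }
  return Cost;
}
```

With these made-up unit costs, fully overwriting 8 x i32 in a 256-bit legal
vector pays for eight element insertions plus a single subvector insertion,
while overwriting only the low four elements still pays for inserting lane 0,
since the rest of the wide vector has to be preserved.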
Also, Insert wasn't actually querying the costs,
but just assuming them to be 1.
getShuffleCost(TTI::SK_ExtractSubvector) notes:
// Note that in general, the insertion starting at the beginning of a vector
// isn't free, because we need to preserve the rest of the wide vector.
... so as far as I can tell, we didn't account for that.
I was hoping this would allow vectorization at a higher VF in one case I looked at,
but the subvector insertion cost still advises against that.
The change for Extract is NFC, and is for consistency only:
I wanted to get rid of that weird explicit discounting of the insertion of the 0'th element,
since the general code should already deal with that.
This confused me - we might need a better term than CostValue now (NumLegalVectors?)