X86TTIImpl::getGSScalarCost() has (at least) two issues:
- It naively computes the cost of a sequence of insertelement/extractelement operations. If we are operating on YMM/ZMM rather than XMM registers, this wildly overestimates the cost of subvector insertions/extractions.
- Gather/scatter takes a vector of pointers, and scalarization results in a scalar memory operation for each of these pointers, but we never account for the cost of extracting these pointers from the vector of pointers.
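
To make the second issue concrete, here is a minimal toy model of a scalarized gather. It is only a sketch: toyScalarizedGatherCost and the unit costs are hypothetical placeholders, not the actual X86TTIImpl code or its cost values, and the subvector effect from the first issue is deliberately not modeled.

```cpp
#include <cassert>

// Placeholder unit costs, chosen for illustration only (assumptions,
// not values taken from the X86 cost tables).
constexpr int ToyScalarLoadCost = 1; // one scalar load
constexpr int ToyInsertEltCost = 1;  // one insertelement into the result
constexpr int ToyExtractEltCost = 1; // one extractelement of a pointer

// Hypothetical helper: cost of a gather of NumElts elements once it has
// been scalarized into NumElts scalar loads.
int toyScalarizedGatherCost(int NumElts, bool IncludePtrExtracts) {
  assert(NumElts > 0 && "expected at least one element");
  // One scalar load per pointer.
  int Cost = NumElts * ToyScalarLoadCost;
  // Each loaded value is inserted back into the result vector.
  Cost += NumElts * ToyInsertEltCost;
  // The missing term: each pointer must first be extracted from the
  // vector of pointers before it can be used as a scalar address.
  if (IncludePtrExtracts)
    Cost += NumElts * ToyExtractEltCost;
  return Cost;
}
```

With these placeholder costs, an 8-element gather is modeled as 16 when the pointer extracts are ignored and as 24 when they are included, which is the gap the second bullet describes.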