We ask TTI.getAddressComputationCost() about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.
This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because X86TTIImpl::getAddressComputationCost()
says that cost of vector address computation is 10 as compared to 1 for scalar.
The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)
This makes a lot of sense to me. I expect none of the pre-AVX2 benchmark numbers to change at all and those for AVX2+ to change positively. If that's the case, then I don't see why this would be a bad choice.