Currently getMemInstScalarizationCost severely overestimates the cost of
computing the addresses for scalarized memory instructions: it charges the
cost of computing a full vector of addresses once per scalar access.
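For context, the relevant computation currently looks roughly like the
following (a simplified sketch; the variable and helper names follow the
LoopVectorize cost model from memory and may not match the code exactly):

```cpp
// Simplified sketch (assumed shape, not the exact code): the address cost is
// queried with a vector-of-pointers type and then charged once per scalar
// lane, so the cost of a full vector address computation is paid VF times.
Type *PtrTy = ToVectorTy(Ptr->getType(), VF);                 // <VF x ptr>
const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, PSE, TheLoop);
InstructionCost Cost =
    VF.getKnownMinValue() *
    TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
```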
This patch adjusts the address cost computation to use the scalar pointer
type instead.

Unfortunately this means that we now prefer scalarization over
gather/scatters in some cases on X86. This is because we divide the cost of
predicated memory instructions by the reciprocal of the predicated block
probability, which usually makes the scalarized cost lower than that of the
masked gather/scatter. I am not sure there is an easy way to avoid this in
the short term; the underlying problem is that we currently do not really
account for the overhead of the compares/branches that guard predicated
blocks.
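For reference, this is roughly where that scaling happens (again a sketch
from memory; I believe getReciprocalPredBlockProb() assumes a 50% execution
probability for predicated blocks, but the exact guard may differ):

```cpp
// Rough sketch (assumed): the scalarized cost of a predicated access is
// divided by the reciprocal of the predicated-block probability, i.e.
// effectively halved, while the compares/branches that guard the predicated
// blocks are not costed here at all. Combined with the cheaper scalar
// address cost, this can make scalarization look cheaper than a masked
// gather/scatter.
if (isPredicatedInst(I))
  Cost /= getReciprocalPredBlockProb();
```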
With this change, we now vectorize the gather-cost.ll tests. I checked
on both ARM64 and X86 (Skylake), and the vectorized versions are roughly
2x faster than the scalar loops.
drive-by nit: use VF.getFixedValue(), since this function only makes sense
for fixed-width vectors. (The same applies to a few other places in this
function, not necessarily caused by this patch.)
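Something like this, assuming the call sites currently use
VF.getKnownMinValue() (illustrative only):

```cpp
// getFixedValue() asserts that VF is fixed-width, which is the assumption
// this function already relies on; getKnownMinValue() would silently accept
// a scalable VF. (Assumed call site, shown for illustration.)
InstructionCost Cost =
    VF.getFixedValue() * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
```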