I found loops whose vectorization cost was greatly overestimated because two instructions were both scalarized. Since one was using the result of the other, the defining instruction does not need to do the inserts, and the user does not have to extract any elements.
I experimented with a patch that makes the LoopVectorizer collect the instructions that the target will scalarize (expand). This set is then used to find these cases and is passed (eventually) to getScalarizationOverhead(), which then returns a reduced value.
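A minimal sketch of the idea (all names and the interface here are hypothetical stand-ins, not the actual LLVM API): given the set of instructions the target will scalarize, the overhead computation can skip the insert cost for a scalarized definition whose users are all scalarized, and skip the extract cost for any operand that is itself defined by a scalarized instruction:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model, not the real LLVM interfaces: an instruction is a name plus
// its operand names; the set holds everything the target will expand.
struct Inst {
  std::string Name;
  std::vector<std::string> Ops;
};

// Illustrative per-element costs.
constexpr unsigned InsertCost = 1, ExtractCost = 1;

// Overhead of scalarizing I with vectorization factor VF: inserts to build
// the vector result, unless every user is also scalarized (a fact the
// caller determines and passes in), plus extracts for each operand that is
// not defined by another scalarized instruction.
unsigned getScalarizationOverhead(const Inst &I, unsigned VF,
                                  const std::set<std::string> &Scalarized,
                                  bool AllUsersScalarized) {
  unsigned Cost = 0;
  if (!AllUsersScalarized)
    Cost += VF * InsertCost;   // result must be reassembled into a vector
  for (const auto &Op : I.Ops)
    if (!Scalarized.count(Op))
      Cost += VF * ExtractCost; // vector operand must be taken apart
  return Cost;
}
```

With a scalarized div feeding a scalarized mul, the div pays no insert cost and the mul pays no extract cost for the div operand, which is exactly the reduction the patch is after.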
So far I have found in practice on SystemZ that this amounts to more float loops being vectorized, typically with the only benefit being the vectorized memory operations. I am not sure how beneficial this is.
This should be easily usable by other targets as well, but so far it is SystemZ only.
Is this useful enough to include in the loop vectorizer?