The motivating case is shown in the first regression test: we transfer the element to a scalar register and back rather than just zero-extending with 'vpmovzxdq'.
That's a special case of the more general pattern handled here. In all tests, we're avoiding the vector-scalar-vector moves in favor of vector ops.
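To make the equivalence concrete, here is a minimal scalar model in plain C++ (not LLVM code). It assumes a 4 x i32 source widened to 2 x i64 and little-endian lane order as on x86; the type and function names are hypothetical and exist only for this illustration. It compares the extract/zext/rebuild sequence with a shuffle against a zero vector followed by a bitcast, which is the shape 'vpmovzxdq' computes.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-ins for the vector types; not LLVM types.
using V4i32 = std::array<uint32_t, 4>;
using V2i64 = std::array<uint64_t, 2>;

// Scalar round trip: extract each low element, zero-extend it,
// and rebuild the wide vector from the scalars.
V2i64 zextViaScalars(const V4i32 &v) {
  return {static_cast<uint64_t>(v[0]), static_cast<uint64_t>(v[1])};
}

// Vector form: interleave the source with zeros, then reinterpret each
// pair of 32-bit lanes as one 64-bit lane (little-endian lane order).
V2i64 zextViaShuffle(const V4i32 &v) {
  const V4i32 zero = {0, 0, 0, 0};
  // Equivalent to shuffle mask <0, 4, 1, 5> over the pair (v, zero).
  const V4i32 shuffled = {v[0], zero[0], v[1], zero[1]};
  // Models the bitcast: the low lane supplies the low 32 bits.
  auto bitcastPair = [](uint32_t lo, uint32_t hi) {
    return static_cast<uint64_t>(lo) | (static_cast<uint64_t>(hi) << 32);
  };
  return {bitcastPair(shuffled[0], shuffled[1]),
          bitcastPair(shuffled[2], shuffled[3])};
}
```

Both functions return the same value for any input, which is why the scalar moves can be dropped without changing the result.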
I suspect that we aren't producing optimal shuffle code in some cases though, so we may want to limit this patch and/or account for those patterns first. But I figured it was worth posting the larger test diffs, so we can see what's happening and make sure the logic is correct.
If we want to limit this patch but still get that 1st motivating case, I see 2 possibilities:
- Don't handle patterns where we require translating the source element to a different location in the result.
- Don't handle patterns with zero elements in the build vector (only deal with undefs there).
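As a rough sketch of how either restriction could be phrased as a bail-out check, the following standalone C++ models the build vector operands. Everything here is hypothetical (the `BuildVecElt` struct, the `EltKind` enum, and both helper names), and it is not the DAGCombiner code. It also assumes one particular reading of "translation": the extracted source lane differs from the narrow lane that holds the low part of the result element.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one build-vector operand; not an LLVM type.
enum class EltKind { Undef, Zero, ZextOfExtract };

struct BuildVecElt {
  EltKind kind;
  unsigned srcIndex; // source lane; only meaningful for ZextOfExtract
};

// First restriction (assumed reading): report whether any extracted element
// would have to move to a different narrow lane in the result.
// 'scale' is wide element bits divided by narrow element bits.
bool needsTranslation(const std::vector<BuildVecElt> &elts, unsigned scale) {
  for (std::size_t i = 0; i < elts.size(); ++i)
    if (elts[i].kind == EltKind::ZextOfExtract &&
        elts[i].srcIndex != i * scale)
      return true;
  return false;
}

// Second restriction: report whether any operand is an explicit zero
// (as opposed to undef).
bool hasZeroElt(const std::vector<BuildVecElt> &elts) {
  for (const BuildVecElt &e : elts)
    if (e.kind == EltKind::Zero)
      return true;
  return false;
}
```

Under this model, a limited version of the patch would return early when the chosen predicate is true, while the motivating case (source lane 0 extended into result lane 0, remaining lanes undef) passes both checks.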
Should ANY_EXTEND be handled as well? SimplifyDemandedBits can reduce ZERO_EXTEND -> ANY_EXTEND more aggressively these days.