This moves v32i16/v64i8 to a model more consistent with how we
treat integer types with avx1.
This does change the ABI for vXi16/vXi8 vectors larger than
512 bits: they now pass in multiple zmms instead of multiple ymms
(e.g. a 1024-bit v64i16 takes two zmms rather than four ymms).
We'd already hacked some code to make v64i8/v32i16 pass in zmm.
The cost model is still a bit of a mess. In some places I tried to
match existing behavior, but really we need to account for
splitting and concatenating costs. The cost model for shuffles is
especially pessimistic. This has a big effect on reductions, since
the generic lowering uses PermuteSingleSrc. A reduction uses a
very specific pattern that can be handled by subvector extracts and
shifts, but the default handling doesn't know that.
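
To illustrate the reduction pattern mentioned above, here is a rough
scalar model of a v32i16 add-reduction done by repeated halving. It is
only a sketch, not LLVM code: the names V32i16, kLanes and
reduceAddByHalving are hypothetical, and the mapping of each step to
an extract or a shift is my reading of the pattern, not a statement
about what the backend emits.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 512-bit vector of 32 x i16 lanes.
// Lanes are unsigned so that wrap-around on overflow is well defined.
constexpr std::size_t kLanes = 32;
using V32i16 = std::array<std::uint16_t, kLanes>;

// Add-reduce by folding the upper half of the live lanes onto the
// lower half until one lane remains.
//
// Assumed mapping to vector operations (illustrative only):
//   width 16 and 8: the upper half is a whole 256-bit or 128-bit
//                   subvector, so it can be taken with a subvector
//                   extract;
//   width 4, 2, 1:  both halves sit in one 128-bit vector, so the
//                   upper half can be moved down with a shift.
// No step needs a general single-source permute.
std::uint16_t reduceAddByHalving(V32i16 v) {
  for (std::size_t width = kLanes / 2; width >= 1; width /= 2) {
    for (std::size_t i = 0; i < width; ++i) {
      v[i] = static_cast<std::uint16_t>(v[i] + v[i + width]);
    }
  }
  return v[0];
}
```

Costing every one of these steps as a full PermuteSingleSrc on the
wide type is what makes the reduction estimate so pessimistic.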
This is a bit of a hack, but it was easier than trying to hunt down all the places that can create broadcasts in lowering. I couldn't do this with an isel pattern for the broadcast_load case, so I just handled both here.