This patch reduces code size for all AVX targets and increases speed for some chips.
SSE 4.1 introduced the useless (see code comments) 2-register form of BLENDV and only in the "packed" float/double flavors. Scalar alias mnemonics would have cost so much...paper. But they distinguished between floats and doubles, so we should be thankful. Wait...
AVX subsequently made the instruction useful by adding a 4-register operand form.
So we just need to paper over the lack of scalar forms of this instruction, complicate the code to choose float or double forms, and use blendv on scalars since all FP is in xmm registers anyway.
This gives us a roughly 45% speedup (43.15 / 29.73 ≈ 1.45x, or about 31% fewer cycles per iteration) for a blendv microbenchmark sequence on SandyBridge and Haswell:
blendv : 29.73 cycles/iter
logic : 43.15 cycles/iter
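
For reference, here is a rough, portable C model of the two scalar select sequences compared above. It is only a sketch of the semantics, not the lowering code in this patch: the helper names (cmp_lt_mask, select_logic, select_blendv, check_agree) are made up for illustration, and the real code operates on xmm registers rather than on uint32_t bit patterns.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw bits and back, without aliasing issues. */
static uint32_t bits_of(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Models the result of a scalar SSE compare: all ones if true, all zeros if false. */
static uint32_t cmp_lt_mask(float a, float b) { return a < b ? 0xFFFFFFFFu : 0u; }

/* "logic" sequence: (mask & t) | (~mask & f), i.e. AND + ANDN + OR. */
static float select_logic(uint32_t mask, float t, float f) {
    return float_of((mask & bits_of(t)) | (~mask & bits_of(f)));
}

/* "blendv" sequence: a single select keyed on the sign bit of the mask. */
static float select_blendv(uint32_t mask, float t, float f) {
    return (mask >> 31) ? t : f;
}

/* Hypothetical helper: both sequences must pick the same value. */
static void check_agree(float a, float b, float t, float f) {
    uint32_t mask = cmp_lt_mask(a, b);
    assert(select_logic(mask, t, f) == select_blendv(mask, t, f));
}
```

Because a compare mask is always all ones or all zeros, the sign-bit test in the blendv model and the three-operation logic sequence select the same operand; the win comes from replacing three instructions with one.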
I'm not adding any new test cases because:
- fast-isel-select-sse.ll tests the positive side for regular X86 lowering and fast-isel
- sse-minmax.ll and fp-select-cmp-and.ll confirm that we're not firing for scalar selects without AVX
- fp-select-cmp-and.ll and logical-load-fold.ll confirm that we're not firing for scalar selects with constants