Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.
TODO: We should investigate adding VPBLENDVB handling to target shuffle combining as well.
Should we be preferring VPBLENDVB/VSELECT for AVX512 targets?
We should have another testcase that blends something other than -1, because with three -1 elements the best way to do this blend is with an OR: OR with -1 produces -1 regardless of the previous contents, and OR with 0 is a no-op, so a single OR against a constant mask is effectively a blend.
I don't have numbers on loading a constant vs. a couple extra uops outside a loop. Obviously any time we have a loop that will either keep a constant hot in cache, or let us hoist into a reg, this is a very nice win.
https://godbolt.org/z/JNv5VZ shows that this works: a manually optimized version of the function gives the same result for constant-propagation.
clang actually used vorps, but that can only run on port 5 before Skylake. I compiled with -march=haswell, so compiling _mm256_or_si256 to vorps (port 5 only) instead of vpor (port 0/1/5) is really silly for an integer vector. (SKL lets vorps run on any port, with the latency between FP instructions depending on which port it actually picks, but for HSW it's a poor choice.)
Without AVX, por is 1 byte longer than orps, but even then por can be worth it on pre-Skylake depending on the surrounding code (port 5 pressure, and/or whether there's any ILP for this blend). Also, with Hyperthreading, uops that can be assigned to any port are more likely to take full advantage of the extra ILP exposed by SMT, instead of both threads potentially bottlenecking together on the same port.