We know that pcmp (SSE/AVX at least; I'm intentionally leaving 512-bit out of this patch because I don't know what happens there) produces all-ones/all-zeros bitmasks, so we can use that behavior to avoid an unnecessary constant load: the zext of the mask can be lowered with a shift rather than an 'and' with a loaded constant.
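To illustrate (my own SSE2-intrinsics sketch, not code from this patch): since each compare lane is all-ones or all-zeros, a logical right shift by 31 turns a 32-bit lane mask directly into 0 or 1, with no constant-pool access:

  #include <emmintrin.h> // SSE2

  // Zext of a 32-bit-lane compare mask to 0/1 via a shift.
  // pcmpgtd leaves each lane as 0x00000000 or 0xFFFFFFFF, so a
  // logical right shift by 31 (psrld $31) maps that to 0 or 1.
  __m128i zext_cmp_shift(__m128i a, __m128i b) {
    __m128i mask = _mm_cmpgt_epi32(a, b); // all-ones/all-zeros lanes
    return _mm_srli_epi32(mask, 31);      // 0xFFFFFFFF -> 1, 0 -> 0
  }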
FWIW, I see no performance differences in test-suite with this change; I don't expect a zext of a bitmask to be a common pattern anyway. This is a first step toward handling the better motivating example in PR28486:
https://llvm.org/bugs/show_bug.cgi?id=28486
...which is itself just an extract from a case where we seemingly get everything wrong:
https://godbolt.org/g/Ez2bDW
One could argue that load+and is actually a better solution for some CPUs (Intel big cores) because shifts don't have the same throughput potential on those cores, but I think that should be handled as a later, CPU-specific transformation if it ever comes up; removing the load is the more general x86 optimization. A sketch contrasting the two forms follows the PR link below. Note that the uneven usage of vpbroadcast in the test cases is filed as PR28505:
https://llvm.org/bugs/show_bug.cgi?id=28505
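For contrast with the shift form sketched above, here is the load+and alternative (again my own illustrative intrinsics, not this patch's actual output); the _mm_set1_epi32(1) constant typically materializes as a constant-pool load feeding the pand, and that load is what this patch removes:

  #include <emmintrin.h> // SSE2

  // Zext of a 32-bit-lane compare mask to 0/1 via and-with-constant.
  // The <1,1,1,1> constant is usually loaded from memory (pand with a
  // memory operand), trading the shift for a load on wide cores.
  __m128i zext_cmp_and(__m128i a, __m128i b) {
    __m128i mask = _mm_cmpgt_epi32(a, b);          // all-ones/all-zeros lanes
    return _mm_and_si128(mask, _mm_set1_epi32(1)); // pand <1,1,1,1>
  }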