The motivation for this patch is the 2nd example from PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2
(To avoid confusion, we could open a separate bug report to correspond with this patch, because that bug report covers several independent problems.)
We do a vector comparison, extract more than one lane of the result, and then do some arbitrary ops on those extracted values. The likely expensive part of that sequence is moving the results from XMM registers to GPRs, so we want to do that with a single 'movmsk'.
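For illustration only, here is a rough source-level sketch of that idea in C. It is not code from this patch: the helper names are hypothetical, the choice of lanes 0 and 3 is arbitrary, and the scalar fallback exists only so the sketch builds on non-SSE2 targets.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns a 4-bit mask with bit i set when a[i] == b[i].
   On SSE2 targets this is one vector compare plus one movmskps, so all four
   lane results cross from XMM to a GPR in a single instruction. */
static int eq_mask_v4i32(const int32_t a[4], const int32_t b[4]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi32(va, vb);         /* all-ones lane where equal */
    return _mm_movemask_ps(_mm_castsi128_ps(eq)); /* sign bit of each lane */
#else
    /* Portable fallback that produces the same mask. */
    int mask = 0;
    for (int i = 0; i < 4; ++i)
        mask |= (a[i] == b[i]) << i;
    return mask;
#endif
}

/* "Before" shape: each lane's compare result is consumed separately. */
static int lanes_0_and_3_equal_scalar(const int32_t a[4], const int32_t b[4]) {
    return (a[0] == b[0]) & (a[3] == b[3]);
}

/* "After" shape: one mask, then plain integer bit tests on it. */
static int lanes_0_and_3_equal_mask(const int32_t a[4], const int32_t b[4]) {
    return (eq_mask_v4i32(a, b) & 0x9) == 0x9;
}
```

Both `lanes_0_and_3_equal_*` functions compute the same value. The second one shows the form we are aiming for: the lane logic becomes bit logic on a single GPR value, and "any-of" or "all-of" checks reduce to comparing the mask against 0 or 0xF.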
Typically, we'd want to do "all-of" or "any-of" type comparisons, but I've intentionally created more complicated test cases here to show potential trade-offs. If we're not happy with those diffs, we can restrict the pattern matching to only apply to the more specific/typical patterns.
Unfortunately, we seem to be missing folds that form 'test' instructions with bitmasks, so even the motivating case isn't optimal yet. I'll look at that transform next if the general direction of this patch looks good. Even so, I think this patch is likely still a perf win on that example (although I have no idea about the KNL target specifically).
With a little care, we should be able to handle v8i16 as well.