This is an archive of the discontinued LLVM Phabricator instance.

[x86] use movmsk when extracting multiple lanes of a vector compare (PR39665)
AbandonedPublic

Authored by spatel on Mar 21 2019, 2:29 PM.

Details

Summary

The motivation for this patch is the 2nd example from PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2
(To avoid confusion, we could open a separate bug report to correspond with this patch because there are several independent problems in the bug report.)

We do a vector comparison, extract >1 lane, and then do some arbitrary ops on those results. The likely expensive part of that sequence is getting the results from XMM to GPR, so we want to do that with a single 'movmsk'.
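As a rough illustration (a schematic Python model, not the actual codegen), each lane of a vector compare is either all-ones or all-zeros, and a single movmsk packs the lanes' sign bits into one scalar mask, so all the per-lane tests become cheap scalar bit tests instead of multiple XMM-to-GPR moves:

```python
def movmsk(lanes):
    """Model of x86 MOVMSKPD/PMOVMSKB: pack each lane's sign bit
    into a scalar integer, lowest lane in bit 0."""
    mask = 0
    for i, lane in enumerate(lanes):
        if lane:  # lane is "true" (all-ones in the vector register)
            mask |= 1 << i
    return mask

# Hypothetical two-lane compare: x > y per lane.
x = [2.0, 5.0]
y = [1.0, 7.0]
cmp_lanes = [a > b for a, b in zip(x, y)]  # [True, False]

# Instead of extracting each lane to a GPR separately,
# one movmsk yields both results at once.
m = movmsk(cmp_lanes)
assert m == 0b01
```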

Typically, we'd want to do "all-of" or "any-of" type comparisons, but I've intentionally created more complicated test cases here to show potential trade-offs. If we're not happy with those diffs, we can restrict the pattern matching to only apply to the more specific/typical patterns.

Unfortunately, we seem to be missing folds to form 'test' instructions with bitmasks, so even the motivating case isn't optimal yet. I'll take a look at that transform next if the general direction on this patch looks good, but I think this patch is likely still a perf win on that example (although I have no idea about the KNL target specifically).

Diff Detail

Event Timeline

spatel created this revision.Mar 21 2019, 2:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2019, 2:29 PM
RKSimon added inline comments.Mar 21 2019, 2:46 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
34440

With a little care we should be able to do v8i16 as well.

34480

Can any of the code from combineHorizontalPredicateResult or combineBitcastvxi1 be reused?

llvm/test/CodeGen/X86/movmsk-cmp.ll
4926

Are these 'and' instructions superfluous if we just need the lsb?

5136

For the any-of/all-of cases it'd be a lot better if we could merge the tests into a single compare - in c-ray that would allow us to merge the multiple cmp+jmp pairs - the 2 separate jmps packed so close together are a known perf issue.

spatel marked 2 inline comments as done.Mar 21 2019, 3:07 PM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
34480

I didn't see a way. This raises what might be a fundamental question about these kinds of patterns.

I was going for a general solution, but if we really only care about reductions or compare-of-compare, then we might be better off trying to add a glob of vector compare logic to SLP (maybe it's already there)?

Ie, we could try to form this in IR:

%cmp = fcmp ogt <2 x double> %x, %y
%e1 = extractelement <2 x i1> %cmp, i32 0
%e2 = extractelement <2 x i1> %cmp, i32 1
%u = and i1 %e1, %e2
=>
%cmp = fcmp ogt <2 x double> %x, %y
%bc = bitcast <2 x i1> %cmp to i2
%u = icmp eq i2 %bc, -1
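A quick exhaustive check of that equivalence (a Python model over hypothetical lane values, not tied to any test in the patch): the 'and' of all extracted i1 lanes matches comparing the packed bitmask against all-ones.

```python
from itertools import product

def all_of_extracts(lanes):
    # Model of: extractelement each lane, then 'and i1' them together.
    result = True
    for lane in lanes:
        result = result and lane
    return result

def all_of_bitmask(lanes):
    # Model of: bitcast <N x i1> to iN, then 'icmp eq iN %bc, -1'
    # (-1 in an N-bit integer is the all-ones pattern).
    n = len(lanes)
    mask = sum(1 << i for i, lane in enumerate(lanes) if lane)
    return mask == (1 << n) - 1

# Check all 2^2 lane combinations for the <2 x i1> case.
for lanes in product([False, True], repeat=2):
    assert all_of_extracts(list(lanes)) == all_of_bitmask(list(lanes))
```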
llvm/test/CodeGen/X86/movmsk-cmp.ll
5136

Yes, I think this is an independent problem. Let me see if I can solve at least part of that, so it's not distracting here.

spatel marked an inline comment as done.
spatel added inline comments.
llvm/test/CodeGen/X86/movmsk-cmp.ll
5136

Oh no...we need the select-to-logic transform that we recently discussed as illegal due to poison:

  t21: i32 = select t28, Constant:i32<42>, Constant:i32<99>
  t22: i32 = select t30, t21, Constant:i32<99>

The more I look at the motivating patterns, the less hopeful I am that we can get optimal code by delaying the transforms until SDAG.

So here's a 1st step to improve the SLP vectorizer: D59710

For reference, the double-select is *created here in SDAG* because that is supposed to be an optimization: D7622

There's still a chance that this transform is useful in some cases, so I'm not quite ready to abandon.

spatel abandoned this revision.Apr 30 2019, 7:44 AM

D61189 improves on what this patch set out to do (although we're still likely to need IR fixes), so abandoning.