This is an archive of the discontinued LLVM Phabricator instance.

[x86] eliminate unnecessary vector compare for AVX masked store
ClosedPublic

Authored by spatel on Sep 4 2017, 10:33 AM.

Details

Summary

As noted in PR11210:
https://bugs.llvm.org/show_bug.cgi?id=11210
...fixing this should allow us to eliminate x86-specific masked store intrinsics in IR.
(Although more testing will be needed to confirm that.)

I don't know anything about SKX, so I don't know if the existing code is optimal or not. If we want to handle that case too, we could add a check for a 'PCMPGTM' opcode or try to fold this earlier when the node is still a setcc.

Diff Detail

Event Timeline

spatel created this revision.Sep 4 2017, 10:33 AM
aymanmus added inline comments.Sep 5 2017, 1:10 AM
lib/Target/X86/X86ISelLowering.cpp
33185

Is there any canonical form of compare-with-all-zeros that can be guaranteed here? Or should the pattern with (pcmplt X, 0) be added also?

test/CodeGen/X86/masked_memop.ll
1158

I think the optimal code for SKX is:
vpmovd2m %xmm2, %k1
vmovups %xmm0, (%rdi) {%k1}

RKSimon added inline comments.Sep 5 2017, 4:17 AM
lib/Target/X86/X86ISelLowering.cpp
33185

Add X86ISD::PCMPGTM support?

spatel added inline comments.Sep 5 2017, 6:14 AM
lib/Target/X86/X86ISelLowering.cpp
33185

Waiting until this is PCMPGT is a kind of canonicalization (compared to the general setcc node) because SSE/AVX don't have any other compare predicates. Ie, there's no other simple way to encode this; there is no PCMPLT node.

test/CodeGen/X86/masked_memop.ll
1158

Ok - let me try to shake that out of here. To be clear, we're saying this is the optimal sequence for any CPU with avx512vl/avx512bw. SKX is just an implementation of those ISAs.

aymanmus added inline comments.Sep 5 2017, 7:01 AM
lib/Target/X86/X86ISelLowering.cpp
33185

100%, my fault.

test/CodeGen/X86/masked_memop.ll
1158
  • The IACA tool shows same throughput for both sequences, but the one I suggested has one less uop and one less register.
  • Actually the needed features for vpmovb2m/vpmovw2m are avx512vl+avx512bw and for vpmovd2m/vpmovq2m are avx512vl+avx512dq (which skx also includes)
  • The %y test's operand not used.
spatel added inline comments.Sep 5 2017, 10:37 AM
test/CodeGen/X86/masked_memop.ll
1158

I need to confirm what we're saying here. For a 128-bit vector (and similarly for 256-bit), if the machine has avx512 (with all necessary variants), then we would rather see this:

vpmovd2m %xmm2, %k1
vmovups %xmm0, (%rdi) {%k1}

than the single instruction that we would produce for a plain AVX machine:

vmaskmovps %xmm0, %xmm2, (%rdi)

Ie, we want to treat vmaskmovps as legacy cruft and avoid it if we have bitmasks?

aymanmus added inline comments.Sep 6 2017, 12:42 AM
test/CodeGen/X86/masked_memop.ll
1158

Actually it seems like both are equivalent in skx. They show the same throughput and number of uops, the same ports are used and the latency on each port is equal.

Nonetheless, I think we should prefer the vpmovd2m alternative because it provides a full set of instructions for all possible type granularities (byte, word, double-word and quad-word) while the AVX vmaskmovps are only available for 32-bit and 64-bit.

spatel updated this revision to Diff 114028.Sep 6 2017, 10:33 AM

Patch updated:
Given that AVX512 requires different pattern matching and different output, I'm pushing back on trying to include that in this patch. It should be an independent improvement (and I'm not the right person to make that improvement).

I have updated the comments in the code and the test to reflect this. NFC from the previous rev of the patch.

RKSimon accepted this revision.Sep 12 2017, 9:36 AM

LGTM

This revision is now accepted and ready to land.Sep 12 2017, 9:36 AM

I filed a bug to track this case with AVX512:
https://bugs.llvm.org/show_bug.cgi?id=34584

This revision was automatically updated to reflect the committed changes.