This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Use MOVMSK for all_of/any_of reduction patterns
ClosedPublic

Authored by RKSimon on Jan 17 2017, 8:52 AM.

Details

Summary

This is a first attempt at using the MOVMSK instructions to replace all_of/any_of reduction patterns (i.e. an and/or + shuffle chain).

So far this only matches patterns where we are reducing an all-bits/none-bits source vector (i.e. a comparison result), but we should be able to expand on this in conjunction with improvements to 'bool vector' handling, both in the x86 backend and in the vectorizers etc.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jan 17 2017, 8:52 AM
RKSimon updated this revision to Diff 84691.Jan 17 2017, 9:57 AM

Avoid this on avx512vl targets

I see that some simple tests grow by a bunch of instructions (still simple, but judging from instruction tables, marginally slower after this transform). These tests are like micro-benchmarks and don't really reflect real-world code, but should we only do this transform if we have a minimum number of shuffles?

Thanks,
Filipe

lib/Target/X86/X86ISelLowering.cpp
28639

Maybe that second part should be in an assert?
The way I see it, we haven't really done much yet, so either:

  • The bitwidth changed (how? matchBinOpReduction bails if it encounters an extract_subvector; shuffles with different sizes are a thing in IR, though I can't remember if they also exist on SDAG), but we can still recover if we use the new bitwidth we got (if it still matched, there shouldn't be a big problem, no?)
  • The bitwidth changed and we can't recover (does this happen?)
28657

This raised a red flag when reading. Maybe join the two? Will it be that hard to read?
Otherwise, something like

MVT ElemVT = ...;
MaskVT = ...;
RKSimon updated this revision to Diff 85401.Jan 23 2017, 8:25 AM

Updated based on Filipe's feedback.

RKSimon marked an inline comment as done.Jan 23 2017, 8:27 AM

I've limited the combine to only reduce vectors with 4 or more elements.

lib/Target/X86/X86ISelLowering.cpp
28639

So there are valid situations where Match.getScalarValueSizeInBits() != BitWidth, as EXTRACT_VECTOR_ELT can perform implicit extension of the extracted vector element. I've added a comment to explain. If it proves useful we can add support for this case, but I don't see a need yet.

andreadb added inline comments.Feb 1 2017, 5:21 AM
test/CodeGen/X86/vector-compare-all_of.ll
739–746

I was expecting to see the usual cmpl+movw+cmov pattern here.
Is it because you would end up with a conditional move of i8 quantities?

andreadb added inline comments.Feb 1 2017, 8:20 AM
test/CodeGen/X86/vector-compare-all_of.ll
583–585

Rather than movw+cmovnew+cwtl, can we select instead movl+cmovnel?
The advantage is that we remove the partial register writes on AX, and we can get rid of the sign extend at the end (cwtl).

RKSimon added inline comments.Feb 1 2017, 10:12 AM
test/CodeGen/X86/vector-compare-all_of.ll
739–746

Hmm, it might be possible to lower this as:

pmovmskb %xmm0, %edi
xor %eax, %eax
cmpl $65535, %edi
sete %al
negl %eax
retq
andreadb added inline comments.Feb 1 2017, 10:36 AM
test/CodeGen/X86/vector-compare-all_of.ll
739–746

I was about to suggest cmp+set+sra, but neg is even better :-).

RKSimon updated this revision to Diff 86671.Feb 1 2017, 10:38 AM

To avoid the partial register issue, I've updated the result select code to always perform it as i32/i64 followed by a truncation if necessary.

andreadb accepted this revision.Feb 1 2017, 10:55 AM

> To avoid the partial register issue, I've updated the result select code to always perform it as i32/i64 followed by a truncation if necessary.

Nice! That also fixes the issue with partial register writes for OR reductions.

The patch looks good to me now.
Thanks Simon!

This revision is now accepted and ready to land.Feb 1 2017, 10:55 AM
This revision was automatically updated to reflect the committed changes.