This is an archive of the discontinued LLVM Phabricator instance.

[X86] Detect SAD patterns and emit psadbw instructions on X86.
ClosedPublic

Authored by congh on Nov 19 2015, 1:03 PM.

Details

Summary

As we now can detect vectorized reduction operations, it's time to use this feature! This patch detects the SAD pattern on X86 so that much better code will be emitted once the pattern is matched.

Diff Detail

Event Timeline

congh updated this revision to Diff 40699.Nov 19 2015, 1:03 PM
congh retitled this revision from to Detect SAD patterns and emit psadbw instructions on X86..
congh updated this object.
congh updated this revision to Diff 40741.Nov 19 2015, 9:59 PM

Use Metadata to annotate reduction PHI.

davidxl edited edge metadata.Nov 21 2015, 12:29 PM

Is it ready to be reviewed upstream?

David

I am splitting this patch and a subset is now under review
(http://reviews.llvm.org/D14897). After that patch is landed, I will
send this patch to upstream for review.

thanks,
Cong

congh updated this revision to Diff 41184.Nov 25 2015, 1:56 PM
congh edited edge metadata.

Update the patch after part of it is checked in.

congh updated this revision to Diff 41185.Nov 25 2015, 1:58 PM

Upload the correct patch.

congh updated this revision to Diff 48154.Feb 16 2016, 11:09 PM

Update the patch.

congh retitled this revision from Detect SAD patterns and emit psadbw instructions on X86. to [X86] Detect SAD patterns and emit psadbw instructions on X86..Feb 16 2016, 11:12 PM
congh updated this object.
congh added reviewers: hfinkel, RKSimon.
congh added a subscriber: llvm-commits.
ab added a subscriber: ab.Feb 25 2016, 3:10 PM
ab added inline comments.
lib/Target/X86/X86ISelLowering.cpp
28988–29000

Perhaps you could replace this with BuildVectorSDNode::getConstantSplatNode()?

..but looking at the uses, you can replace them all with isBuildVectorAllOnes/isBuildVectorAllZeros.

29013

All vector MVTs have power-of-2 NumElems, so I think you can drop the second part (and NumElems, and merge the ifs).

29016–29023

What happens on v2i32?

Also, doesn't this require SSE2?

I'd just explicitly check subtarget and type, like we do elsewhere. So, something like:

if ((VT == MVT::v16i32 && !Subtarget.hasAVX512()) ||
    (VT == MVT::v8i32 && !Subtarget.hasAVX2()) ||
    (VT == MVT::v4i32 && !Subtarget.hasSSE2()))
  return SDValue();
29097–29101

Why is this necessary? It seems to me that this is forcing RegSize to 128 (because even v16i8, which requires AVX-512 as it's added as v16i32 has a size of 128). In turn, that forces XMM-sized PSADs to be generated, even on AVX2. Am I missing something?

29122–29124
if (SDValue Sad = detectSADPattern(N, DAG, Subtarget))
  return Sad;
test/CodeGen/X86/sad.ll
2–7

IMHO you should only test the minimal IR with llc here. I see the argument for testing the whole thing, but it seems like 1) that prevents you from testing unrelated constructs (not all IR comes from the vectorizer), and 2) there should already be relevant opt tests.

If this fires on the test-suite (IIRC paq8p could benefit from this?), then people will notice regressions. If it doesn't, is there something we could add to the test-suite that exposes this?

Also, I don't think you need -mcpu.

congh marked 2 inline comments as done.Feb 26 2016, 4:33 PM

Thanks for the review! Please check my inline comments.

lib/Target/X86/X86ISelLowering.cpp
28988–29000

I also need to check if a vector contains all -1s, which could not be satisfied by isBuildVectorAllOnes/isBuildVectorAllZeros. How to use getConstantSplatNode here?

29016–29023

I think the case of v2i32 may not be generated by the vectorizer but as we are not just handling the results of auto-vectorization, I think it is better to cover this case.

For v16i32 we only need SSE2 as we are actually handling v16i8, and the result will be collected as a v2i64. We need AVX512 to handle v64i32.

Therefore, my previous check here is incorrect, which should be

if (VT.getSizeInBits() / 4 > RegSize)
  return SDValue();
29097–29101

v16i8 is actually handled by SSE2, while AVX-512 is handling v64i8. This means even on AVX512 we should still use SSE2 instructions to handle v16i8.

test/CodeGen/X86/sad.ll
2–7

I just want to avoid writing the test case for different targets and hence the test code size is minimized. But I agree with you on that we should not rely on auto-vectorization for the test of this patch. So the minimal IR here would be several huge functions for different targets?

congh updated this revision to Diff 49267.Feb 26 2016, 4:34 PM

Update the patch according to Ahmed's comments.

ab added inline comments.Feb 26 2016, 4:42 PM
lib/Target/X86/X86ISelLowering.cpp
28988–29000

Right, isBuildVectorAllOnes looks for -1 (not 1), as it checks that "all of the elements are ~0 or undef."

test/CodeGen/X86/sad.ll
3–8

Does it need to be huge? After all, this is only testing the PSAD ISel, and the pattern seems pretty minimal; would it work to test variants of your example in detectSADPattern?

congh updated this revision to Diff 49424.Feb 29 2016, 2:01 PM

Update the patch according to Ahmed's comments.

congh added a comment.Feb 29 2016, 2:02 PM

Both comments addressed. PTAL.

lib/Target/X86/X86ISelLowering.cpp
28988–29000

You are right. Updated. Thanks!

test/CodeGen/X86/sad.ll
3–8

I need to compose the test so that the reduction vector operations can be detected. But I think I can still write small tests.

To test on different targets, I have to split the test files into threes for testing sse2/avx2/avx512 as I found llc cannot parse opt -mattr=+avx512bw generated IR on AVX2.

RKSimon edited edge metadata.Mar 8 2016, 3:17 AM

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

lib/Target/X86/X86ISelLowering.cpp
28988–29000

512-bit PSAD requires AVX512BW (Subtarget.hasBWI()) - plain AVX512 can't handle 512-bit vector elements smaller than 32-bits - please ensure the tests run with avx512f (which will use the AVX2 path) as well as avx512bw.

test/CodeGen/X86/sad.ll
3–8

This sounds like a bug - the test case has no target specific intrinsics so we should be able to handle it on every target.

Simon and Ahmed seem to have a good handle on this, just wanted to say...

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

I would strongly encourage folks to use update_llc_test_checks.py. If it's not working here, we should make a version that does work. The resulting test accuracy is just such a huge improvement.

congh added a comment.Mar 10 2016, 3:31 PM

Simon and Ahmed seem to have a good handle on this, just wanted to say...

The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.

I would strongly encourage folks to use update_llc_test_checks.py. If it's not working here, we should make a version that does work. The resulting test accuracy is just such a huge improvement.

I have updated the test file by running update_llc_test_checks.py. Thanks!

lib/Target/X86/X86ISelLowering.cpp
28988–29000

You are right! I have updated the test to cover both AVX512BW and AVX512F. Thanks!

test/CodeGen/X86/sad.ll
3–8

Yes, this is a bug. When I try to compile sad-avx2.ll with -mattr=+avx512bw, I got an internal error. The bug is from DAGCombiner::visitINSERT_SUBVECTOR which tries to convert a INSERT_SUBVECTOR operation into a
CONCAT_VECTORS one. We need to check if two operands of CONCAT_VECTORS have the same type there. This is fixed in this patch.

I then could combine those three tests into one file.

congh updated this revision to Diff 50367.Mar 10 2016, 3:31 PM
congh edited edge metadata.

Update the patch according to Simon and Chandler's comments.

What would be necessary to enable PSADBW to match for SSE2 in the 32i8/64i8 cases (and AVX2/AVX512F in the 64i8 case)?

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13644

We're referencing N->getOperand(1) enough now that we can bring this out as N1.

13653

Add this condition to the outer if()

lib/Target/X86/X86ISelLowering.cpp
29046

implemented

congh added a comment.Mar 11 2016, 3:14 PM

What would be necessary to enable PSADBW to match for SSE2 in the 32i8/64i8 cases (and AVX2/AVX512F in the 64i8 case)?

Then we need to consider this case in detectSADPattern: currently we don't handle too long registers. This is ok in aspect of auto-vectorization as we won't get too long vectors, so I am wondering if it is worth the effort to handle those cases. If we want to do it, we need to tell X86 isel how to split X86ISD::PSADBW. What do you think?

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13644

Done.

13653

OK.

lib/Target/X86/X86ISelLowering.cpp
29046

Done.

congh updated this revision to Diff 50490.Mar 11 2016, 3:15 PM

Update the patch according to Simon's comments.

Minor clanup request.

Also, can you get rid of the stack instructions in the test (nounwind?)

lib/Target/X86/X86ISelLowering.cpp
29090

DAG.getConstant(0, DL, InVT)

congh marked an inline comment as done.Mar 23 2016, 2:15 PM

Minor clanup request.

Also, can you get rid of the stack instructions in the test (nounwind?)

OK. Done. Thanks!

congh updated this revision to Diff 51474.Mar 23 2016, 2:15 PM

Update the patch according to Simon's comments.

ab added a comment.Mar 29 2016, 9:51 AM

So, I understand the huge tests are required to trigger the reduction recognizer. Have you considered changing that, for the sake of testability? The most obvious thing I can think of is to represent the flag as metadata in IR. That has other benefits:

  • do the analysis in IR rather than in SelectionDAGBuilder, which is already a big enough mess as it is ;)
  • only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?

Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).

hfinkel accepted this revision.Apr 26 2016, 5:42 PM
hfinkel edited edge metadata.
In D14840#385786, @ab wrote:

So, I understand the huge tests are required to trigger the reduction recognizer. Have you considered changing that, for the sake of testability? The most obvious thing I can think of is to represent the flag as metadata in IR. That has other benefits:

  • do the analysis in IR rather than in SelectionDAGBuilder, which is already a big enough mess as it is ;)

Doing this early means that we need to presuppose what the backend legalization will do, understand target features,etc. The fact that it is hard to test backend components should be addressed by improving those components, not by moving things into the IR level unnecessarily. There are a few places where we do this kind of thing out of necessity (because, for example, we lack sufficient loop analysis capabilities in the backend; PPCCTRLoops is an example), and it's a mess. You're certainly right, however, that SelectionDAGBuilder could use some refactoring (or at least partitioning). It might never happen, however, before the entire thing gets replaced by GlobalISel ;)

  • only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?

Other targets will use this as well. PowerPC has vsumsws and friends. AArch64 has vaddv, etc.

Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).

Honestly, the test cases don't look that bad. The IR, at least, is fairly succinct. It's never going to be super-small because we're matching a non-trivial pattern. The CHECK lines are definitely over constraining, but that's a symptom of the way they're autogenerated. I don't actually like tests like this which seem to strongly cover irrelevant register-alllocation decisions, but that's a larger problem with many tests in the backend. Hand-coded tests making proper use of CHECK-DAG and named regex patterns for allocated registers would be smaller.

LGTM.

This revision is now accepted and ready to land.Apr 26 2016, 5:42 PM
This revision was automatically updated to reflect the committed changes.