As we now can detect vectorized reduction operations, it's time to use this feature! This patch detects the SAD pattern on X86 so that much better code will be emitted once the pattern is matched.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I am splitting this patch and a subset is now under review
(http://reviews.llvm.org/D14897). After that patch is landed, I will
send this patch to upstream for review.
thanks,
Cong
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28588–28600 ↗ | (On Diff #48154) | Perhaps you could replace this with BuildVectorSDNode::getConstantSplatNode()? ..but looking at the uses, you can replace them all with isBuildVectorAllOnes/isBuildVectorAllZeros. |
28613 ↗ | (On Diff #48154) | All vector MVTs have power-of-2 NumElems, so I think you can drop the second part (and NumElems, and merge the ifs). |
28616–28623 ↗ | (On Diff #48154) | What happens on v2i32? Also, doesn't this require SSE2? I'd just explicitly check subtarget and type, like we do elsewhere. So, something like: if ((VT == MVT::v16i32 && !Subtarget.hasAVX512()) || (VT == MVT::v8i32 && !Subtarget.hasAVX2()) || (VT == MVT::v4i32 && !Subtarget.hasSSE2())) return SDValue(); |
28697–28701 ↗ | (On Diff #48154) | Why is this necessary? It seems to me that this is forcing RegSize to 128 (because even v16i8, which requires AVX-512 as it's added as v16i32 has a size of 128). In turn, that forces XMM-sized PSADs to be generated, even on AVX2. Am I missing something? |
28740–28742 ↗ | (On Diff #48154) | if (SDValue Sad = detectSADPattern(N, DAG, Subtarget)) return Sad; |
test/CodeGen/X86/sad.ll | ||
1–6 ↗ | (On Diff #48154) | IMHO you should only test the minimal IR with llc here. I see the argument for testing the whole thing, but it seems like 1) that prevents you from testing unrelated constructs (not all IR comes from the vectorizer), and 2) there should already be relevant opt tests. If this fires on the test-suite (IIRC paq8p could benefit from this?), then people will notice regressions. If it doesn't, is there something we could add to the test-suite that exposes this? Also, I don't think you need -mcpu. |
Thanks for the review! Please check my inline comments.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28588–28600 ↗ | (On Diff #48154) | I also need to check if a vector contains all -1s, which could not be satisfied by isBuildVectorAllOnes/isBuildVectorAllZeros. How to use getConstantSplatNode here? |
28616–28623 ↗ | (On Diff #48154) | I think the case of v2i32 may not be generated by the vectorizer but as we are not just handling the results of auto-vectorization, I think it is better to cover this case. For v16i32 we only need SSE2 as we are actually handling v16i8, and the result will be collected as a v2i64. We need AVX512 to handle v64i32. Therefore, my previous check here is incorrect, which should be if (VT.getSizeInBits() / 4 > RegSize) return SDValue(); |
28697–28701 ↗ | (On Diff #48154) | v16i8 is actually handled by SSE2, while AVX-512 is handling v64i8. This means even on AVX512 we should still use SSE2 instructions to handle v16i8. |
test/CodeGen/X86/sad.ll | ||
1–6 ↗ | (On Diff #48154) | I just want to avoid writing the test case for different targets and hence the test code size is minimized. But I agree with you on that we should not rely on auto-vectorization for the test of this patch. So the minimal IR here would be several huge functions for different targets? |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28751–28763 ↗ | (On Diff #49267) | Right, isBuildVectorAllOnes looks for -1 (not 1), as it checks that "all of the elements are ~0 or undef." |
test/CodeGen/X86/sad.ll | ||
2–7 ↗ | (On Diff #49267) | Does it need to be huge? After all, this is only testing the PSAD ISel, and the pattern seems pretty minimal; would it work to test variants of your example in detectSADPattern? |
Both comments addressed. PTAL.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28751–28763 ↗ | (On Diff #49267) | You are right. Updated. Thanks! |
test/CodeGen/X86/sad.ll | ||
2–7 ↗ | (On Diff #49267) | I need to compose the test so that the reduction vector operations can be detected. But I think I can still write small tests. To test on different targets, I have to split the test files into threes for testing sse2/avx2/avx512 as I found llc cannot parse opt -mattr=+avx512bw generated IR on AVX2. |
The checks on the tests are quite poor - I understand that update_llc_test_checks.py output might be too much but please see if you can expand the checks to give a better idea of context.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28751–28763 ↗ | (On Diff #49424) | 512-bit PSAD requires AVX512BW (Subtarget.hasBWI()) - plain AVX512 can't handle 512-bit vector elements smaller than 32-bits - please ensure the tests run with avx512f (which will use the AVX2 path) as well as avx512bw. |
test/CodeGen/X86/sad.ll | ||
2–7 ↗ | (On Diff #49267) | This sounds like a bug - the test case has no target specific intrinsics so we should be able to handle it on every target. |
Simon and Ahmed seem to have a good handle on this, just wanted to say...
I would strongly encourage folks to use update_llc_test_checks.py. If it's not working here, we should make a version that does work. The resulting test accuracy is just such a huge improvement.
I have updated the test file by running update_llc_test_checks.py. Thanks!
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
28751–28763 ↗ | (On Diff #49424) | You are right! I have updated the test to cover both AVX512BW and AVX512F. Thanks! |
test/CodeGen/X86/sad.ll | ||
2–7 ↗ | (On Diff #49267) | Yes, this is a bug. When I try to compile sad-avx2.ll with -mattr=+avx512bw, I got an internal error. The bug is from DAGCombiner::visitINSERT_SUBVECTOR which tries to convert a INSERT_SUBVECTOR operation into a I then could combine those three tests into one file. |
What would be necessary to enable PSADBW to match for SSE2 in the 32i8/64i8 cases (and AVX2/AVX512F in the 64i8 case)?
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
13643 ↗ | (On Diff #50367) | We're referencing N->getOperand(1) enough now that we can bring this out as N1. |
13650 ↗ | (On Diff #50367) | Add this condition to the outer if() |
lib/Target/X86/X86ISelLowering.cpp | ||
29046 ↗ | (On Diff #50367) | implemented |
Then we need to consider this case in detectSADPattern: currently we don't handle too long registers. This is ok in aspect of auto-vectorization as we won't get too long vectors, so I am wondering if it is worth the effort to handle those cases. If we want to do it, we need to tell X86 isel how to split X86ISD::PSADBW. What do you think?
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
13643 ↗ | (On Diff #50367) | Done. |
13650 ↗ | (On Diff #50367) | OK. |
lib/Target/X86/X86ISelLowering.cpp | ||
29046 ↗ | (On Diff #50367) | Done. |
Minor clanup request.
Also, can you get rid of the stack instructions in the test (nounwind?)
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
29090 ↗ | (On Diff #50490) | DAG.getConstant(0, DL, InVT) |
So, I understand the huge tests are required to trigger the reduction recognizer. Have you considered changing that, for the sake of testability? The most obvious thing I can think of is to represent the flag as metadata in IR. That has other benefits:
- do the analysis in IR rather than in SelectionDAGBuilder, which is already a big enough mess as it is ;)
- only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?
Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).
Doing this early means that we need to presuppose what the backend legalization will do, understand target features,etc. The fact that it is hard to test backend components should be addressed by improving those components, not by moving things into the IR level unnecessarily. There are a few places where we do this kind of thing out of necessity (because, for example, we lack sufficient loop analysis capabilities in the backend; PPCCTRLoops is an example), and it's a mess. You're certainly right, however, that SelectionDAGBuilder could use some refactoring (or at least partitioning). It might never happen, however, before the entire thing gets replaced by GlobalISel ;)
- only do the analysis for targets that use it: if only X86 can select reduction ops, what's the point in recognizing them elsewhere?
Other targets will use this as well. PowerPC has vsumsws and friends. AArch64 has vaddv, etc.
Another (possibly even crappier) alternative: add some "stress-test" commandline flag that overrides the isVectorReduction check on ADDs (or even adds the flag everywhere).
Honestly, the test cases don't look that bad. The IR, at least, is fairly succinct. It's never going to be super-small because we're matching a non-trivial pattern. The CHECK lines are definitely over constraining, but that's a symptom of the way they're autogenerated. I don't actually like tests like this which seem to strongly cover irrelevant register-alllocation decisions, but that's a larger problem with many tests in the backend. Hand-coded tests making proper use of CHECK-DAG and named regex patterns for allocated registers would be smaller.
LGTM.