Currently the X86 shuffle lowering would widen the element type for shuffle if the mask element value is adjacent. For below example %t2 = add nsw <16 x i32> %t0, %t1 %t3 = sub nsw <16 x i32> %t0, %t1 %t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15> ret <16 x i32> %t4 Compiler would transform the shuffle to %t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3, <8 x i64> <i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> This may lose the oppotunity to let ISel select mask instruction when avx512 is enabled. This patch is to prevent the tranform when avx512 feature is enabled.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
44372 | This doesn't look right - shouldn't it be something like: APInt Mask = APIntOps::ScaleBitMask(ConstCond->getAPIntValue(), NumElts * 2) ? |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
44372 | Thanks, Simon. ScaleBitMask perfectly fit this coputation. |
I think we can probably generalize this very easily to any legal widening, e.g. to handle: https://gcc.godbolt.org/z/Pjz5qfYT7 (v32i16 -> v16i32)
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
44346 | Is it possible that CondVT is not vXi1 for ISD::VSELECT? I ask the question becasue the comments for ISD::VSELECT says "targets may change the condition type". /// At first, the VSELECT condition is of vXi1 type. Later, targets may /// change the condition type in order to match the VSELECT node using a /// pattern. The condition follows the BooleanContent format of the target. |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
129 | For this case not sure if it is worse than left side code (previous code). |
I've added some additional test coverage to shuffle-blend.ll - please can you rebase?
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
129 | Add a 128-bit vector limit? |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
129 | I'll add 128-bit vector limit. However in this case it is 128-bit vector. |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
4–5 | CHECK,AVX512BW,X86-AVX512BW |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
94 | retl can be merged to retq by ret{{[l|q]}}. Not sure why utils/update_llc_test_checks.py doesn't merge. |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
151 | regression? you might need to improve the 128-bit limit logic to account for vXi16 specifically |
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll | ||
---|---|---|
151 | There is PBLENDW, but there is no PBLENDB instruction, so it is better to widen to v2Xi16 from vXi8. Besides there is more instruction for 16-bit element (e.g., movsh). I'll investigate more on this issue. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
19326 | This list is going to get longer, and we're likely to miss patterns that only fold to target nodes later on - I'm wondering whether we could consider accepting any TLI.isBinOp() case here? |
LGTM - with one minor comment for future work
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
19326 | Please can you add a TODO about maybe converting this to TLI.isBinOp()? |
llvm/test/CodeGen/X86/haddsub-undef.ll | ||
---|---|---|
1053 | This looks a regression, I'll take a look at it. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
19326 | Thank Simon for the suggestion. It seems there is regression on some cases, I'll take a look at the regression. |
Yeah - that was what I saw as well - if you want to get this in for 15.x I'd recommend going back to the old switch statement - then investigate the binop general case later (if you solve it soon enough you request a merge)
Thank Simon for the review. All the comments are valuable. Let me land the old switch statement version first and then investigate the general binop.
This patch unfortunately causes crashes when building llvm-test-suite optimizing for AVX512.
Reproducer for llc:
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx" define i32 @test(<32 x i32> %0) #0 { entry: %1 = mul <32 x i32> %0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1> %2 = tail call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %1) ret i32 %2 } ; Function Attrs: nocallback nofree nosync nounwind readnone willreturn declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>) #1 attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" } attributes #1 = { nocallback nofree nosync nounwind readnone willreturn }
I've reverted the patch in the meantime to get current main back into a good state.
This list is going to get longer, and we're likely to miss patterns that only fold to target nodes later on - I'm wondering whether we could consider accepting any TLI.isBinOp() case here?