Differential D38472
[X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
Closed, Public
Authored by RKSimon on Oct 2 2017, 11:02 AM.
Details

Summary
If the upper bits of a truncation shuffle pattern's inputs have at least the minimum number of sign/zero bits, then we can safely use PACKSS/PACKUS as shuffles.

Partial fix for https://bugs.llvm.org/show_bug.cgi?id=34773
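As a rough illustration of the summary (this is not code from the patch; the helper name and setup are invented), the C++ sketch below expresses a v8i16 -> v16i8 truncation shuffle with PACKUSWB: the inputs are masked so the upper 8 bits of every lane are known zero, which makes the unsigned-saturating pack a pure truncation, the condition the new lowering checks for.

```cpp
#include <immintrin.h>

// Hypothetical example: truncate eight 16-bit lanes from each input to bytes
// and concatenate the results. Both inputs are masked to 0x00FF first, so
// every lane is in [0, 255] and PACKUSWB's unsigned saturation never fires --
// the pack behaves as the truncation shuffle.
static __m128i trunc_v8i16_to_v16i8(__m128i a, __m128i b) {
  const __m128i mask = _mm_set1_epi16(0x00FF);
  a = _mm_and_si128(a, mask);    // upper 8 bits of every lane are now zero
  b = _mm_and_si128(b, mask);
  return _mm_packus_epi16(a, b); // PACKUSWB acts as the v16i8 truncation
}
```

The PACKSS case is analogous: if every lane already has at least nine sign bits (i.e. its value fits in a signed byte), PACKSSWB truncates it without saturating.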
Diff Detail
Event Timeline

Looks like many significant improvements, but a couple of possible regressions where we now get a shift+pack instead of a single pshufb, e.g. in trunc16i32_16i8_lshr, trunc8i32_8i16_lshr, and a couple of other cases.
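For reference, here is a hedged sketch of the two code shapes being compared (not taken from the test files; the function names are invented, and this uses a simplified 16-bit-lane analogue of the named 32-bit lshr+truncate tests): a single PSHUFB can pick the high byte of each lane directly, while the pack lowering reaches the same result with a shift followed by a pack.

```cpp
#include <immintrin.h>

// One-instruction form: PSHUFB selects bytes 1,3,5,...,15 (the high byte of
// each 16-bit lane) and zeroes the upper half of the result.
static __m128i high_bytes_pshufb(__m128i v) {
  const __m128i idx = _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15,
                                    -1, -1, -1, -1, -1, -1, -1, -1);
  return _mm_shuffle_epi8(v, idx);
}

// Two-instruction form: PSRLW moves each lane's high byte down, then PACKUSWB
// truncates; the upper eight result bytes come from the zero vector.
static __m128i high_bytes_shift_pack(__m128i v) {
  const __m128i hi = _mm_srli_epi16(v, 8);
  return _mm_packus_epi16(hi, _mm_setzero_si128());
}
```

Whether the extra instruction is a win depends on the target's shuffle and shift throughput (and on whether the PSHUFB control-mask load gets hoisted), which relates to the throughput concerns raised in the later comment below.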
This revision is now accepted and ready to land. Oct 2 2017, 5:58 PM

Added comments before I commit. The remaining regressions should be handled when we enable shuffle combining to create PACKSS/PACKUS, as well as to combine from them, but that can only be done once this lowering has landed.
Closed by commit rL314788: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS (authored by RKSimon). Oct 3 2017, 5:03 AM
This revision was automatically updated to reflect the committed changes.

Some CPUs have good pblendw throughput; it's not always a win to do two shifts. (But I guess that's the same problem you mentioned in https://reviews.llvm.org/D38472?id=117394#inline-335636, that the scheduler model isn't close to figuring out when to use a variable shuffle to reduce port pressure?) I hope clang isn't going to start compiling _mm_shuffle_epi8 into psrlw $8, %xmm0 / packsswb %xmm0,%xmm0 in cases where that's not a win, just because the shuffle control constant lets it do that. It's a tricky tradeoff between aggressive optimization of intrinsics helping novices (or code tuned for a uarch that isn't the target) vs. defeating deliberate tuning choices. I think it's good to have at least one compiler (clang) that does aggressively optimize, since we can always use gcc instead or for comparison.
Revision Contents
Diff 117502
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll
llvm/trunk/test/CodeGen/X86/avx2-shift.ll
llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll
llvm/trunk/test/CodeGen/X86/avx2-vector-shifts.ll
llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll
llvm/trunk/test/CodeGen/X86/avx512-trunc.ll
llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-256.ll
llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-512.ll
llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll
llvm/trunk/test/CodeGen/X86/psubus.ll
llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-256.ll
llvm/trunk/test/CodeGen/X86/vector-compare-results.ll
llvm/trunk/test/CodeGen/X86/vector-shift-ashr-128.ll
llvm/trunk/test/CodeGen/X86/vector-trunc.ll
llvm/trunk/test/CodeGen/X86/vselect-avx.ll
llvm/trunk/test/CodeGen/X86/widen_arith-2.ll