This patch lowers the X86 vector packing with saturation intrinsics to native LLVM IR. Comes with a clang patch (D45720).
Details
Diff Detail
Event Timeline
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
34717 | This should use !Subtarget.useBWIRegs() instead of hasBWI. And you can lose the hasAVX512 check. Someday I'm hoping to finish the zmm=low changes that sometimes make 512-bits illegal even though BWI is supported. |
Please can you update the sse2-intrinsics-fast-isel.ll etc. to use the same codegen as the clang builtins tests.
Updated fast-isel tests for SSE and AVX2. There don't appear to be any AVX512 fast-isel tests for these intrinsics, will add them if required.
llvm/lib/IR/AutoUpgrade.cpp | ||
---|---|---|
1054 | This shuffle mask creation is very confusing, please can you make it more obvious (see createPackShuffleMask in X86ISelLowering.cpp) | |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
34791 | This seems all rather messy - please can you try to clean it up? | |
llvm/test/CodeGen/X86/sse41-intrinsics-x86.ll | ||
161 | Please can you leave these tests for now - add a FIXME comment about this PACKS being upgraded if you wish. | |
llvm/test/CodeGen/X86/vector-shuffle-combining-avx2.ll | ||
879 | Please can you leave these tests for now - add a FIXME comment about this PACKS being upgraded if you wish. |
@RKSimon what's the motivation for leaving the tests for upgraded intrinsics in place instead of moving them to appropriate files?
Tidied up the shuffle mask emission and checking and moved the PACKUSDW SSE4.1 check into tracePackVectorShuffle.
Sorry @mike.dvoretsky I lost track of this patch! IIRC those shuffle combining tests are the only test coverage for PACKSS/PACKUS shuffle decoding in getFauxShuffleMask
Restored shuffle combining tests. Fixed the folding to account for unary (A == B) patterns, added tests.
Hmm I suppose that's "more stuff which breaks our JIT compilation".
Removing these intrinsics actually is looking a bit ugly from my point of view (mesa llvmpipe developer).
One issue is that we take care to only emit instructions as necessary - that is if the packxxx instructions aren't available, we only do whatever is necessary to accomplish the pack - so if we already know we cannot have values outside the target range (or maybe we can have but we don't care for some reason) we make sure to only emit shuffle. But if packxxx is available we still prefer that of course.
And if we can't use intrinsics, we have to emit loads of other IR instructions (and they need to match what the autoupgrade would do), but still need a separate path for when we know the backend won't be able to use packxxx intrinsic, which is getting real ugly.
And I can only hope that whenever some other backend implements similar autoupgrade (pack instructions aren't exactly sse specific) they'll actually use the _same_ pattern for it, otherwise that's really going to get insane.
<rant mode on>
Removing intrinsics in general is quite a pain for us, because to actually still get usable (performance-wise) code we need to match what the auto-upgrader does manually exactly (and do some prayers the pattern isn't changed at some point). This is for instance apparent also with the change to remove the saturated adds/subs (which was reverted for now), where the pattern is rather complex (I bet it doesn't help compile times), and our fallback path was in fact using different instructions to emulate it (I'd rather wish there were llvm.add/subsat instructions instead, probably just about all cpus can actually do it natively...).
If there's some easy way to use autoupgrade for jit compilation, please enlighten me (although it's only viable if it has insignificant cost compared to rest of compilation), because my quick experiments weren't exactly successful...
<rant mode off>
This shuffle mask creation is very confusing, please can you make it more obvious (see createPackShuffleMask in X86ISelLowering.cpp)