SKX has an objectively faster shift than shuffle, on all other targets
the two have equal performance (with maybe a slight preference for
shifts because p5 is a more common bottleneck).
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/X86/min-legal-vector-width.ll | ||
---|---|---|
173–277 | Tests for other conditions are gone. You may need to tune the combinations in --check-prefixes. |
Without AVX512 we can't load fold arg0 for bit-shift ops - isn't that likely to be a problem?
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
18321 | This approach isn't particularly easy to grok - why not just add an additional lowerShuffleAsShift check before behind a hasFasterShiftThanShuffle check? | |
llvm/test/CodeGen/X86/pr57340.ll | ||
272 | Are byte shifts faster I thought they were still Port5 bound? |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
18321 | Was to avoid duplicating ~30 lines of code, but will do for v2. | |
llvm/test/CodeGen/X86/pr57340.ll | ||
272 | Same perf/code size for byte-shift vs shuffle so figure its all the same. I guess, however, it could have a drawback because its harder to switch domains for shift than shuffle so I can update logic to only do bit-shift. Also note this particular case actually reflects a missed optimization in combineExtractVectorElt because it should be just using vpextrw but I still haven't figured out exactly whats missing. |
llvm/test/CodeGen/X86/pr57340.ll | ||
---|---|---|
272 | The combineExtractVectorElt peek through shuffle code has slowly evolved as we encountered individual regressions - I'm not surprised it still misses many. |
I'm not sure what you mean?
But the tuning is only for SKX which has avx512.
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
18321 | Refactored as you suggest everything except matchunaryshufflepermute helper where it would cause too much duplication imo. |
llvm/test/CodeGen/X86/min-legal-vector-width.ll | ||
---|---|---|
11 | Have you seen if you can add additional common check-prefixes to reduce the amount of duplications below? |
llvm/test/CodeGen/X86/min-legal-vector-width.ll | ||
---|---|---|
11 | Yeah, sorry misunderstood what it meant the first time. |
llvm/lib/Target/X86/X86.td | ||
---|---|---|
532 | Maybe rephrase this (e.g. you refer to vprold etc. but not shifts). "Prefer lowering shuffles on AVX512 targets (e.g. Skylake Server) to shifts/rotate if they can use more ports than regular shuffles." ? | |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
38727 | Instead of a loop - why not move the shuffle/shift paths into lambdas and then do this - it should be easier to understand: if (Subtarget.hasFasterShiftThanShuffle()) { if (matchUnaryPermuteAsBitShift()) {} if (matchUnaryPermuteAsIntShuffle()) {} } else { if (matchUnaryPermuteAsIntShuffle()) {} if (matchUnaryPermuteAsBitShift()) {} } |
llvm/lib/Target/X86/X86.td | ||
---|---|---|
532 | Added "imm" as prefer to shifts/rotate b.c we don't want var shift, but otherwise done. | |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
38727 | The reason I prefered loop her is there is a lot of boiler plate for type /target checking that would also need to be copied. Seemed like loop was preferable to ~30 lines of dup code. But your call. LMK. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
16574 | remove newline | |
38798–38814 | newline | |
llvm/lib/Target/X86/X86Subtarget.h | ||
252 | You shouldn't need this - the GET_SUBTARGETINFO_MACRO above should have created the getter |
No, the x86fixupinsttuning pass is only meant for guaranteed replacements. There is not shuffle that is always replaceable with shifts.
After this patch, I see an msan issue running this test internally; strangely I don't see a failure on any sanitizer buildbot yet.
There are ~400 test failures in LLVM, e.g. for the test in llvm/test/CodeGen/Generic/vector-casts.ll:
==3443==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55f66059dba1 in matchUnaryPermuteShuffle(llvm::MVT, llvm::ArrayRef<int>, llvm::APInt const&, bool, bool, llvm::SelectionDAG const&, llvm::X86Subtarget const&, unsigned int&, llvm::MVT&, unsigned int&) llvm/lib/Target/X86/X86ISelLowering.cpp:38834:38 #1 0x55f66058ed18 in combineX86ShuffleChain(llvm::ArrayRef<llvm::SDValue>, llvm::SDValue, llvm::ArrayRef<int>, int, bool, bool, bool, llvm::SelectionDAG&, llvm::X86Subtarget const&) llvm/lib/Target/X86/X86ISelLowering.cpp:39512:9 #2 0x55f6603aa410 in combineX86ShufflesRecursively(llvm::ArrayRef<llvm::SDValue>, int, llvm::SDValue, llvm::ArrayRef<int>, llvm::ArrayRef<llvm::SDNode const*>, unsigned int, unsigned int, bool, bool, bool, llvm::SelectionDAG&, llvm::X86Subtarget const&) llvm/lib/Target/X86/X86ISelLowering.cpp:40791:27 #3 0x55f660620689 in combineX86ShufflesRecursively(llvm::SDValue, llvm::SelectionDAG&, llvm::X86Subtarget const&) llvm/lib/Target/X86/X86ISelLowering.cpp:40818:10 #4 0x55f66045d4b4 in combineVectorPack(llvm::SDNode*, llvm::SelectionDAG&, llvm::TargetLowering::DAGCombinerInfo&, llvm::X86Subtarget const&) llvm/lib/Target/X86/X86ISelLowering.cpp:48525:21 #5 0x55f6603c092c in llvm::X86TargetLowering::PerformDAGCombine(llvm::SDNode*, llvm::TargetLowering::DAGCombinerInfo&) const llvm/lib/Target/X86/X86ISelLowering.cpp:57229:36 #6 0x55f661b6df5b in (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:2014:16 #7 0x55f661b6c8f4 in Run llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:1795:18 #8 0x55f661b6c8f4 in llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:26935:36 #9 0x55f66214e2e3 in llvm::SelectionDAGISel::CodeGenAndEmitDAG() llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:923:13 #10 0x55f662147e92 in llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:1633:7 #11 0x55f662141c18 in llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:480:3 #12 0x55f660981cf0 in (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) llvm/lib/Target/X86/X86ISelDAGToDAG.cpp:191:25 #13 0x55f661633d50 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) llvm/lib/CodeGen/MachineFunctionPass.cpp:91:13 #14 0x55f664c0c6f7 in llvm::FPPassManager::runOnFunction(llvm::Function&) llvm/lib/IR/LegacyPassManager.cpp:1430:27 #15 0x55f664c192d9 in llvm::FPPassManager::runOnModule(llvm::Module&) llvm/lib/IR/LegacyPassManager.cpp:1476:16 #16 0x55f664c0d785 in runOnModule llvm/lib/IR/LegacyPassManager.cpp:1545:27 #17 0x55f664c0d785 in llvm::legacy::PassManagerImpl::run(llvm::Module&) llvm/lib/IR/LegacyPassManager.cpp:535:44 #18 0x55f65d621281 in compileModule(char**, llvm::LLVMContext&) llvm/tools/llc/llc.cpp:733:8 #19 0x55f65d61a872 in main llvm/tools/llc/llc.cpp:420:22 Uninitialized value was created by an allocation of 'Shuffle' in the stack frame #0 0x55f66058a88e in combineX86ShuffleChain(llvm::ArrayRef<llvm::SDValue>, llvm::SDValue, llvm::ArrayRef<int>, int, bool, bool, bool, llvm::SelectionDAG&, llvm::X86Subtarget const&) llvm/lib/Target/X86/X86ISelLowering.cpp:39461:3 SUMMARY: MemorySanitizer: use-of-uninitialized-value llvm/lib/Target/X86/X86ISelLowering.cpp:38834:38 in matchUnaryPermuteShuffle(llvm::MVT, llvm::ArrayRef<int>, llvm::APInt const&, bool, bool, llvm::SelectionDAG const&, llvm::X86Subtarget const&, unsigned int&, llvm::MVT&, unsigned int&)
I think the issue is:
+ // Byte shifts can be slower so only match them on second attempt. + if (Order == 0 && + (Shuffle == X86ISD::VSHLDQ || Shuffle == X86ISD::VSRLDQ)) + continue;
It comes before the check of
+ if (0 < ShiftAmt && (!ShuffleVT.is512BitVector() || Subtarget.hasBWI() || + 32 <= ShuffleVT.getScalarSizeInBits())) { + PermuteImm = (unsigned)ShiftAmt; + return true; + }
and the 0 < ShiftAmt check if basically a check if actually found/set Shuffle.
Don't think the bug actually can change behavior but is bug none the less.
Will post patch to fix.
I think the issue is:
+ // Byte shifts can be slower so only match them on second attempt. + if (Order == 0 && + (Shuffle == X86ISD::VSHLDQ || Shuffle == X86ISD::VSRLDQ)) + continue;It comes before the check of
+ if (0 < ShiftAmt && (!ShuffleVT.is512BitVector() || Subtarget.hasBWI() || + 32 <= ShuffleVT.getScalarSizeInBits())) { + PermuteImm = (unsigned)ShiftAmt; + return true; + }and the 0 < ShiftAmt check if basically a check if actually found/set Shuffle.
Don't think the bug actually can change behavior but is bug none the less.Will post patch to fix.
Yeah, that looks like the cause. There's another place where matchShuffleAsShift is called that seems problematic. Adding some assertions can catch the issue outside of an msan build:
$ git diff diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 71dad73cfd9b..6af0ce2fd14b 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -14118,6 +14118,7 @@ static SDValue lowerShuffleAsShift(const SDLoc &DL, MVT VT, SDValue V1, V = V2; } + assert(ShiftAmt >= 0 && "matchShuffleAsShift failed twice"); if (BitwiseOnly && (Opcode == X86ISD::VSHLDQ || Opcode == X86ISD::VSRLDQ)) return SDValue(); @@ -38829,6 +38830,7 @@ static bool matchUnaryPermuteShuffle(MVT MaskVT, ArrayRef<int> Mask, int ShiftAmt = matchShuffleAsShift(ShuffleVT, Shuffle, MaskScalarSizeInBits, Mask, 0, Zeroable, Subtarget); + assert(ShiftAmt >= 0 && "matchShuffleAsShift failed"); // Byte shifts can be slower so only match them on second attempt. if (Order == 0 && (Shuffle == X86ISD::VSHLDQ || Shuffle == X86ISD::VSRLDQ))
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
14121 | here |
see: D145129
Thanks!
if you have concerns over whether that works I can just revert this.
I don't have concerns but I also don't know how anything in this file is supposed to work :)
Fixing forward is fine with me if it can land sometime soon.
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
14121 | Yeah, that too. Good Catch! |
Maybe rephrase this (e.g. you refer to vprold etc. but not shifts).
"Prefer lowering shuffles on AVX512 targets (e.g. Skylake Server) to shifts/rotate if they can use more ports than regular shuffles." ?