Uses cheap in-lane instructions to avoid expensive cross-lane shuffles. In some cases it also avoids needing the wider YMM/ZMM registers.
It might be better to split the rotate/funnel and general shift lowering into 2 separate patches.
llvm/lib/Target/X86/X86ISelLowering.cpp

Line | Comment
---|---
31023 | Describe the codegen?
31028 | This logic should be something like `if (!supportedVectorVarShift(VT) && supportedVectorVarShift(WideVT))`?
31041 | Lo/Hi might be better names?
31044 | getTargetVShiftByConstNode?
31046 | Don't you already check for `!DAG.isSplatValue(Amt)` in the opening `if ()`?
31054 | LowerShift has an unsigned Opc we can use.
31077 | Do we need the ternlog path? Don't existing combines (canonicalizeBitSelect etc.) deal with it later?
31454 | It'd be better to adjust the logic below instead of keeping a second copy.
31667 | "See below" what? Add a new self-contained comment instead of relying on code layout.
31680 | clang-format (indentation)
31693 | Same for funnel shifts - by tweaking this logic we might be able to avoid the duplicated code.
llvm/test/CodeGen/X86/avx2-shift.ll

Line | Comment
---|---
547 | The instruction count seems to increase from 5 to 7. Could you check whether it improves performance with https://uica.uops.info or llvm-mca?
@RKSimon Thanks for the review; I've created another diff with only the LowerRotate changes: https://reviews.llvm.org/D149071
@LuoYuanke Added some analysis with https://uica.uops.info
llvm/test/CodeGen/X86/avx2-shift.ll

Line | Comment
---|---
547 | Yes, it looks like an improvement, unless I'm misinterpreting the website's results:
575 | Left: https://bit.ly/3oLtPcP
601 | Left: https://bit.ly/41Ewx2j