Split out from https://reviews.llvm.org/D146357, with improvements.
When running tests for LowerShift, I discovered some poor codegen in rotate and funnel shift tests. This patch attempts to address some of them.
Splitting the vectors with unpack and using double-bitwidth shifts may improve performance, according to tests with https://uica.uops.info:
- No cross-lane shuffles
- No dirtying double-width registers
- Massive improvement for AVX2 rotates in some cases (var_funnnel_v8i16, var_funnnel_v16i16), because unpack is currently only used for vXi8 vectors.
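
For reviewers unfamiliar with the trick, here is a minimal scalar sketch of what the unpack plus double-bitwidth shift approach computes for a single 16-bit lane. It is an illustration only, not the code in this patch: the function names are hypothetical, and the assumption is that the vector form interleaves the two inputs into 32-bit lanes (unpack), shifts each 32-bit lane by the masked amount, and keeps the upper 16 bits of each lane.

```cpp
#include <cstdint>

// Hypothetical scalar model of one vXi16 lane (names are not from the patch).
// "Unpack" is modelled by placing hi in the upper half and lo in the lower
// half of a double-width (32-bit) value.
constexpr std::uint16_t fshl16_via_widening(std::uint16_t hi, std::uint16_t lo,
                                            unsigned amt) {
  std::uint32_t wide = (static_cast<std::uint32_t>(hi) << 16) | lo;
  wide <<= (amt & 15u);                          // amount taken modulo 16
  return static_cast<std::uint16_t>(wide >> 16); // keep the upper half
}

// A rotate is a funnel shift with both inputs equal.
constexpr std::uint16_t rotl16_via_widening(std::uint16_t x, unsigned amt) {
  return fshl16_via_widening(x, x, amt);
}

// Straightforward reference funnel shift left, for comparison.
constexpr std::uint16_t fshl16_reference(std::uint16_t hi, std::uint16_t lo,
                                         unsigned amt) {
  amt &= 15u;
  if (amt == 0)
    return hi; // avoid the out-of-range shift by 16 below
  return static_cast<std::uint16_t>((hi << amt) | (lo >> (16 - amt)));
}

static_assert(fshl16_via_widening(0x1234, 0xABCD, 4) == 0x234A, "fshl by 4");
static_assert(rotl16_via_widening(0x8001, 1) == 0x0003, "rotl by 1");
```

The widened form needs no special case for a zero amount, since the double-width shift never shifts by the full lane width.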
Code analyser:
SKX version (left): https://bit.ly/3NamuO5
VBMI version (left): https://bit.ly/444ZUfB
Common (right): https://bit.ly/3n7q4xQ
The unpack version looks better despite having more instructions.