insert undef, (bitcast vType X to scalar), C --> bitcast (shuffle X, undef, Mask)
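To make the pattern concrete, here is a minimal sketch on a small case (the value names %x, %s, %w, %r are illustrative, not taken from the patch): the insert of a bitcast scalar becomes a widening shuffle followed by a free vector-to-vector bitcast:

  ; before: vector -> scalar -> vector transition
  %s = bitcast <2 x i32> %x to i64
  %r = insertelement <2 x i64> undef, i64 %s, i32 0

  ; after: the 2 source elements land in the low 8 bytes; the rest is undef
  %w = shufflevector <2 x i32> %x, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  %r = bitcast <4 x i32> %w to <2 x i64>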
I think this is a universal improvement for vector IR code because it removes a vector-to-scalar-to-vector transition, but I'm not sure if the pattern is relevant to anything besides x86 AVX. In the motivating example from PR34716 ( https://bugs.llvm.org/show_bug.cgi?id=34716 ), we have:
define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <2 x i32> undef, i32 %x0, i32 0
  %2 = insertelement <2 x i32> %1, i32 %x1, i32 1
  %3 = bitcast <2 x i32> %2 to i64
  %4 = insertelement <8 x i64> undef, i64 %3, i32 0
  %5 = shufflevector <8 x i64> %4, <8 x i64> undef, <8 x i32> zeroinitializer
  ret <8 x i64> %5
}
This leads to inefficient movement between scalar GPRs and vector registers. With this patch, other vector instcombines will fire, reducing the IR to:
define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <16 x i32> undef, i32 %x0, i32 0  ; wide vec insert
  %2 = insertelement <16 x i32> %1, i32 %x1, i32 1     ; wide vec insert
  %3 = bitcast <16 x i32> %2 to <8 x i64>              ; free bitcast
  %4 = shufflevector <8 x i64> %3, <8 x i64> undef, <8 x i32> zeroinitializer ; splat
  ret <8 x i64> %4
}
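For reference, the fold can be observed in isolation by running instcombine on the original IR; a sketch assuming the legacy pass syntax and a hypothetical input file test.ll:

  opt -instcombine -S test.ll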
And through backend folds, a 32-bit AVX512 target could manage to load the two 32-bit scalars and splat them in a single instruction (although this doesn't quite happen yet; the current output is two instructions):
  vmovsd 4(%esp), %xmm0      # xmm0 = mem[0],zero
  vbroadcastsd %xmm0, %zmm0
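To inspect the backend output, something like the following llc invocation should work (the exact triple and feature string are assumptions for illustration):

  llc -mtriple=i686-unknown-unknown -mattr=+avx512f test.ll -o -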