Because VADD, VSUB, and VMUL can be executed on more ports than VINSERT,
we intend to remove FADD/FSUB/FMUL handling from combineConcatVectorOps.
For more details, please refer to fb91f0.
Problems
1 This patch causes the affected tests to generate worse code. We need to fix it.
PS: After fb91f0, another commit, 649b14, optimizes these tests.
2 We may also need to root-cause why the middle-end vector optimization didn't work for these cases.
For example, the function widen_fadd_v2f32_v8f32 in llvm/test/CodeGen/X86/widen_fadd.ll
is not optimized (in AVX mode)
 from
 define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %a2 = getelementptr inbounds i8, ptr %a0, i64 8
  %b2 = getelementptr inbounds i8, ptr %b0, i64 8
  %c2 = getelementptr inbounds i8, ptr %c0, i64 8
  %a4 = getelementptr inbounds i8, ptr %a0, i64 16
  %b4 = getelementptr inbounds i8, ptr %b0, i64 16
  %c4 = getelementptr inbounds i8, ptr %c0, i64 16
  %a6 = getelementptr inbounds i8, ptr %a0, i64 24
  %b6 = getelementptr inbounds i8, ptr %b0, i64 24
  %c6 = getelementptr inbounds i8, ptr %c0, i64 24
  %va0 = load <2 x float>, ptr %a0, align 4
  %vb0 = load <2 x float>, ptr %b0, align 4
  %va2 = load <2 x float>, ptr %a2, align 4
  %vb2 = load <2 x float>, ptr %b2, align 4
  %va4 = load <2 x float>, ptr %a4, align 4
  %vb4 = load <2 x float>, ptr %b4, align 4
  %va6 = load <2 x float>, ptr %a6, align 4
  %vb6 = load <2 x float>, ptr %b6, align 4
  %vc0 = fadd <2 x float> %va0, %vb0
  %vc2 = fadd <2 x float> %va2, %vb2
  %vc4 = fadd <2 x float> %va4, %vb4
  %vc6 = fadd <2 x float> %va6, %vb6
  store <2 x float> %vc0, ptr %c0, align 4
  store <2 x float> %vc2, ptr %c2, align 4
  store <2 x float> %vc4, ptr %c4, align 4
  store <2 x float> %vc6, ptr %c6, align 4
  ret void
}
 to
define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %va0 = load <8 x float>, ptr %a0, align 4
  %vb0 = load <8 x float>, ptr %b0, align 4
  %vc0 = fadd <8 x float> %va0, %vb0
  store <8 x float> %vc0, ptr %c0, align 4
  ret void
}
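
As a sanity check on why the widened form is a legal target, the sketch below models both IR shapes in plain C++ and compares the results lane by lane. It is only an illustration and not LLVM code: the helper names (add_four_v2f32, add_one_v8f32, widened_matches_narrow) are hypothetical, and it assumes the three buffers do not overlap.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of the "from" IR: four independent <2 x float> adds
// on the chunks at byte offsets 0, 8, 16 and 24.
// Assumption: a, b and c do not overlap.
inline void add_four_v2f32(const float *a, const float *b, float *c) {
  for (std::size_t chunk = 0; chunk < 4; ++chunk)
    for (std::size_t lane = 0; lane < 2; ++lane) {
      const std::size_t i = chunk * 2 + lane;
      c[i] = a[i] + b[i];
    }
}

// Hypothetical model of the "to" IR: a single <8 x float> add.
inline void add_one_v8f32(const float *a, const float *b, float *c) {
  for (std::size_t lane = 0; lane < 8; ++lane)
    c[lane] = a[lane] + b[lane];
}

// Returns true when both shapes produce identical results in every lane.
inline bool widened_matches_narrow() {
  const std::array<float, 8> a = {1.0f, 2.5f, -3.0f, 4.25f, 0.0f, 7.5f, -8.0f, 9.0f};
  const std::array<float, 8> b = {0.5f, -2.5f, 3.0f, 1.75f, 6.0f, 0.5f, 8.0f, -1.0f};
  std::array<float, 8> narrow{};
  std::array<float, 8> wide{};
  add_four_v2f32(a.data(), b.data(), narrow.data());
  add_one_v8f32(a.data(), b.data(), wide.data());
  return narrow == wide;
}
```

The add is elementwise, so the vector width does not change any lane's result. The sketch says nothing about why the middle end fails to perform the widening.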
Should we add comments noting that the shuffle instructions have lower throughput than the fadd/fsub/fmul instructions?