Because VADD/VSUB/VMUL can execute on more ports than VINSERT, we would like to remove the FADD/FSUB/FMUL handling from combineConcatVectorOps.
For more details, please refer to fb91f0.
Problems
1. This patch causes the affected tests to generate worse code; we need to fix that.
(PS: after fb91f0, another commit 649b14 optimized these tests.)
2. We also need to root-cause why the middle-end vector optimizations did not fire for these cases.
For example, the function widen_fadd_v2f32_v8f32 in llvm/test/CodeGen/X86/widen_fadd.ll was not optimized (in AVX mode) from:
```llvm
define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %a2 = getelementptr inbounds i8, ptr %a0, i64 8
  %b2 = getelementptr inbounds i8, ptr %b0, i64 8
  %c2 = getelementptr inbounds i8, ptr %c0, i64 8
  %a4 = getelementptr inbounds i8, ptr %a0, i64 16
  %b4 = getelementptr inbounds i8, ptr %b0, i64 16
  %c4 = getelementptr inbounds i8, ptr %c0, i64 16
  %a6 = getelementptr inbounds i8, ptr %a0, i64 24
  %b6 = getelementptr inbounds i8, ptr %b0, i64 24
  %c6 = getelementptr inbounds i8, ptr %c0, i64 24
  %va0 = load <2 x float>, ptr %a0, align 4
  %vb0 = load <2 x float>, ptr %b0, align 4
  %va2 = load <2 x float>, ptr %a2, align 4
  %vb2 = load <2 x float>, ptr %b2, align 4
  %va4 = load <2 x float>, ptr %a4, align 4
  %vb4 = load <2 x float>, ptr %b4, align 4
  %va6 = load <2 x float>, ptr %a6, align 4
  %vb6 = load <2 x float>, ptr %b6, align 4
  %vc0 = fadd <2 x float> %va0, %vb0
  %vc2 = fadd <2 x float> %va2, %vb2
  %vc4 = fadd <2 x float> %va4, %vb4
  %vc6 = fadd <2 x float> %va6, %vb6
  store <2 x float> %vc0, ptr %c0, align 4
  store <2 x float> %vc2, ptr %c2, align 4
  store <2 x float> %vc4, ptr %c4, align 4
  store <2 x float> %vc6, ptr %c6, align 4
  ret void
}
```
to
```llvm
define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
  %va0 = load <8 x float>, ptr %a0, align 4
  %vb0 = load <8 x float>, ptr %b0, align 4
  %vc0 = fadd <8 x float> %va0, %vb0
  store <8 x float> %vc0, ptr %c0, align 4
  ret void
}
```
3. Should we add comments noting that the throughput of the shuffle instructions is lower than that of the fadd/fsub/fmul instructions?