Since VADD/VSUB/VMUL can execute on more ports than VINSERT, we intend to remove the FADD/FSUB/FMUL handling from combineConcatVectorOps.
For more details, please refer to fb91f0.
Problems
1 This patch causes the affected tests to generate worse code. We need to fix that.
PS: After fb91f0, there is another commit 649b14 that optimizes these tests.
2 We may also need to root-cause why the middle-end vector optimizations didn't work for these cases.
For example, in llvm/test/CodeGen/X86/widen_fadd.ll, the function widen_fadd_v2f32_v8f32 was not optimized (in AVX mode)
from
define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
%a2 = getelementptr inbounds i8, ptr %a0, i64 8
%b2 = getelementptr inbounds i8, ptr %b0, i64 8
%c2 = getelementptr inbounds i8, ptr %c0, i64 8
%a4 = getelementptr inbounds i8, ptr %a0, i64 16
%b4 = getelementptr inbounds i8, ptr %b0, i64 16
%c4 = getelementptr inbounds i8, ptr %c0, i64 16
%a6 = getelementptr inbounds i8, ptr %a0, i64 24
%b6 = getelementptr inbounds i8, ptr %b0, i64 24
%c6 = getelementptr inbounds i8, ptr %c0, i64 24
%va0 = load <2 x float>, ptr %a0, align 4
%vb0 = load <2 x float>, ptr %b0, align 4
%va2 = load <2 x float>, ptr %a2, align 4
%vb2 = load <2 x float>, ptr %b2, align 4
%va4 = load <2 x float>, ptr %a4, align 4
%vb4 = load <2 x float>, ptr %b4, align 4
%va6 = load <2 x float>, ptr %a6, align 4
%vb6 = load <2 x float>, ptr %b6, align 4
%vc0 = fadd <2 x float> %va0, %vb0
%vc2 = fadd <2 x float> %va2, %vb2
%vc4 = fadd <2 x float> %va4, %vb4
%vc6 = fadd <2 x float> %va6, %vb6
store <2 x float> %vc0, ptr %c0, align 4
store <2 x float> %vc2, ptr %c2, align 4
store <2 x float> %vc4, ptr %c4, align 4
store <2 x float> %vc6, ptr %c6, align 4
ret void
}
to
define void @widen_fadd_v2f32_v8f32(ptr %a0, ptr %b0, ptr %c0) {
%va0 = load <8 x float>, ptr %a0, align 4
%vb0 = load <8 x float>, ptr %b0, align 4
%vc0 = fadd <8 x float> %va0, %vb0
store <8 x float> %vc0, ptr %c0, align 4
ret void
}
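The widening above merges four independent <2 x float> fadds over adjacent memory into a single <8 x float> fadd. A minimal Python sketch of that semantic equivalence (illustrative only, not the actual transform):

```python
# Sketch: four 2-wide element-wise adds over adjacent chunks produce
# the same result as one 8-wide element-wise add.
a = [float(i) for i in range(8)]        # stands in for the 8 floats at %a0
b = [float(10 * i) for i in range(8)]   # stands in for the 8 floats at %b0

# Narrow form: four independent <2 x float> fadds.
narrow = []
for lane in range(0, 8, 2):
    narrow += [a[i] + b[i] for i in range(lane, lane + 2)]

# Wide form: one <8 x float> fadd.
wide = [x + y for x, y in zip(a, b)]

assert narrow == wide
```

Since the narrow and wide forms are equivalent and the loads/stores are contiguous, the middle end should be able to perform this widening itself.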
Should we add comments noting that the throughput of the shuffle instructions is lower than that of the fadd/fsub/fmul instructions?