Hello,
Please review the patch that enables memory folding optimization for
sequences like this:
    #include <immintrin.h>

    double mem;

    __m128d func(__m128d a, __m128d b) {
      __m128d m = _mm_load_sd(&mem);
      return _mm_fmadd_sd(a, b, m);
    }
Code without the patch (clang -O3 -S):
    func:                                   # @func
            .cfi_startproc
    # BB#0:                                 # %entry
            movsd   mem(%rip), %xmm2        # xmm2 = mem[0],zero
            vfmadd213sd     %xmm2, %xmm1, %xmm0
            retq
Code with the patch:
    func:                                   # @func
            .cfi_startproc
    # BB#0:                                 # %entry
            vfmadd213sd     mem(%rip), %xmm1, %xmm0
            retq
The load can be folded into the 2nd or 3rd operand of an FMA*_Int instruction.
The newly added test fma-scalar-memfold.ll checks memory folding for both operands.
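
To illustrate why folding the scalar load is safe, here is a small portable C model. It is only a sketch and not part of the patch: the vec2d type and the model_* functions are hypothetical stand-ins for __m128d, _mm_load_sd and _mm_fmadd_sd, and the model uses a separate multiply and add instead of a single-rounding fused operation. The point it shows is that the scalar FMA reads only the low element of its 2nd and 3rd operands, so the zeroed upper half produced by the load never reaches the result.

```c
/* Hypothetical stand-in for __m128d: two doubles, low and high element. */
typedef struct { double lo, hi; } vec2d;

/* Models _mm_load_sd: the low element comes from memory, the high one is zeroed. */
vec2d model_load_sd(const double *p) {
    vec2d v = { *p, 0.0 };
    return v;
}

/* Models _mm_fmadd_sd(a, b, c): low = a.lo * b.lo + c.lo, high copied from a.
   Assumption: separate multiply and add, so rounding may differ from real FMA. */
vec2d model_fmadd_sd(vec2d a, vec2d b, vec2d c) {
    vec2d r = { a.lo * b.lo + c.lo, a.hi };
    return r;
}

/* Folded form, 3rd operand read directly from memory. */
vec2d model_fmadd_sd_mem3(vec2d a, vec2d b, const double *mem) {
    vec2d r = { a.lo * b.lo + *mem, a.hi };
    return r;
}

/* Folded form, 2nd operand read directly from memory. */
vec2d model_fmadd_sd_mem2(vec2d a, const double *mem, vec2d c) {
    vec2d r = { a.lo * *mem + c.lo, a.hi };
    return r;
}
```

In this model, model_fmadd_sd(a, b, model_load_sd(&mem)) and model_fmadd_sd_mem3(a, b, &mem) return the same value for any a, b and mem, and the same holds for the 2nd operand with model_fmadd_sd_mem2.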
lib/Target/X86/X86InstrFMA.td:
Removed the redundant register-to-register moves. Memory folding does not work with those moves.
// TODO: perhaps the register-to-register moves can just be stripped in such/some cases,
// but that is a separate optimization/change-set.
lib/Target/X86/X86InstrInfo.cpp:
Added the FMA*_Int opcodes to the routine isNonFoldablePartialRegisterLoad()
test/CodeGen/X86/fma-scalar-memfold.ll:
New test. Checks that the result of _mm_load_s{s,d}() can be folded into the 2nd or 3rd operand of FMA*_Int.
Thank you,
Slava