These compute the same multiply-add (up to intermediate rounding), but
llvm.fmuladd allows the op to decay into fmul+fadd if no fma instruction
is available. In that case llvm.fma lowers to scalar calls to libm fmaf,
which is a lot slower.
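
As a rough illustration of what the two intrinsics are allowed to return, here is a minimal Python sketch. It is not taken from the patch: the helper names `fused_multiply_add` and `unfused_multiply_add` are hypothetical, and exact rational arithmetic stands in for a hardware fused operation. llvm.fma must produce the single-rounding result, while llvm.fmuladd may produce either one.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused a*b + c: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single, correctly rounded conversion to double


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate fmul followed by fadd: round after each operation."""
    product = a * b  # first rounding
    return product + c  # second rounding


# Inputs chosen so the low bits of the product are lost by the first rounding.
a = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29)

fused = fused_multiply_add(a, a, c)      # keeps the 2**-60 term
unfused = unfused_multiply_add(a, a, c)  # the 2**-60 term is rounded away

print(fused, unfused)
```

The two results differ only in the last bits of the intermediate product, which is why the unfused form is an acceptable fallback when speed matters more than the single-rounding guarantee.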
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Yup, this change makes the generated matmul 8x faster when targeting SSE with no fma instructions.
LGTM, nice finding for SSE!
Note that some of the vector docs guarantee the use of llvm.fma; we probably want to update those as well.