The scalar FMA intrinsics are defined so that the high vector elements are copied from the first source, e.g. (from the Intel manual):
__m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)
Operation:
dst[31:0] := (a[31:0] * b[31:0]) + c[31:0]
dst[127:32] := a[127:32]
dst[MAX:128] := 0
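As a rough illustration of that pseudo-code, here is a minimal plain-C sketch of the low 128 bits. `vec4` and `model_fmadd_ss` are made-up names for this example, not the real intrinsic, and the sketch uses an ordinary multiply and add instead of a fused operation:

```c
/* Hypothetical model for illustration only: four floats stand in for the
   low 128 bits of the vector (element 0 is bits 31:0). */
typedef struct { float e[4]; } vec4;

/* Follows the manual's pseudo-code line by line. This is NOT the real
   _mm_fmadd_ss: it uses a*b + c rather than a fused multiply-add, which
   gives the same result for the small integer inputs used with it here. */
vec4 model_fmadd_ss(vec4 a, vec4 b, vec4 c)
{
    vec4 dst = a;                          /* dst[127:32] := a[127:32] */
    dst.e[0] = a.e[0] * b.e[0] + c.e[0];   /* dst[31:0] := (a * b) + c */
    return dst;                            /* bits above 127 are not modelled */
}
```

With a = {2, 100, 101, 102}, b = {3, 200, 201, 202} and c = {4, 300, 301, 302}, the model returns {10, 100, 101, 102}: only element 0 is computed, and elements 1 to 3 are taken from a.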
The current pattern switches src1 and src2 around (I guess to match the "213" order), which ends up tying the original src2 to the dest.
Since the actual scalar FMA3 instructions copy the high elements from the dest register, the high elements of the result end up coming from the original src2 instead of src1.
This modifies the pattern to leave src1 and src2 in their original order.
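To make the operand-order problem concrete, here is a small plain-C sketch. It assumes the "213" form multiplies src2 by the dest, adds src3, writes the result to element 0 of the dest, and leaves the other dest elements alone. The type and function names are invented for this example and are not taken from the patterns themselves:

```c
/* Hypothetical model for illustration only: four floats stand in for the
   low 128 bits of a vector register (element 0 is bits 31:0). */
typedef struct { float e[4]; } vec4;

/* Assumed behaviour of a scalar "213" FMA3 instruction: the destination
   register is also the first source, so only element 0 is overwritten and
   elements 1..3 keep whatever the destination held before. Uses a*b + c
   rather than a fused operation; the integer inputs used here round the same. */
void model_vfmadd213ss(vec4 *dst_src1, vec4 src2, vec4 src3)
{
    dst_src1->e[0] = src2.e[0] * dst_src1->e[0] + src3.e[0];
}

/* Ordering described for the current pattern: src1 and src2 are swapped,
   so the intrinsic's second argument (b) is the one tied to the dest. */
vec4 lower_swapped(vec4 a, vec4 b, vec4 c)
{
    vec4 dst = b;
    model_vfmadd213ss(&dst, a, c);
    return dst;
}

/* Ordering after the change: src1 and src2 stay in their original order,
   so the intrinsic's first argument (a) is tied to the dest. */
vec4 lower_original_order(vec4 a, vec4 b, vec4 c)
{
    vec4 dst = a;
    model_vfmadd213ss(&dst, b, c);
    return dst;
}
```

With a = {2, 100, 101, 102}, b = {3, 200, 201, 202} and c = {4, 300, 301, 302}, both orderings produce 10 in element 0. The swapped ordering returns {10, 200, 201, 202}, while the original ordering returns {10, 100, 101, 102}, which is what the intrinsic's definition asks for.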
Please add a comment explaining that you use 1-2-3 instead of 2-1-3 because src1 is tied to dest.