For example, _mm_maskz_fmadd_sd would generate the following assembly:
vmovapd 48(%rsp), %xmm1
vmovapd 32(%rsp), %xmm2
vmovapd 16(%rsp), %xmm0
kmovw %eax, %k1
vfmadd231sd %xmm2, %xmm1, %xmm0 {%k1} {z} # xmm0 = (xmm1 * xmm2) + xmm0
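For reference, a minimal reproducer along these lines (the function name is hypothetical; compile with -mavx512f and inspect the output) is:

#include <immintrin.h>

/* Hypothetical reproducer: wrapping the intrinsic in a function forces the
   masked scalar FMA to be emitted; note which register supplies bits
   [127:64] of the result. */
__m128d test_maskz_fmadd_sd(__mmask8 k, __m128d a, __m128d b, __m128d c) {
    return _mm_maskz_fmadd_sd(k, a, b, c);
}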
In some cases this is instead optimized to the 213 form:
vmovapd 48(%rsp), %xmm0
vmovapd 32(%rsp), %xmm1
vmovapd 16(%rsp), %xmm2
kmovw %eax, %k1
vfmadd213sd %xmm2, %xmm1, %xmm0 {%k1} {z} # xmm0 = (xmm1 * xmm0) + xmm2
The upper 64 bits of the result aren't right: the intrinsic is defined to copy bits [127:64] from its first argument a, but a scalar FMA instruction passes bits [127:64] of its destination register through unchanged, and once the operands are commuted into a different FMA form the destination register no longer holds a.
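For comparison, here is a scalar model of the documented semantics from the Intel intrinsics guide (a sketch for illustration, not the actual implementation): the low element is k[0] ? a*b + c : 0, and bits [127:64] are copied from a.

#include <immintrin.h>

/* Reference model of _mm_maskz_fmadd_sd per its documented semantics. */
static __m128d maskz_fmadd_sd_ref(__mmask8 k, __m128d a, __m128d b, __m128d c) {
    double av[2], bv[2], cv[2], r[2];
    _mm_storeu_pd(av, a);
    _mm_storeu_pd(bv, b);
    _mm_storeu_pd(cv, c);
    r[0] = (k & 1) ? av[0] * bv[0] + cv[0] : 0.0;
    r[1] = av[1];   /* bits [127:64] must come from a */
    return _mm_loadu_pd(r);
}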