On BtVer2, conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit.
Only a minority of the uOPs are executed by the FPU. The observed behaviour on the FPU looks similar to this:
- The input MASK value is moved to the integer unit (a VMOVMSK-like uOP, executed on JFPU0).
- In parallel, each element of the input XMM/YMM is extracted and then sent to the integer unit through JFPU1.
As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.
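The per-element "LOAD - BIT_EXTRACT - CMOV" behaviour described above can be sketched in C. This is a hedged illustration of what the microcode appears to do, not a description of the actual uOP sequence; the function name, element type, and mask layout (one bit per element, as produced by the VMOVMSK-like uOP) are assumptions for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: emulate a masked SIMD store element-wise.
 * 'mask_bits' is assumed to hold one bit per packed element, as a
 * VMOVMSK-like uOP would produce. */
static void masked_store_emulated(int32_t *dst, const int32_t *src,
                                  unsigned mask_bits, size_t nelems) {
    for (size_t i = 0; i < nelems; ++i) {
        int32_t old = dst[i];                /* speculative LOAD          */
        unsigned keep = (mask_bits >> i) & 1u; /* BIT_EXTRACT             */
        int32_t merged = keep ? src[i] : old;  /* CMOV-like select        */
        dst[i] = merged;                     /* store the merged element  */
    }
}
```

Note that the sketch merges the old value back in and stores unconditionally, which is one way a load-plus-select sequence can realize a conditional store without a predicated store uOP.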
VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (which also explains the large number of non-FP uOPs retired).
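One plausible reading of these observations can be sketched as follows: load each destination quadword once, merge the selected source bytes into it with shifts and masks, and store the merged quadword back. The function name is hypothetical, and the byte-granularity mask (store byte i iff bit 7 of mask[i] is set, as MASKMOVDQU semantics specify) plus the little-endian byte layout are the assumptions this example relies on.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: emulate MASKMOVDQU-style byte-masked
 * stores using exactly two quadword loads. Assumes a little-endian
 * host, so byte b of the uint64_t maps to dst[8*q + b]. */
static void maskmovdqu_emulated(uint8_t *dst, const uint8_t src[16],
                                const uint8_t mask[16]) {
    for (int q = 0; q < 2; ++q) {
        uint64_t merged;
        memcpy(&merged, dst + 8 * q, 8);   /* speculative quadword LOAD */
        for (int b = 0; b < 8; ++b) {
            int i = 8 * q + b;
            if (mask[i] & 0x80) {          /* sign bit selects the byte */
                uint64_t lane = 0xFFull << (8 * b);
                /* shift + mask merge of the source byte into the lane */
                merged = (merged & ~lane) | ((uint64_t)src[i] << (8 * b));
            }
        }
        memcpy(dst + 8 * q, &merged, 8);   /* store the merged quadword */
    }
}
```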