While the packed single-precision domain has the smallest encodings and
was the first domain introduced, it isn't a very good default. Consider
operations that can occur in any domain: loads, stores, shuffles, and,
or, xor.
All of these operations have the same latency in the integer domain as
in the floating-point domains, but in many cases have 2x or 3x the
throughput in the integer domain! In the floating-point domains, they
end up bottlenecked on a single execution port on every
microarchitecture since Sandy Bridge, and probably some older ones as
well.
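
As a rough illustration of "the same operation in any domain" (a sketch
for this discussion, not code from the patch), the three helpers below
compute the same 128-bit xor using the packed-single, packed-double, and
integer intrinsics. The helper names are made up, and it assumes an x86
target with SSE2. Note that the intrinsic only suggests an encoding: the
backend is free to pick xorps, xorpd, or pxor for any of them, which is
exactly the choice under discussion here.

```c
#include <emmintrin.h> /* SSE2: __m128, __m128d, __m128i intrinsics */
#include <stdint.h>

/* Hypothetical helpers: each XORs two 128-bit values bit for bit, but is
   written with the intrinsic for a different execution domain. */

/* Packed-single form; nominally xorps. */
static void xor_ps(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
    __m128 va = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)a));
    __m128 vb = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(_mm_xor_ps(va, vb)));
}

/* Packed-double form; nominally xorpd. */
static void xor_pd(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
    __m128d va = _mm_castsi128_pd(_mm_loadu_si128((const __m128i *)a));
    __m128d vb = _mm_castsi128_pd(_mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, _mm_castpd_si128(_mm_xor_pd(va, vb)));
}

/* Integer form; nominally pxor. */
static void xor_int(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));
}
```

All three produce identical bits for any input; only the encoding, and
therefore the set of execution ports that can service the instruction,
differs.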

This in turn uncovers some issues with our execution domain settings in
the backend. I have a patch for one of them that is an independent
improvement, and I'll submit it shortly: it adds an execution domain to
movss and movsd, both of which specifically target the floating-point
domains. With that in place, we already match up floating-point domain
code much better.

A second issue I'm investigating is a vinsertf128 next to a
packed-double domain instruction being turned into vinserti128.

Aside from these issues, everything I'm seeing looks like a huge
improvement in domain crossing behavior and in generally using the
higher-throughput integer units. What do others think?

I'll update this patch with the test changes if folks want, but making
this change will require *MANY* test updates.