This is a follow-up to D28759 and, together with that commit, fixes
all of the benchmarks that regressed due to rL278321 (or at least
appears to, pending another look at the benchmarks), while keeping the
performance enhancements in cases where rL278321 was beneficial.
Prior to this commit, the analysis would simply ignore any function
calls in the clearance calculation, causing incorrect results after
any function call (the benchmarks that regressed with rL278321 just
happened to pick a register that was worse than the xmm0 default).
With this patch, we kill clearance for all registers when a function
call is encountered.
Similarly, we kill clearance at function entry. This is by far the more
disruptive of the two cases, but it is necessary to avoid a 2x penalty in
some common cases. The most obvious case where this happens is calling a
small-ish non-inlined function in a loop. If the function uses xmm
registers that are not live-ins, it is likely to fall into this performance
trap if we don't consider clearance to be small on live-ins.
This is obviously more pessimistic than reality in a lot of cases. However,
the combination of the immense penalty for not having the
dependency-breaking instruction (3-5x), together with the fact that these
instructions are extremely cheap (they are special-cased in the decoder,
AFAIK, so they don't even take up an execution unit), makes this trade-off
worthwhile.
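To illustrate the clearance handling described above, here is a rough toy
model in Python. It is only a sketch, not the actual pass: the instruction
encoding, the register list, the FAR sentinel, the THRESHOLD value and all
function names are made up for illustration. "Clearance" is modeled as the
number of instructions since a register was last (assumed to be) written,
and "killing" clearance resets it to 0.

```python
# Toy model of per-register clearance tracking. All names and constants
# here are hypothetical and do not correspond to the real implementation.
REGS = ("xmm0", "xmm1", "xmm2")
FAR = 1 << 20      # stands in for "no recent write known"
THRESHOLD = 4      # illustrative cutoff below which we would break the dependency


def clearance_before_each(instrs, kill_at_entry=True, kill_at_calls=True):
    """Return one {reg: clearance} snapshot per instruction, taken just
    before that instruction executes."""
    start = 0 if kill_at_entry else FAR
    clear = {r: start for r in REGS}
    snapshots = []
    for kind, reg in instrs:
        snapshots.append(dict(clear))
        # Every register moves one instruction further from its last write.
        for r in clear:
            clear[r] = min(clear[r] + 1, FAR)
        if kind == "def":
            clear[reg] = 0          # a real write to this register
        elif kind == "call" and kill_at_calls:
            for r in clear:         # the callee may have written anything
                clear[r] = 0
    return snapshots


def needs_break(clearance, threshold=THRESHOLD):
    """True if a dependency-breaking instruction would be inserted."""
    return clearance < threshold


prog = [("def", "xmm1"), ("other", None), ("call", None),
        ("other", None), ("other", None)]
new = clearance_before_each(prog)                       # behaviour with this patch
old = clearance_before_each(prog, kill_at_entry=False,  # calls and entry ignored
                            kill_at_calls=False)
```

In this model, the old behaviour reports a large clearance for xmm0 after the
call (so no dependency break would be inserted), whereas the new behaviour
reports a small one for every register, both right after the call and at
function entry.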