Disable the vzeroupper insertion pass on PS4
ClosedPublic

Authored by ygao on Feb 2 2016, 7:32 PM.

Details

Summary

Hi,
This patch re-implements the work to disable the vzeroupper insertion pass
on PS4 based on review feedback from Hal and Sean.

I am not sure whether there are other processors that behave like Jaguar
when it comes to writing YMM registers.

Diff Detail

Repository
rL LLVM

Event Timeline

ygao updated this revision to Diff 46733. Feb 2 2016, 7:32 PM
ygao retitled this revision to Disable the vzeroupper insertion pass on PS4.
ygao updated this object.
ygao added a reviewer: hfinkel.
ygao added subscribers: silvas, llvm-commits.
hfinkel accepted this revision. Feb 2 2016, 8:22 PM
hfinkel edited edge metadata.

LGTM.

This revision is now accepted and ready to land. Feb 2 2016, 8:22 PM
silvas accepted this revision. Feb 2 2016, 9:53 PM
silvas added a reviewer: silvas.

LGTM.

As long as the consequence of running such code on a non-btver2 CPU is merely performance, not correctness.
I seem to remember that being a concern in the first attempt at turning off vzeroupper, years ago. Something about the consistency of behavior of code in a library, IIRC, when caller and callee were compiled for different CPUs and did not have the same concept of whether the upper parts had been zeroed. Sorry I don't remember the specifics better than that, and I certainly don't know enough about the microarchitectural details to say one way or the other.

My understanding is that this should only affect performance.

The problem arises when you mix legacy SSE instructions with AVX instructions. Legacy SSE instructions do not affect the upper 128 bits of the YMM registers. This may cause false dependencies due to partial register writes.

So, if a library is built for a non-AVX CPU (or if the library cannot avoid using legacy SSE code), the absence of vzeroupper in the code can cause stalls due to false dependencies (when there is an AVX-SSE transition).
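
The register semantics described above can be sketched with a toy C++ model. This is an illustration only, not LLVM code and not a hardware simulator: the `Ymm`/`Xmm` types and all function names are made up for this example, and a single register is modeled as eight float lanes. The one detail added beyond the comment above is that VEX-encoded 128-bit instructions zero the upper half of their destination.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model (hypothetical names): one 256-bit YMM register as eight float
// lanes. Lanes 0-3 are the low (XMM) half, lanes 4-7 are the upper 128 bits.
using Ymm = std::array<float, 8>;
using Xmm = std::array<float, 4>;

// Legacy SSE write: replaces the low half and preserves the upper half.
// The result needs the old register value, hence the false dependency.
Ymm legacySseWrite(const Ymm &old, const Xmm &low) {
  Ymm result = old;
  for (std::size_t i = 0; i < 4; ++i)
    result[i] = low[i];
  return result;
}

// VEX-encoded 128-bit write: replaces the low half and zeroes the upper
// half, so the result does not depend on the old register value.
Ymm vexWrite128(const Xmm &low) {
  Ymm result{}; // value-initialized: every lane starts at 0.0f
  for (std::size_t i = 0; i < 4; ++i)
    result[i] = low[i];
  return result;
}

// vzeroupper, restricted to one register: keep the low half, zero the rest.
Ymm vzeroupper(const Ymm &old) {
  Ymm result = old;
  for (std::size_t i = 4; i < 8; ++i)
    result[i] = 0.0f;
  return result;
}

// True when the upper 128 bits hold only zeros.
bool upperIsZero(const Ymm &r) {
  for (std::size_t i = 4; i < 8; ++i)
    if (r[i] != 0.0f)
      return false;
  return true;
}

// Small demonstration of the three cases.
void demoPartialWrite() {
  const Ymm dirty = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
  const Xmm low = {9.0f, 9.0f, 9.0f, 9.0f};
  assert(!upperIsZero(legacySseWrite(dirty, low)));            // stale upper half survives
  assert(upperIsZero(legacySseWrite(vzeroupper(dirty), low))); // cleared before the SSE code
  assert(upperIsZero(vexWrite128(low)));                       // VEX form clears it itself
}
```

In this model only `legacySseWrite` takes the old register as an input, which is the data dependency the comment refers to. Whether that dependency costs anything on a given CPU is a separate question the model does not try to answer.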

On AMD Fam 15h processors (and btver2) there is no penalty for AVX-SSE transitions. This is an important difference from Intel processors where, for each SSE-AVX transition, the hardware saves and restores the upper 128 bits of the YMM registers. I think that is the reason why vzeroupper is very fast on Intel, while on btver2 it is microcoded (and extremely slow!).
Also, AMD processors since Fam 15h implement an XMM register merge optimization: the hardware keeps track of XMM registers whose upper portions have been cleared to zero.

ygao added a comment. Feb 3 2016, 11:16 AM

I definitely remember there was some concern (or incident?) over correctness,
and it involved some library. Unfortunately I cannot recall the details.

In this patch I was setting the feature bit on btver2, but it probably also
applies to bdver[2..4].
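
As a rough sketch of what "setting the feature bit on btver2" could look like, the snippet below gates a pass on a per-CPU flag. Everything here is hypothetical: `ToySubtarget`, `skipVZeroUpper`, `describeCpu` and `shouldInsertVZeroUpper` are names invented for this example and do not come from the patch or from LLVM, and the CPU table is only an assumption for illustration (it does not add bdver[2..4], since that is left open above).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a subtarget description; not LLVM's real class.
struct ToySubtarget {
  bool hasAVX;         // CPU supports AVX at all
  bool skipVZeroUpper; // hypothetical feature bit: vzeroupper is not worth inserting
};

// Illustrative CPU table (assumption): only btver2 sets the bit.
ToySubtarget describeCpu(const std::string &cpu) {
  if (cpu == "btver2")
    return {true, true};
  if (cpu == "haswell")
    return {true, false};
  return {false, false}; // treat anything else as a non-AVX CPU
}

// The pass only has work to do when AVX is available and the CPU has not
// opted out through the feature bit.
bool shouldInsertVZeroUpper(const ToySubtarget &st) {
  return st.hasAVX && !st.skipVZeroUpper;
}

// Small demonstration of the gate.
void demoGate() {
  assert(!shouldInsertVZeroUpper(describeCpu("btver2")));
  assert(shouldInsertVZeroUpper(describeCpu("haswell")));
}
```

Keying the decision on a feature bit rather than on the CPU name means that extending it to other processors would be a one-line change in the table.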

This revision was automatically updated to reflect the committed changes.