This is an archive of the discontinued LLVM Phabricator instance.

Disable the vzeroupper insertion pass on PS4
Closed, Public

Authored by ygao on Feb 2 2016, 7:32 PM.

Details

Reviewers

silvas
hfinkel

Commits

rG0de36ec169b8: Disable the vzeroupper insertion pass on PS4.
rL260764: Disable the vzeroupper insertion pass on PS4.

Summary

Hi,
This patch re-implements the work to disable the vzeroupper insertion pass
on PS4 based on review feedback from Hal and Sean.

I am not sure whether there are other processors that behave like Jaguar
when it comes to writing YMM registers.

Diff Detail

Repository: rL LLVM

Event Timeline

ygao updated this revision to Diff 46733. Feb 2 2016, 7:32 PM

ygao retitled this revision to "Disable the vzeroupper insertion pass on PS4".

ygao updated this object.

ygao added a reviewer: hfinkel.

ygao added subscribers: silvas, llvm-commits.

LGTM.

This revision is now accepted and ready to land. Feb 2 2016, 8:22 PM

LGTM.

As long as the consequence of running such code on a non-btver2 CPU is merely performance, not correctness.
I seem to remember that being a concern in the first attempt at turning off vzeroupper, years ago. Something about the consistency of behavior of code in a library, IIRC, when caller and callee were compiled for different CPUs and did not have the same concept of whether the upper parts had been zeroed. Sorry I don't remember the specifics better than that, and I certainly don't know enough about the microarchitectural details to say one way or the other.

In D16837#343006, @probinson wrote:

> As long as the consequence of running such code on a non-btver2 CPU is merely performance, not correctness.
> I seem to remember that being a concern in the first attempt at turning off vzeroupper, years ago. Something about the consistency of behavior of code in a library, IIRC, when caller and callee were compiled for different CPUs and did not have the same concept of whether the upper parts had been zeroed. Sorry I don't remember the specifics better than that, and I certainly don't know enough about the microarchitectural details to say one way or the other.

My understanding is that this should only affect performance.

The problem arises when you mix legacy SSE instructions with AVX instructions. Legacy SSE instructions do not modify the upper 128 bits of the YMM registers, which may cause false dependencies due to partial register writes.

So, if a library is built for a non-AVX CPU (or if the library cannot avoid using legacy SSE code), the absence of vzeroupper in the code can cause stalls due to false dependencies (when there is an AVX-SSE transition).

On AMD Fam 15h processors (and btver2) there is no penalty for AVX-SSE transitions. This is an important difference from Intel processors, where, for each SSE-AVX transition, the hardware saves and restores the upper 128 bits of the YMM registers. I think that is the reason why vzeroupper is very fast on Intel, while on btver2 it is microcoded (and extremely slow!).
Also, since Fam 15h, AMD processors implement an XMM register merge optimization: the hardware keeps track of XMM registers whose upper portions have been cleared to zero.
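
To make the contrast described in this comment concrete, here is a toy C++ state machine. It is only a sketch under assumptions: the three states (clean, dirty, saved), the `CoreModel` struct, and the function name are hypothetical and simply encode the "save and restore the upper 128 bits on each transition" behavior described above. It does not measure anything and is not cycle-accurate for any real CPU.

```cpp
#include <cassert>
#include <vector>

// Instruction classes the toy model distinguishes.
enum class Op { LegacySSE, AVX256, VZeroUpper };

// State of the upper YMM halves in the toy model.
enum class UpperState { Clean, Dirty, Saved };

// Hypothetical core description; the field name is illustrative only.
struct CoreModel {
  // true: the core saves/restores the upper halves on SSE<->AVX transitions
  // (the Intel-like behavior described in the comment above).
  // false: no transition penalty (the Fam 15h / btver2 behavior described).
  bool SavesUpperOnTransition;
};

// Count the save/restore events a sequence of instructions would trigger.
int countTransitionPenalties(const CoreModel &Core,
                             const std::vector<Op> &Ops) {
  if (!Core.SavesUpperOnTransition)
    return 0;
  UpperState State = UpperState::Clean;
  int Penalties = 0;
  for (Op O : Ops) {
    switch (O) {
    case Op::AVX256:
      if (State == UpperState::Saved)
        ++Penalties; // restore the previously saved upper halves
      State = UpperState::Dirty;
      break;
    case Op::LegacySSE:
      if (State == UpperState::Dirty) {
        ++Penalties; // save the dirty upper halves
        State = UpperState::Saved;
      }
      break;
    case Op::VZeroUpper:
      State = UpperState::Clean; // upper halves known to be zero
      break;
    }
  }
  return Penalties;
}
```

In this model, the sequence AVX256, LegacySSE costs one penalty on the Intel-like core and none once a VZeroUpper is placed between the two. On the core without transition penalties the count is always zero, which is why inserting an expensive vzeroupper buys nothing there.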

I definitely remember there was some concern (or incident?) over correctness,
and it involved some library. Unfortunately I cannot recall the details.

In this patch I was setting the feature bit on btver2, but it probably also
applies to bdver[2..4].

Closed by commit rL260764: Disable the vzeroupper insertion pass on PS4. (authored by ygao). Feb 12 2016, 3:42 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

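Path                                             Size
llvm/trunk/lib/Target/X86/X86.td                 8 lines
llvm/trunk/lib/Target/X86/X86Subtarget.h         5 lines
llvm/trunk/lib/Target/X86/X86Subtarget.cpp       1 line
llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp      2 lines
llvm/trunk/test/CodeGen/X86/avx-vzeroupper.ll    5 lines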

Diff 47867

llvm/trunk/lib/Target/X86/X86.td

@@ ... @@ def FeatureLEAUsesAG : SubtargetFeature<"lea-uses-ag", "LEAUsesAG", "true",
                                    "LEA instruction needs inputs at AG stage">;
 def FeatureSlowLEA : SubtargetFeature<"slow-lea", "SlowLEA", "true",
                                    "LEA instruction with certain arguments is slow">;
 def FeatureSlowIncDec : SubtargetFeature<"slow-incdec", "SlowIncDec", "true",
                                    "INC and DEC instructions are slower than ADD and SUB">;
 def FeatureSoftFloat
     : SubtargetFeature<"soft-float", "UseSoftFloat", "true",
                        "Use software floating point features.">;
+// On at least some AMD processors, there is no performance hazard to writing
+// only the lower parts of a YMM register without clearing the upper part.
+def FeatureFastPartialYMMWrite
+    : SubtargetFeature<"fast-partial-ymm-write", "HasFastPartialYMMWrite",
+                       "true", "Partial writes to YMM registers are fast">;

 //===----------------------------------------------------------------------===//
 // X86 processors supported.
 //===----------------------------------------------------------------------===//

 include "X86Schedule.td"

 def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
@@ ... @@ def : ProcessorModel<"btver2", BtVer2Model, [
   FeatureBMI,
   FeatureF16C,
   FeatureMOVBE,
   FeatureLZCNT,
   FeaturePOPCNT,
   FeatureXSAVE,
   FeatureXSAVEOPT,
   FeatureSlowSHLD,
-  FeatureLAHFSAHF
+  FeatureLAHFSAHF,
+  FeatureFastPartialYMMWrite
 ]>;

 // Bulldozer
 def : Proc<"bdver1", [
   FeatureXOP,
   FeatureFMA4,
   FeatureCMPXCHG16B,
   FeatureAES,

llvm/trunk/lib/Target/X86/X86Subtarget.h

@@ ... @@ protected:
   /// True if this processor has the CMPXCHG16B instruction;
   /// this is true for most x86-64 chips, but not the first AMD chips.
   bool HasCmpxchg16b;

   /// True if the LEA instruction should be used for adjusting
   /// the stack pointer. This is an optimization for Intel Atom processors.
   bool UseLeaForSP;

+  /// True if there is no performance penalty to writing only the lower parts
+  /// of a YMM register without clearing the upper part.
+  bool HasFastPartialYMMWrite;
+
   /// True if 8-bit divisions are significantly faster than
   /// 32-bit divisions and should be used when possible.
   bool HasSlowDivide32;

   /// True if 16-bit divides are significantly faster than
   /// 64-bit divisions and should be used when possible.
   bool HasSlowDivide64;

@@ ... @@ public:
   bool hasLAHFSAHF() const { return HasLAHFSAHF; }
   bool isBTMemSlow() const { return IsBTMemSlow; }
   bool isSHLDSlow() const { return IsSHLDSlow; }
   bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }
   bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }
   bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }
   bool hasCmpxchg16b() const { return HasCmpxchg16b; }
   bool useLeaForSP() const { return UseLeaForSP; }
+  bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }
   bool hasSlowDivide32() const { return HasSlowDivide32; }
   bool hasSlowDivide64() const { return HasSlowDivide64; }
   bool padShortFunctions() const { return PadShortFunctions; }
   bool callRegIndirect() const { return CallRegIndirect; }
   bool LEAusesAG() const { return LEAUsesAG; }
   bool slowLEA() const { return SlowLEA; }
   bool slowIncDec() const { return SlowIncDec; }
   bool hasCDI() const { return HasCDI; }

llvm/trunk/lib/Target/X86/X86Subtarget.cpp

@@ ... @@ void X86Subtarget::initializeEnvironment() {
   HasMPX = false;
   IsBTMemSlow = false;
   IsSHLDSlow = false;
   IsUAMem16Slow = false;
   IsUAMem32Slow = false;
   HasSSEUnalignedMem = false;
   HasCmpxchg16b = false;
   UseLeaForSP = false;
+  HasFastPartialYMMWrite = false;
   HasSlowDivide32 = false;
   HasSlowDivide64 = false;
   PadShortFunctions = false;
   CallRegIndirect = false;
   LEAUsesAG = false;
   SlowLEA = false;
   SlowIncDec = false;
   stackAlignment = 4;
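
The three files above wire one new boolean through the subtarget: it defaults to false in initializeEnvironment(), and the "fast-partial-ymm-write" feature string (set by -mattr or implied by -mcpu=btver2) turns it on. The sketch below imitates that flow in standalone C++. In LLVM the real parsing code is generated by TableGen from X86.td; `FeatureBits` and `parseFeatures` are hypothetical stand-ins written for illustration, and only two features are modeled.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical, reduced stand-in for the subtarget's feature booleans.
struct FeatureBits {
  bool HasAVX = false;                 // modeled for context only
  bool HasFastPartialYMMWrite = false; // default, as in initializeEnvironment()
};

// Parse a comma-separated attribute string such as
// "+avx,+fast-partial-ymm-write". A leading '+' enables a feature and a
// leading '-' disables it; unknown names are ignored in this sketch.
FeatureBits parseFeatures(const std::string &Attrs) {
  FeatureBits Bits;
  std::istringstream In(Attrs);
  std::string Tok;
  while (std::getline(In, Tok, ',')) {
    if (Tok.size() < 2)
      continue;
    const bool Enable = Tok[0] == '+';
    if (!Enable && Tok[0] != '-')
      continue;
    const std::string Name = Tok.substr(1);
    if (Name == "avx")
      Bits.HasAVX = Enable;
    else if (Name == "fast-partial-ymm-write")
      Bits.HasFastPartialYMMWrite = Enable;
  }
  return Bits;
}
```

With this model, `parseFeatures("+avx")` leaves `HasFastPartialYMMWrite` false, while `parseFeatures("+avx,+fast-partial-ymm-write")` sets it, matching the two llc configurations exercised by the updated test.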

llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp

@@ ... @@ void VZeroUpperInserter::processBasicBlock(MachineBasicBlock &MBB) {

   BlockStates[MBB.getNumber()].ExitState = CurState;
 }

 /// runOnMachineFunction - Loop over all of the basic blocks, inserting
 /// vzeroupper instructions before function calls.
 bool VZeroUpperInserter::runOnMachineFunction(MachineFunction &MF) {
   const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
-  if (!ST.hasAVX() || ST.hasAVX512())
+  if (!ST.hasAVX() || ST.hasAVX512() || ST.hasFastPartialYMMWrite())
     return false;
   TII = ST.getInstrInfo();
   MachineRegisterInfo &MRI = MF.getRegInfo();
   EverMadeChange = false;

   bool FnHasLiveInYmm = checkFnHasLiveInYmm(MRI);

   // Fast check: if the function doesn't use any ymm registers, we don't need
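
The changed early exit in runOnMachineFunction can be read as a single predicate over three subtarget queries. The following standalone sketch restates it outside LLVM; `SubtargetModel` and `vzeroupperPassRuns` are hypothetical names introduced here, and the fields only stand in for `hasAVX()`, `hasAVX512()`, and `hasFastPartialYMMWrite()`.

```cpp
#include <cassert>

// Hypothetical stand-in for the three X86Subtarget queries used by the check.
struct SubtargetModel {
  bool HasAVX = false;
  bool HasAVX512 = false;
  bool HasFastPartialYMMWrite = false;
};

// Negation of the patched early-exit condition
//   !hasAVX() || hasAVX512() || hasFastPartialYMMWrite()
// so the pass proceeds only when this returns true.
bool vzeroupperPassRuns(const SubtargetModel &ST) {
  return ST.HasAVX && !ST.HasAVX512 && !ST.HasFastPartialYMMWrite;
}
```

A plain AVX subtarget still gets the pass, while an AVX subtarget with the new feature bit set (as btver2 is after this patch) skips it, which is what the FASTYMM-NOT and BTVER2-NOT checks in the test assert.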

llvm/trunk/test/CodeGen/X86/avx-vzeroupper.ll

 ; RUN: llc < %s -x86-use-vzeroupper -mtriple=x86_64-apple-darwin -mattr=+avx | FileCheck %s
+; RUN: llc < %s -x86-use-vzeroupper -mtriple=x86_64-apple-darwin -mattr=+avx,+fast-partial-ymm-write | FileCheck --check-prefix=FASTYMM %s
+; RUN: llc < %s -x86-use-vzeroupper -mtriple=x86_64-apple-darwin -mcpu=btver2 | FileCheck --check-prefix=BTVER2 %s
+
+; FASTYMM-NOT: vzeroupper
+; BTVER2-NOT: vzeroupper

 declare i32 @foo()
 declare <4 x float> @do_sse(<4 x float>)
 declare <8 x float> @do_avx(<8 x float>)
 declare <4 x float> @llvm.x86.avx.vextractf128.ps.256(<8 x float>, i8) nounwind readnone
 @x = common global <4 x float> zeroinitializer, align 16
 @g = common global <8 x float> zeroinitializer, align 32