This is an archive of the discontinued LLVM Phabricator instance.

Differential D30177

[ExecutionDepsFix] Recognize existing dep breaks
Needs Review · Public

Authored by loladiro on Feb 20 2017, 11:29 AM.

Details

Reviewers

myatsina

Summary

Teach ExecutionDepsFix to recognize instructions that already break
register dependencies. This is done in preparation for being
more conservative in clearance assumptions at function entry and after
function calls. In that situation, this commit gives us two benefits:

1. It reduces the number of inserted dependency breaks by reusing those that already exist. (At the moment we assume all registers have significant clearance at function entry, so basically any unused register can be used for undef reads. This will no longer be the case after the change mentioned above.)

2. It provides a simple way to test the clearance calculation code. Right now, those tests assume that all registers have large clearance at function entry. Further, while there is a way to forcibly clobber register clearance (e.g. by using inline assembly), without this change there is no easy way to reset it. This commit provides one, since LLVM materializes constant zeros in registers using dependency-breaking instructions. E.g. the LLVM IR `fcmp ult double %x, 0.0` will force such an instruction (assuming that %x is unknown).

Diff Detail

Build Status

Buildable 4132
Build 4132: arc lint + arc unit

Event Timeline

loladiro created this revision. Feb 20 2017, 11:29 AM

loladiro mentioned this in D28915: [ExecutionDepsFix] Optimize instruction insertion. Feb 20 2017, 11:31 AM

myatsina added inline comments. Feb 27 2017, 7:24 AM

lib/Target/X86/X86InstrInfo.cpp
8271	According to the guide, xorps and xorpd cannot break dependencies on Pentium 4, so this information should be added too.
test/CodeGen/X86/break-false-dep.ll
339	This assumption seems very fragile; a vector zero initializer should also create a xor and would probably be more robust. I'm starting to wonder if this test should be written in MIR instead. Then we would not depend on codegen optimizations that create xors, and we could easily test all the additional dependency-breaking instructions, not just xor.
349	What happens for this function?

	define double @recognize_existing(i64 %arg) {
	  ; Mark all regs as "used" and thus having same clearance
	  tail call void asm sideeffect "", "~{xmm0},...,~{xmm15},~{dirflag},~{fpsr},~{flags}"()
	  %tmp1 = sitofp i64 %arg to double
	  %tmp2 = sitofp i64 %arg to double
	  %tmp3 = fadd double %tmp1, %tmp2
	  ret %tmp3
	}

	I expect a xor for some xmm will be added before the first register. Will we see an additional xor before the second function? Meaning, does this optimization take into account the xors we add to break dependencies, and thus recalculate their clearance?

Do you have a new revision for this change?

Do you have any update on this change?

Revision Contents

Path                                     Size
include/llvm/Target/TargetInstrInfo.h    10 lines
lib/CodeGen/ExecutionDepsFix.cpp         8 lines
lib/Target/X86/X86InstrInfo.h            1 line
lib/Target/X86/X86InstrInfo.cpp          45 lines
test/CodeGen/X86/break-false-dep.ll      22 lines
test/CodeGen/X86/known-bits-vector.ll    2 lines
test/CodeGen/X86/vec_int_to_fp.ll        5 lines

Diff 89138

include/llvm/Target/TargetInstrInfo.h

@@ public:
   /// cvtsi2ss %rbx, %xmm0
   ///
   /// An <imp-kill> operand should be added to MI if an instruction was
   /// inserted. This ties the instructions together in the post-ra scheduler.
   ///
   virtual void breakPartialRegDependency(MachineInstr &MI, unsigned OpNum,
                                          const TargetRegisterInfo *TRI) const {}

+  /// If the instruction `MI` is a dependency breaking instruction, return
+  /// the register number for which this instruction is dependency breaking.
+  /// This function may conservatively return an empty Optional even if MI is
+  /// dependency breaking (resulting in at worst an unnecessary dependency break
+  /// insertion), but should always return an empty Optional when MI is not
+  /// dependency breaking.
+  virtual Optional<unsigned> getDependencyBreakReg(MachineInstr &MI) const {
+    return Optional<unsigned>{};
+  }
+
   /// Create machine specific model for scheduling.
   virtual DFAPacketizer *
   CreateTargetScheduleState(const TargetSubtargetInfo &) const {
     return nullptr;
   }

   // Sometimes, it is possible for the target
   // to tell, even without aliasing information, that two MIs access different

lib/CodeGen/ExecutionDepsFix.cpp

@@ for (int rx : regIndices(MO.getReg())) {
       // How many instructions since rx was last written?
       LiveRegs[rx].Def = CurInstr;

       // Kill off domains redefined by generic instructions.
       if (Kill)
         kill(rx);
     }
   }
+  Optional<unsigned> DepReg = TII->getDependencyBreakReg(*MI);
+  if (DepReg) {
+    for (int rx : regIndices(DepReg.getValue())) {
+      // This instruction is a pre-existing dependency break, so there are no
+      // clearance issues, reset the counter.
+      LiveRegs[rx].Def = -(1 << 20);
+    }
+  }
   ++CurInstr;
 }

 /// \break Break false dependencies on undefined register reads.
 ///
 /// Walk the block backward computing precise liveness. This is expensive, so we
 /// only do it on demand. Note that the occurrence of undefined register reads
 /// that should be broken is very rare, but when they occur we may have many in
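The effect of the hunk above on clearance can be illustrated with a small standalone model. This is only a sketch, not the pass itself: `LiveReg`, `ToyInstr`, `ClearanceTracker` and `needsBreak` are hypothetical names, and the model assumes that clearance is simply the current instruction index minus the index of the last write, with `-(1 << 20)` standing for "written very long ago".

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the pass's per-register bookkeeping.
struct LiveReg {
  int Def = -(1 << 20); // index of the last write; very old by default
};

// Hypothetical toy instruction: the register it writes, plus the register it
// clears via a dependency-breaking idiom (empty if it is not such an idiom).
struct ToyInstr {
  unsigned DefReg;
  std::optional<unsigned> DepBreakReg;
};

class ClearanceTracker {
  std::vector<LiveReg> LiveRegs;
  int CurInstr = 0;

public:
  explicit ClearanceTracker(unsigned NumRegs) : LiveRegs(NumRegs) {}

  // Mirrors the shape of the patched loop: record the write, then reset the
  // counter if the instruction is itself a dependency break.
  void processDefs(const ToyInstr &I) {
    LiveRegs[I.DefReg].Def = CurInstr;
    if (I.DepBreakReg)
      LiveRegs[*I.DepBreakReg].Def = -(1 << 20);
    ++CurInstr;
  }

  // Number of instructions since Reg was last written.
  int clearance(unsigned Reg) const { return CurInstr - LiveRegs[Reg].Def; }

  // A new break would be inserted only if the clearance is too small.
  bool needsBreak(unsigned Reg, int Required) const {
    return clearance(Reg) < Required;
  }
};
```

In this model, an ordinary write to a register leaves it with a clearance of 1, so a later undef read would need a break, whereas a recognized `xor r, r` leaves it with a very large clearance and no extra break is inserted.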

lib/Target/X86/X86InstrInfo.h

@@ public:

   unsigned
   getPartialRegUpdateClearance(const MachineInstr &MI, unsigned OpNum,
                                const TargetRegisterInfo *TRI) const override;
   unsigned getUndefRegClearance(const MachineInstr &MI, unsigned &OpNum,
                                 const TargetRegisterInfo *TRI) const override;
   void breakPartialRegDependency(MachineInstr &MI, unsigned OpNum,
                                  const TargetRegisterInfo *TRI) const override;
+  Optional<unsigned> getDependencyBreakReg(MachineInstr &MI) const override;

   MachineInstr *foldMemoryOperandImpl(MachineFunction &MF, MachineInstr &MI,
                                       unsigned OpNum,
                                       ArrayRef<MachineOperand> MOs,
                                       MachineBasicBlock::iterator InsertPt,
                                       unsigned Size, unsigned Alignment,
                                       bool AllowCommute) const;

lib/Target/X86/X86InstrInfo.cpp

@@ if (X86::VR128RegClass.contains(Reg)) {
     BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), get(X86::VXORPSrr), XReg)
         .addReg(XReg, RegState::Undef)
         .addReg(XReg, RegState::Undef)
         .addReg(Reg, RegState::ImplicitDefine);
     MI.addRegisterKilled(Reg, TRI, true);
   }
 }

+Optional<unsigned> X86InstrInfo::getDependencyBreakReg(MachineInstr &MI) const {
+  unsigned Opc = MI.getOpcode();
+  switch (Opc) {
+  default:
+    break;
+  // See the Intel Architecture Optimization Reference Manual
+  // Section 3.5.1.8 Clearing Registers and Dependency Breaking Idioms
+  case X86::XOR8rr:
+  case X86::XOR16rr:
+  case X86::XOR32rr:
+  case X86::XOR64rr:
+  case X86::SUB8rr:
+  case X86::SUB16rr:
+  case X86::SUB32rr:
+  case X86::SUB64rr:
+  case X86::XORPSrr:
+  case X86::XORPDrr:
+  case X86::PXORrr:
+  case X86::SUBPSrr:
+  case X86::SUBPDrr:
+  case X86::PSUBBrr:
+  case X86::PSUBWrr:
+  case X86::PSUBDrr:
+  case X86::PSUBQrr:
+  case X86::VXORPSrr:
+  case X86::VXORPDrr:
+  case X86::VPXORrr:
+  case X86::VSUBPSrr:
+  case X86::VSUBPDrr:
+  case X86::VPSUBBrr:
+  case X86::VPSUBWrr:
+  case X86::VPSUBDrr:
+  case X86::VPSUBQrr: {
+    unsigned Reg = X86::NoRegister;
+    for (const MachineOperand &MO : MI.operands()) {
+      if (!MO.isReg() || (Reg != X86::NoRegister && MO.getReg() != Reg))
+        return Optional<unsigned>{};
+      Reg = MO.getReg();
+    }
+    return Optional<unsigned>{Reg};
+  }
+  }
+  return Optional<unsigned>{};
+}
+
 MachineInstr *
 X86InstrInfo::foldMemoryOperandImpl(MachineFunction &MF, MachineInstr &MI,
                                     ArrayRef<unsigned> Ops,
                                     MachineBasicBlock::iterator InsertPt,
                                     int FrameIndex, LiveIntervals *LIS) const {
   // Check switch flag
   if (NoFusing)
     return nullptr;
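The operand check in `getDependencyBreakReg` above can be tried out in isolation with a small standalone model. This is a sketch under stated assumptions, not LLVM code: `ToyOpcode`, `ToyOperand`, `ToyInstr` and `toyDependencyBreakReg` are hypothetical names, only two idiom opcodes are modelled, and the final guard against an instruction with no operands is an addition that the patch does not have.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical opcode set: Xor and Sub stand for the zeroing idioms,
// Add for an opcode that never breaks a dependency.
enum class ToyOpcode { Xor, Sub, Add };

struct ToyOperand {
  bool IsReg;   // false models an immediate or memory operand
  unsigned Reg; // meaningful only when IsReg is true
};

struct ToyInstr {
  ToyOpcode Opc;
  std::vector<ToyOperand> Operands;
};

constexpr unsigned NoRegister = 0; // models X86::NoRegister

// Returns the register whose dependency the instruction breaks, or an empty
// optional. An empty result is always a safe (conservative) answer.
std::optional<unsigned> toyDependencyBreakReg(const ToyInstr &MI) {
  if (MI.Opc != ToyOpcode::Xor && MI.Opc != ToyOpcode::Sub)
    return std::nullopt;

  unsigned Reg = NoRegister;
  for (const ToyOperand &MO : MI.Operands) {
    // Every operand must be a register, and all of them the same register.
    if (!MO.IsReg || (Reg != NoRegister && MO.Reg != Reg))
      return std::nullopt;
    Reg = MO.Reg;
  }
  // Extra guard, not in the patch: an instruction without operands
  // names no register at all.
  if (Reg == NoRegister)
    return std::nullopt;
  return Reg;
}
```

With this model, `xor r5, r5, r5` yields register 5, while `xor r5, r5, r6`, an `add` with identical operands, or a `xor` with a non-register operand all yield an empty result.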

test/CodeGen/X86/break-false-dep.ll

@@ ;AVX-NOT: %xmm6
   %outptr = getelementptr double, double* %y, i64 %prev_j
   store double %div, double* %outptr, align 8
   %done = icmp slt i64 %size, %nexti
   br i1 %done, label %loopdone, label %loop

 loopdone:
   ret void
 }
+
+; Make sure we recognize pre-existing dependency breaking instructions and
+; re-use them. In `fcmp ult double %x, 0.0`, the `0.0` constant gets
+; materialized as a vxorps
+define double @recognize_existing(double %x, i64 %arg) {
+;AVX-LABEL:@recognize_existing
+  tail call void asm sideeffect "", "~{xmm1},~{xmm2},~{xmm3},~{dirflag},~{fpsr},~{flags}"()
+  tail call void asm sideeffect "", "~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{dirflag},~{fpsr},~{flags}"()
+  tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()
+  tail call void asm sideeffect "", "~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{dirflag},~{fpsr},~{flags}"()
+;AVX: vxorps [[XMM1:%xmm1]], [[XMM1]], [[XMM1]]
+;AVX: vucomisd [[XMM1]], %xmm0
+  %1 = fcmp ult double %x, 0.0
+  br i1 %1, label %main, label %fake
+main:
+;AVX-NOT: vxorps
+;AVX: vcvtsi2sdq {{.*}}, [[XMM1]], {{%xmm[0-9]+}}
+  %tmp1 = sitofp i64 %arg to double
+  ret double %tmp1
+fake:
+  ret double 0.0
+}

test/CodeGen/X86/known-bits-vector.ll

@@
 ; X32-NEXT: popl %ebp
 ; X32-NEXT: retl
 ;
 ; X64-LABEL: knownbits_mask_extract_uitofp:
 ; X64: # BB#0:
 ; X64-NEXT: vpxor %xmm1, %xmm1, %xmm1
 ; X64-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3],xmm0[4,5,6,7]
 ; X64-NEXT: vmovq %xmm0, %rax
-; X64-NEXT: vcvtsi2ssq %rax, %xmm2, %xmm0
+; X64-NEXT: vcvtsi2ssq %rax, %xmm1, %xmm0
 ; X64-NEXT: retq
   %1 = and <2 x i64> %a0, <i64 65535, i64 -1>
   %2 = extractelement <2 x i64> %1, i32 0
   %3 = uitofp i64 %2 to float
   ret float %3
 }

 define <4 x float> @knownbits_insert_uitofp(<4 x i32> %a0, i16 %a1, i16 %a2) nounwind {

test/CodeGen/X86/vec_int_to_fp.ll

@@
 ; VEX-NEXT: vcvtsi2ssq %rax, %xmm2, %xmm0
 ; VEX-NEXT: vaddss %xmm0, %xmm0, %xmm0
 ; VEX-NEXT: .LBB39_6:
 ; VEX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[2,3]
 ; VEX-NEXT: vxorps %xmm1, %xmm1, %xmm1
 ; VEX-NEXT: testq %rax, %rax
 ; VEX-NEXT: js .LBB39_8
 ; VEX-NEXT: # BB#7:
-; VEX-NEXT: vcvtsi2ssq %rax, %xmm2, %xmm1
+; VEX-NEXT: vcvtsi2ssq %rax, %xmm1, %xmm1
 ; VEX-NEXT: .LBB39_8:
 ; VEX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]
 ; VEX-NEXT: retq
 ;
 ; AVX512F-LABEL: uitofp_2i64_to_4f32:
 ; AVX512F: # BB#0:
 ; AVX512F-NEXT: vpextrq $1, %xmm0, %rax
 ; AVX512F-NEXT: vcvtusi2ssq %rax, %xmm1, %xmm1
@@
 define <4 x float> @uitofp_4i64_to_4f32_undef(<2 x i64> %a) {
 ; SSE-LABEL: uitofp_4i64_to_4f32_undef:
 ; SSE: # BB#0:
 ; SSE-NEXT: movdqa %xmm0, %xmm1
 ; SSE-NEXT: testq %rax, %rax
 ; SSE-NEXT: xorps %xmm2, %xmm2
 ; SSE-NEXT: js .LBB41_2
 ; SSE-NEXT: # BB#1:
-; SSE-NEXT: xorps %xmm2, %xmm2
 ; SSE-NEXT: cvtsi2ssq %rax, %xmm2
 ; SSE-NEXT: .LBB41_2:
 ; SSE-NEXT: movd %xmm1, %rax
 ; SSE-NEXT: testq %rax, %rax
 ; SSE-NEXT: js .LBB41_3
 ; SSE-NEXT: # BB#4:
 ; SSE-NEXT: xorps %xmm0, %xmm0
 ; SSE-NEXT: cvtsi2ssq %rax, %xmm0
@@
 ; VEX-NEXT: vcvtsi2ssq %rax, %xmm2, %xmm0
 ; VEX-NEXT: vaddss %xmm0, %xmm0, %xmm0
 ; VEX-NEXT: .LBB41_6:
 ; VEX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[2,3]
 ; VEX-NEXT: vxorps %xmm1, %xmm1, %xmm1
 ; VEX-NEXT: testq %rax, %rax
 ; VEX-NEXT: js .LBB41_8
 ; VEX-NEXT: # BB#7:
-; VEX-NEXT: vcvtsi2ssq %rax, %xmm2, %xmm1
+; VEX-NEXT: vcvtsi2ssq %rax, %xmm1, %xmm1
 ; VEX-NEXT: .LBB41_8:
 ; VEX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]
 ; VEX-NEXT: retq
 ;
 ; AVX512F-LABEL: uitofp_4i64_to_4f32_undef:
 ; AVX512F: # BB#0:
 ; AVX512F-NEXT: vpextrq $1, %xmm0, %rax
 ; AVX512F-NEXT: vcvtusi2ssq %rax, %xmm1, %xmm1