This is an archive of the discontinued LLVM Phabricator instance.

[X86] Memory folding for commutative instructions.
Closed, Public

Authored by RKSimon on Oct 9 2014, 6:04 AM.

Details

Reviewers

spatel
qcolombet
nadav
andreadb

Commits

rG77ac26d27989: [X86] Memory folding for commutative instructions.
rL219584: [X86] Memory folding for commutative instructions.

Summary

This patch improves support for commutative instructions in the x86 memory folding implementation: if folding the original instruction fails, it attempts to fold a commuted version of the instruction. If that folding fails as well, the instruction is 're-commuted' back to its original operand order before returning.

This mainly helps the stack inliner better fold reloads of instructions with 3 (or more) operands (VEX-encoded SSE etc.). Because the commute is performed in the lowest-level foldMemoryOperandImpl implementation, it also replaces the equivalent code in X86InstrInfo::optimizeLoadInstr and is now used by FastISel too.

Unlike the X86InstrInfo::optimizeLoadInstr implementation, it uses findCommutedOpIndices instead of hard-coded commute operand indices.
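
The commute-and-retry flow described in the summary can be sketched roughly as follows. This is a toy model, not the patch's code: ToyInstr, FoldableIdx and foldWithCommute are hypothetical names invented for illustration, and the real implementation operates on MachineInstr through findCommutedOpIndices and commuteInstruction.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a three-operand machine instruction. These types are
// hypothetical and only model the control flow, not LLVM's MachineInstr.
struct ToyInstr {
  std::string Opcode;
  std::vector<std::string> Ops; // Ops[0] is the def; Ops[1], Ops[2] are sources.
  bool Commutable;              // True if Ops[1] and Ops[2] may be swapped.
  int FoldableIdx;              // Operand position with a memory form, or -1.
};

// Tries to replace operand OpIdx of MI with StackSlot. On success, writes the
// folded instruction to Out and returns true. On failure, returns false and
// leaves MI in its original operand order.
bool foldWithCommute(ToyInstr &MI, unsigned OpIdx, const std::string &StackSlot,
                     ToyInstr &Out, bool AllowCommute = true) {
  // First attempt: fold the operand where it currently sits.
  if (MI.FoldableIdx == static_cast<int>(OpIdx)) {
    Out = MI;
    Out.Ops[OpIdx] = StackSlot;
    return true;
  }
  // Direct fold failed: commute in place and retry once on the partner index.
  if (AllowCommute && MI.Commutable && MI.Ops.size() == 3 &&
      (OpIdx == 1 || OpIdx == 2)) {
    std::swap(MI.Ops[1], MI.Ops[2]);
    unsigned OtherIdx = (OpIdx == 1) ? 2 : 1;
    if (foldWithCommute(MI, OtherIdx, StackSlot, Out, /*AllowCommute=*/false))
      return true;
    // The retry failed as well: re-commute so the caller sees the original.
    std::swap(MI.Ops[1], MI.Ops[2]);
  }
  return false;
}
```

As in the patch, the retry passes AllowCommute=false, so the recursion cannot bounce between the two operand orders.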

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 14650. Oct 9 2014, 6:04 AM

RKSimon retitled this revision to [X86] Memory folding for commutative instructions.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: nadav, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added subscribers: alexr, Unknown Object (MLST).

Macro testcase:

Tidied up the check for commutative instructions: we can avoid isCommutable and call findCommutedOpIndices directly, which also lets us fold anything that is commutable only under certain conditions (e.g. FMA instructions).
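
A minimal sketch of the index check this implies, assuming a query that reports the two commutable operand positions. CommuteInfo and getCommutePartner are hypothetical names for illustration; in the patch the indices come from findCommutedOpIndices.

```cpp
#include <cassert>

// Hypothetical stand-in for the result of a findCommutedOpIndices-style query:
// whether the instruction can be commuted, and which two operands would swap.
struct CommuteInfo {
  bool Found;
  unsigned Idx1;
  unsigned Idx2;
};

// Returns true and sets PartnerIdx if FoldIdx is one of the two commutable
// operands. PartnerIdx is the position the value at FoldIdx occupies after
// the commute, i.e. the operand to retry the fold on.
bool getCommutePartner(const CommuteInfo &CI, unsigned FoldIdx,
                       unsigned &PartnerIdx) {
  if (!CI.Found)
    return false;
  if (CI.Idx1 == FoldIdx) {
    PartnerIdx = CI.Idx2;
    return true;
  }
  if (CI.Idx2 == FoldIdx) {
    PartnerIdx = CI.Idx1;
    return true;
  }
  // The operand we want to fold is not involved in the commute.
  return false;
}
```

Because the indices are queried rather than assumed, the same check would cover instructions whose commutable operands do not sit at fixed positions.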

Added the missing test case.

Hi Simon

lib/Target/X86/X86InstrInfo.cpp
4218 ↗	(On Diff #14650)	I am not sure this is what we want, generally speaking. Indeed, if you look at the code you modified earlier (BTW, having the full context would make the review easier), we may not want to keep the commuted instruction unless the commute is in place:

    MachineInstr *NewMI = commuteInstruction(MI, false);
    // Unable to commute.
    if (!NewMI) return 0;
    if (NewMI != MI) { // <--- here, if the instruction is different, we do not commute.
      // New instruction. It doesn't need to be kept.
      NewMI->eraseFromParent();
      return 0;
    }

Should we provide more arguments to have finer control?
4219 ↗	(On Diff #14650)	This assert is not valid. It is legal for commuteInstruction to return nullptr. Make an early exit here. See r208371 for more details.
4228 ↗	(On Diff #14650)	Ditto.

Hi Quentin,

I've added error handling code for commuteInstruction returning nullptr or a new MachineInstr*, both before and after the commuted folding attempt. This appears to be enough: we don't need to alter either the commute or the folding call arguments to support it, and we can restrict ourselves to in-place instruction commutes only.

Simon.
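
The three outcomes discussed above (nullptr, a new instruction, or an in-place commute) can be modelled roughly like this. Instr, CommuteResult and classifyCommute are hypothetical names for illustration only; the Erased flag stands in for calling eraseFromParent on the unwanted instruction.

```cpp
#include <cassert>

// Hypothetical stand-in for a machine instruction.
struct Instr {
  int Id;
  bool Erased; // Models the effect of eraseFromParent().
};

enum class CommuteResult { Failed, NotInPlace, InPlace };

// Classifies the pointer returned by a commute attempt on MI.
// - nullptr: the commute was not possible, so the caller should bail out.
// - a different instruction: the commute was not done in place; the new
//   instruction is discarded and the caller should bail out.
// - MI itself: the commute happened in place and folding can be retried.
CommuteResult classifyCommute(Instr *MI, Instr *Returned) {
  if (!Returned)
    return CommuteResult::Failed;
  if (Returned != MI) {
    Returned->Erased = true;
    return CommuteResult::NotInPlace;
  }
  return CommuteResult::InPlace;
}
```

Only the InPlace case proceeds to a second folding attempt; the same classification would apply again when undoing the commute after a failed retry.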

Hi Simon,

LGTM.

As a side question, do you see any performance difference with that patch?

Thanks,
-Quentin

This revision is now accepted and ready to land. Oct 10 2014, 9:35 AM

> As a side question, do you see any performance difference with that patch?

Thanks Quentin. On Jaguar I'm seeing a definite gain on (VEX-encoded) SSE-heavy code loops (physics, animation and vectormath), entirely due to stack reload folding. On my Sandy Bridge Xeon the difference is much smaller, to the point of being negligible. There are also minor improvements to code size and instruction packing.

Simon.

Closed by commit rL219584 (authored by @RKSimon).

Excuse me, I have reverted this in r219595.
It broke i686 builders.

I'll send a reproducible testcase.ll later.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86FastISel.cpp	2 lines
llvm/trunk/lib/Target/X86/X86InstrInfo.h	2 lines
llvm/trunk/lib/Target/X86/X86InstrInfo.cpp	117 lines
llvm/trunk/test/CodeGen/X86/avx1-stack-reload-folding.ll	16 lines

Diff 14775

llvm/trunk/lib/Target/X86/X86FastISel.cpp

@@ 3,331 lines not shown @@ bool X86FastISel::tryToFoldLoadIntoMI(MachineInstr *MI, unsigned OpNo,

   if (Alignment == 0) // Ensure that codegen never sees alignment 0
     Alignment = DL.getABITypeAlignment(LI->getType());

   SmallVector<MachineOperand, 8> AddrOps;
   AM.getFullAddress(AddrOps);

   MachineInstr *Result =
-    XII.foldMemoryOperandImpl(*FuncInfo.MF, MI, OpNo, AddrOps, Size, Alignment);
+    XII.foldMemoryOperandImpl(*FuncInfo.MF, MI, OpNo, AddrOps, Size, Alignment, /*AllowCommute=*/ true);
   if (!Result)
     return false;

   Result->addMemOperand(*FuncInfo.MF, createMachineMemOperandFor(LI));
   FuncInfo.MBB->insert(FuncInfo.InsertPt, Result);
   MI->eraseFromParent();
   return true;
 }


 namespace llvm {
   FastISel *X86::createFastISel(FunctionLoweringInfo &funcInfo,
                                 const TargetLibraryInfo *libInfo) {
     return new X86FastISel(funcInfo, libInfo);
   }
 }

llvm/trunk/lib/Target/X86/X86InstrInfo.h

@@ 398 lines not shown @@ unsigned getUndefRegClearance(const MachineInstr *MI, unsigned &OpNum,
                                 const TargetRegisterInfo *TRI) const override;
   void breakPartialRegDependency(MachineBasicBlock::iterator MI, unsigned OpNum,
                                  const TargetRegisterInfo *TRI) const override;

   MachineInstr* foldMemoryOperandImpl(MachineFunction &MF,
                                       MachineInstr* MI,
                                       unsigned OpNum,
                                       const SmallVectorImpl<MachineOperand> &MOs,
-                                      unsigned Size, unsigned Alignment) const;
+                                      unsigned Size, unsigned Alignment, bool AllowCommute) const;

   void
   getUnconditionalBranch(MCInst &Branch,
                          const MCSymbolRefExpr *BranchTarget) const override;

   void getTrap(MCInst &MI) const override;

   bool isHighLatencyDef(int opc) const override;
@@ 49 lines not shown @@

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

@@ 3,920 lines not shown @@ optimizeLoadInstr(MachineInstr *MI, const MachineRegisterInfo *MRI,

   // Check whether we can move DefMI here.
   DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
   assert(DefMI);
   bool SawStore = false;
   if (!DefMI->isSafeToMove(this, nullptr, SawStore))
     return nullptr;

-  // We try to commute MI if possible.
-  unsigned IdxEnd = (MI->isCommutable()) ? 2 : 1;
-  for (unsigned Idx = 0; Idx < IdxEnd; Idx++) {
   // Collect information about virtual register operands of MI.
   unsigned SrcOperandId = 0;
   bool FoundSrcOperand = false;
   for (unsigned i = 0, e = MI->getDesc().getNumOperands(); i != e; ++i) {
     MachineOperand &MO = MI->getOperand(i);
     if (!MO.isReg())
       continue;
     unsigned Reg = MO.getReg();
     if (Reg != FoldAsLoadDefReg)
       continue;
     // Do not fold if we have a subreg use or a def or multiple uses.
     if (MO.getSubReg() || MO.isDef() || FoundSrcOperand)
       return nullptr;

     SrcOperandId = i;
     FoundSrcOperand = true;
   }
   if (!FoundSrcOperand) return nullptr;

   // Check whether we can fold the def into SrcOperandId.
   SmallVector<unsigned, 8> Ops;
   Ops.push_back(SrcOperandId);
   MachineInstr *FoldMI = foldMemoryOperand(MI, Ops, DefMI);
   if (FoldMI) {
     FoldAsLoadDefReg = 0;
     return FoldMI;
   }

-    if (Idx == 1) {
-      // MI was changed but it didn't help, commute it back!
-      commuteInstruction(MI, false);
-      return nullptr;
-    }
-
-    // Check whether we can commute MI and enable folding.
-    if (MI->isCommutable()) {
-      MachineInstr *NewMI = commuteInstruction(MI, false);
-      // Unable to commute.
-      if (!NewMI) return nullptr;
-      if (NewMI != MI) {
-        // New instruction. It doesn't need to be kept.
-        NewMI->eraseFromParent();
-        return nullptr;
-      }
-    }
-  }
   return nullptr;
 }

 /// Expand2AddrUndef - Expand a single-def pseudo instruction to a two-addr
 /// instruction with two undef reads of the register being defined. This is
 /// used for mapping:
 /// %xmm4 = V_SET0
 /// to:
@@ 143 lines not shown @@ if (NumAddrOps < 4) // FrameIndex only
     addOffset(MIB, 0);
   return MIB.addImm(0);
 }

 MachineInstr*
 X86InstrInfo::foldMemoryOperandImpl(MachineFunction &MF,
                                     MachineInstr *MI, unsigned i,
                                     const SmallVectorImpl<MachineOperand> &MOs,
-                                    unsigned Size, unsigned Align) const {
+                                    unsigned Size, unsigned Align, bool AllowCommute) const {
   const DenseMap<unsigned,
                  std::pair<unsigned,unsigned> > *OpcodeTablePtr = nullptr;
   bool isCallRegIndirect = Subtarget.callRegIndirect();
   bool isTwoAddrFold = false;

   // Atom favors register form of call. So, we do not fold loads into calls
   // when X86Subtarget is Atom.
   if (isCallRegIndirect &&
@@ 80 lines not shown @@ if (I != OpcodeTablePtr->end()) {
                                               X86::sub_32bit));
         else
           NewMI->getOperand(0).setSubReg(X86::sub_32bit);
       }
       return NewMI;
     }
   }

+  // If the instruction and target operand are commutable, commute the instruction and try again.
+  if (AllowCommute) {
+    unsigned OriginalOpIdx = i, CommuteOpIdx1, CommuteOpIdx2;
+    if (findCommutedOpIndices(MI, CommuteOpIdx1, CommuteOpIdx2)) {
+      if ((CommuteOpIdx1 == OriginalOpIdx) || (CommuteOpIdx2 == OriginalOpIdx)) {
+        MachineInstr* CommutedMI = commuteInstruction(MI, false);
+        if (!CommutedMI) {
+          // Unable to commute.
+          return nullptr;
+        }
+        if (CommutedMI != MI) {
+          // New instruction. We can't fold from this.
+          CommutedMI->eraseFromParent();
+          return nullptr;
+        }
+
+        // Attempt to fold with the commuted version of the instruction.
+        unsigned CommuteOpIdx = (CommuteOpIdx1 == OriginalOpIdx ? CommuteOpIdx2 : CommuteOpIdx1);
+        NewMI = foldMemoryOperandImpl(MF, MI, CommuteOpIdx, MOs, Size, Align, /*AllowCommute=*/ false);
+        if (NewMI)
+          return NewMI;
+
+        // Folding failed again - undo the commute before returning.
+        MachineInstr* UncommutedMI = commuteInstruction(MI, false);
+        if (!UncommutedMI) {
+          // Unable to commute.
+          return nullptr;
+        }
+        if (UncommutedMI != MI) {
+          // New instruction. It doesn't need to be kept.
+          UncommutedMI->eraseFromParent();
+          return nullptr;
+        }
+
+        // Return here to prevent duplicate fuse failure report.
+        return nullptr;
+      }
+    }
+  }
+
   // No fusion
   if (PrintFailedFusing && !MI->isCopy())
     dbgs() << "We failed to fuse operand " << i << " in " << *MI;
   return nullptr;
 }

 /// hasPartialRegUpdate - Return true for all instructions that only update
 /// the first 32 or 64-bits of the destination register and leave the rest
@@ 193 lines not shown @@ if (Ops.size() == 2 && Ops[0] == 0 && Ops[1] == 1) {
     // Change to CMPXXri r, 0 first.
     MI->setDesc(get(NewOpc));
     MI->getOperand(1).ChangeToImmediate(0);
   } else if (Ops.size() != 1)
     return nullptr;

   SmallVector<MachineOperand,4> MOs;
   MOs.push_back(MachineOperand::CreateFI(FrameIndex));
-  return foldMemoryOperandImpl(MF, MI, Ops[0], MOs, Size, Alignment);
+  return foldMemoryOperandImpl(MF, MI, Ops[0], MOs, Size, Alignment, /*AllowCommute=*/ true);
 }

 static bool isPartialRegisterLoad(const MachineInstr &LoadMI,
                                   const MachineFunction &MF) {
   unsigned Opc = LoadMI.getOpcode();
   unsigned RegSize =
     MF.getRegInfo().getRegClass(LoadMI.getOperand(0).getReg())->getSize();

@@ 136 lines not shown @@ if (isPartialRegisterLoad(*LoadMI, MF))
         return nullptr;

       // Folding a normal load. Just copy the load's address operands.
       for (unsigned i = NumOps - X86::AddrNumOperands; i != NumOps; ++i)
         MOs.push_back(LoadMI->getOperand(i));
       break;
     }
   }
-  return foldMemoryOperandImpl(MF, MI, Ops[0], MOs, 0, Alignment);
+  return foldMemoryOperandImpl(MF, MI, Ops[0], MOs, 0, Alignment, /*AllowCommute=*/ true);
 }


 bool X86InstrInfo::canFoldMemoryOperand(const MachineInstr *MI,
                                         const SmallVectorImpl<unsigned> &Ops) const {
   // Check switch flag
   if (NoFusing) return 0;

@@ 1,052 lines not shown @@

llvm/trunk/test/CodeGen/X86/avx1-stack-reload-folding.ll

+; RUN: llc -O3 -disable-peephole -mcpu=corei7-avx -mattr=+avx < %s | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-unknown"
+
+; Function Attrs: nounwind readonly uwtable
+define <32 x double> @_Z14vstack_foldDv32_dS_(<32 x double> %a, <32 x double> %b) #0 {
+  %1 = fadd <32 x double> %a, %b
+  %2 = fsub <32 x double> %a, %b
+  %3 = fmul <32 x double> %1, %2
+  ret <32 x double> %3
+
+  ;CHECK-NOT: vmovapd {{.*#+}} 32-byte Reload
+  ;CHECK: vmulpd {{[0-9]*}}(%rsp), {{%ymm[0-9][0-9]*}}, {{%ymm[0-9][0-9]*}} {{.*#+}} 32-byte Folded Reload
+  ;CHECK-NOT: vmovapd {{.*#+}} 32-byte Reload
+}