This is an archive of the discontinued LLVM Phabricator instance.

[X86] A first stab at a heuristic to estimate the size impact for converting movs to pushes
Closed · Public

Authored by mkuper on Feb 11 2015, 7:24 AM.

Details

Reviewers

Commits

rGdb95d04be4e9: [X86] A heuristic to estimate the size impact for converting stack-relative…
rL228915: [X86] A heuristic to estimate the size impact for converting stack-relative…

Summary

The idea is to go over all calls in the MachineFunction and compute:
a) For each call site that cannot use pushes, the penalty of not having a reserved call frame.
b) For each callsite that can use pushes, the gain of actually replacing the movs with pushes (and the potential penalty of having to readjust the stack).

This could be made more precise, e.g. by looking at the size of the constants, or even by constructing the potential instruction and asking the MC layer for its encoding size, not to mention trying to figure out the gains from folding. Still, this should be a decent first approximation.
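The two-part accounting described above can be sketched as a standalone function. This is a minimal sketch rather than the pass itself: `CallSite` and `estimateAdvantage` are hypothetical names introduced here, and the byte costs (6 for a sub/add pair, 3 per stack adjustment, 3 saved per push, 4 bytes per argument) are the approximations that appear in the committed diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-call-site information the pass collects.
struct CallSite {
  bool NoStackParams;   // no arguments are passed on the stack
  bool UsePush;         // the movs at this site can be turned into pushes
  int64_t ExpectedDist; // total bytes of stack arguments
};

// Estimated net size gain, in bytes, of giving up the reserved call frame.
// A non-negative result means the transformation looks profitable.
int64_t estimateAdvantage(const std::vector<CallSite> &Sites,
                          unsigned StackAlign) {
  assert(StackAlign > 0 && "alignment must be positive");
  int64_t Advantage = 0;
  for (const CallSite &CS : Sites) {
    // Sites without stack arguments need no stack adjustment either way.
    if (CS.NoStackParams)
      continue;
    if (!CS.UsePush) {
      // (a) No pushes possible: pay for an extra sub/add pair, ~3 bytes each.
      Advantage -= 6;
      continue;
    }
    // (b) Pushes possible: one add after the call...
    Advantage -= 3;
    // ...one more adjustment if the arguments do not fill an aligned slot...
    if (CS.ExpectedDist % StackAlign)
      Advantage -= 3;
    // ...and roughly 3 bytes saved per 4-byte argument pushed.
    Advantage += (CS.ExpectedDist / 4) * 3;
  }
  return Advantage;
}
```

For example, assuming a 4-byte stack alignment, one push-capable four-argument site between two sites that cannot use pushes gives -6 + 9 - 6 = -3, so the movs are kept; two push-capable sites around one that is not give 9 - 6 + 9 = 12, so the pushes win.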

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 19754. Feb 11 2015, 7:24 AM

mkuper retitled this revision from to [X86] A first stab at a heuristic to estimate the size impact for converting movs to pushes.

mkuper updated this object.

mkuper edited the test plan for this revision.

mkuper added a reviewer: rnk.

mkuper added a subscriber: Unknown Object (MLST).

Functions that do not take any parameters through the stack (or no parameters at all) should not count towards the heuristic.
Thanks to Roman Divacky for providing a test-case where this matters.

Same as before, but with a flow that makes a bit more sense.

lgtm

lib/Target/X86/X86CallFrameOptimization.cpp
193–195 ↗	(On Diff #19767)	Not if the calling convention is callee-pop. In fact, if the convention is callee-pop, using a reserved call frame requires a sub, which should give the 'mov' lowering a penalty. Anyway, not a blocking issue, just a heuristic worth adding.
199 ↗	(On Diff #19767)	Can we motivate the 3 byte saving heuristic a bit more?
test/CodeGen/X86/movtopush.ll
298–299 ↗	(On Diff #19767)	I'd really like to be able to convert to pushes for __thiscall methods, which effectively have one inreg parameter.

This revision is now accepted and ready to land. Feb 11 2015, 2:00 PM

Thanks, Reid!

lib/Target/X86/X86CallFrameOptimization.cpp
193–195 ↗	(On Diff #19767)	Right, will add a TODO here and get to that separately, thanks.
199 ↗	(On Diff #19767)	So, 3 is an average value that looks reasonable, although it may be a bit conservative. It depends on two things:
a) The value being put on the stack (register, 8-bit integer, or >8-bit integer).
b) For a mov, the displacement w.r.t. %esp (0, < 7-bits, > 7-bits).
For pushes, the encoding sizes for the three options of (a) are 1/2/5.
For mov (%esp), they are 3/7/7, for a difference of 2/5/2. But this can only happen once per call site.
For mov k(%esp) with k < 128, which is probably the most common case, an additional byte is encoded, so they are 4/8/8, for a difference of 3/6/3.
For mov k(%esp) with k >= 128, 4 bytes are used to encode k, so we have 7/11/11. This is probably fairly rare and can be ignored.
To me, looking at the numbers, 3 seems like a good bet, unless we want to special-case each of the above options. It won't be too precise (this also doesn't factor in the potential benefit of removing a mov by folding), but I'm not trying to be extremely precise here; I just want to avoid making "obviously wrong" decisions.
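The encoding sizes quoted in this reply can be tabulated in a few lines of code. This is only an illustrative sketch: the enum and function names are hypothetical, and the byte counts are copied from the comment above rather than obtained from the MC layer.

```cpp
#include <cassert>

// What is being stored: a register, an 8-bit immediate, or a wider immediate.
enum class Operand { Register, Imm8, Imm32 };
// Displacement from %esp: none, one byte (k < 128), or four bytes (k >= 128).
enum class Disp { None, Byte, Dword };

// Push encoding sizes as quoted in the review: 1/2/5 bytes.
int pushSize(Operand Op) {
  switch (Op) {
  case Operand::Register: return 1;
  case Operand::Imm8:     return 2;
  case Operand::Imm32:    return 5;
  }
  assert(false && "unknown operand kind");
  return 0;
}

// Mov-to-stack encoding sizes as quoted in the review: 3/7/7 with no
// displacement, plus 1 byte for a short displacement or 4 for a long one.
int movSize(Operand Op, Disp D) {
  int Size = (Op == Operand::Register) ? 3 : 7;
  switch (D) {
  case Disp::None:  break;
  case Disp::Byte:  Size += 1; break;
  case Disp::Dword: Size += 4; break;
  }
  return Size;
}

// Bytes saved by replacing one mov with the corresponding push.
int bytesSaved(Operand Op, Disp D) { return movSize(Op, D) - pushSize(Op); }
```

With these figures the common one-byte-displacement case saves 3, 6, and 3 bytes for the three operand kinds, which is where the flat estimate of 3 bytes per push comes from.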
test/CodeGen/X86/movtopush.ll
298–299 ↗	(On Diff #19767)	I agree. The next two things I want to do here are remove the push <fi> restrictions, and support __thiscall.

Closed by commit rL228915: [X86] A heuristic to estimate the size impact for converting stack-relative… (authored by mkuper). Feb 12 2015, 12:38 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86CallFrameOptimization.cpp	96 lines
llvm/trunk/test/CodeGen/X86/movtopush.ll	88 lines

Diff 19809

llvm/trunk/lib/Target/X86/X86CallFrameOptimization.cpp

(44 lines not shown)
namespace {		namespace {
class X86CallFrameOptimization : public MachineFunctionPass {		class X86CallFrameOptimization : public MachineFunctionPass {
public:		public:
X86CallFrameOptimization() : MachineFunctionPass(ID) {}		X86CallFrameOptimization() : MachineFunctionPass(ID) {}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

private:		private:
bool shouldPerformTransformation(MachineFunction &MF);

// Information we know about a particular call site		// Information we know about a particular call site
struct CallContext {		struct CallContext {
CallContext()		CallContext()
: Call(nullptr), SPCopy(nullptr), ExpectedDist(0),		: Call(nullptr), SPCopy(nullptr), ExpectedDist(0),
MovVector(4, nullptr), UsePush(false){};		MovVector(4, nullptr), NoStackParams(false), UsePush(false){};

// Actual call instruction		// Actual call instruction
MachineInstr *Call;		MachineInstr *Call;

// A copy of the stack pointer		// A copy of the stack pointer
MachineInstr *SPCopy;		MachineInstr *SPCopy;

// The total displacement of all passed parameters		// The total displacement of all passed parameters
int64_t ExpectedDist;		int64_t ExpectedDist;

// The sequence of movs used to pass the parameters		// The sequence of movs used to pass the parameters
SmallVector<MachineInstr *, 4> MovVector;		SmallVector<MachineInstr *, 4> MovVector;

// Whether this site should use push instructions		// True if this call site has no stack parameters
		bool NoStackParams;

		// True if this call site can use push instructions
bool UsePush;		bool UsePush;
};		};

		typedef DenseMap<MachineInstr *, CallContext> ContextMap;

		bool isLegal(MachineFunction &MF);

		bool isProfitable(MachineFunction &MF, ContextMap &CallSeqMap);

void collectCallInfo(MachineFunction &MF, MachineBasicBlock &MBB,		void collectCallInfo(MachineFunction &MF, MachineBasicBlock &MBB,
MachineBasicBlock::iterator I, CallContext &Context);		MachineBasicBlock::iterator I, CallContext &Context);

bool adjustCallSequence(MachineFunction &MF, MachineBasicBlock::iterator I,		bool adjustCallSequence(MachineFunction &MF, MachineBasicBlock::iterator I,
const CallContext &Context);		const CallContext &Context);

MachineInstr *canFoldIntoRegPush(MachineBasicBlock::iterator FrameSetup,		MachineInstr *canFoldIntoRegPush(MachineBasicBlock::iterator FrameSetup,
unsigned Reg);		unsigned Reg);

const char *getPassName() const override { return "X86 Optimize Call Frame"; }		const char *getPassName() const override { return "X86 Optimize Call Frame"; }

const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetFrameLowering *TFL;		const TargetFrameLowering *TFL;
const MachineRegisterInfo *MRI;		const MachineRegisterInfo *MRI;
static char ID;		static char ID;
};		};

char X86CallFrameOptimization::ID = 0;		char X86CallFrameOptimization::ID = 0;
}		}

FunctionPass *llvm::createX86CallFrameOptimization() {		FunctionPass *llvm::createX86CallFrameOptimization() {
return new X86CallFrameOptimization();		return new X86CallFrameOptimization();
}		}

// This checks whether the transformation is legal and profitable		// This checks whether the transformation is legal.
bool X86CallFrameOptimization::shouldPerformTransformation(		// Also returns false in cases where it's potentially legal, but
MachineFunction &MF) {		// we don't even want to try.
		bool X86CallFrameOptimization::isLegal(MachineFunction &MF) {
if (NoX86CFOpt.getValue())		if (NoX86CFOpt.getValue())
return false;		return false;

// We currently only support call sequences where all parameters.		// We currently only support call sequences where all parameters.
// are passed on the stack.		// are passed on the stack.
// No point in running this in 64-bit mode, since some arguments are		// No point in running this in 64-bit mode, since some arguments are
// passed in-register in all common calling conventions, so the pattern		// passed in-register in all common calling conventions, so the pattern
// we're looking for will never match.		// we're looking for will never match.
Show All 23 Lines	for (MachineInstr &MI : BB) {
InsideFrameSequence = false;		InsideFrameSequence = false;
}		}
}		}

if (InsideFrameSequence)		if (InsideFrameSequence)
return false;		return false;
}		}

// Now that we know the transformation is legal, check if it is		return true;
// profitable.		}
// TODO: Add a heuristic that actually looks at the function,
// and enable this for more cases.

// This transformation is always a win when we expected to have		// Check whether this transformation is profitable for a particular
		// function - in terms of code size.
		bool X86CallFrameOptimization::isProfitable(MachineFunction &MF,
		ContextMap &CallSeqMap) {
		// This transformation is always a win when we do not expect to have
// a reserved call frame. Under other circumstances, it may be either		// a reserved call frame. Under other circumstances, it may be either
// a win or a loss, and requires a heuristic.		// a win or a loss, and requires a heuristic.
// For now, enable it only for the relatively clear win cases.
bool CannotReserveFrame = MF.getFrameInfo()->hasVarSizedObjects();		bool CannotReserveFrame = MF.getFrameInfo()->hasVarSizedObjects();
if (CannotReserveFrame)		if (CannotReserveFrame)
return true;		return true;

// For now, don't even try to evaluate the profitability when		// Don't do this when not optimizing for size.
// not optimizing for size.
AttributeSet FnAttrs = MF.getFunction()->getAttributes();		AttributeSet FnAttrs = MF.getFunction()->getAttributes();
bool OptForSize =		bool OptForSize =
FnAttrs.hasAttribute(AttributeSet::FunctionIndex,		FnAttrs.hasAttribute(AttributeSet::FunctionIndex,
Attribute::OptimizeForSize) ||		Attribute::OptimizeForSize) ||
FnAttrs.hasAttribute(AttributeSet::FunctionIndex, Attribute::MinSize);		FnAttrs.hasAttribute(AttributeSet::FunctionIndex, Attribute::MinSize);

if (!OptForSize)		if (!OptForSize)
return false;		return false;

// Stack re-alignment can make this unprofitable even in terms of size.
// As mentioned above, a better heuristic is needed. For now, don't do this
// when the required alignment is above 8. (4 would be the safe choice, but
// some experimentation showed 8 is generally good).
if (TFL->getStackAlignment() > 8)
return false;

return true;		unsigned StackAlign = TFL->getStackAlignment();

		int64_t Advantage = 0;
		for (auto CC : CallSeqMap) {
		// Call sites where no parameters are passed on the stack
		// do not affect the cost, since there needs to be no
		// stack adjustment.
		if (CC.second.NoStackParams)
		continue;

		if (!CC.second.UsePush) {
		// If we don't use pushes for a particular call site,
		// we pay for not having a reserved call frame with an
		// additional sub/add esp pair. The cost is ~3 bytes per instruction,
		// depending on the size of the constant.
		// TODO: Callee-pop functions should have a smaller penalty, because
		// an add is needed even with a reserved call frame.
		Advantage -= 6;
		} else {
		// We can use pushes. First, account for the fixed costs.
		// We'll need an add after the call.
		Advantage -= 3;
		// If we have to realign the stack, we'll also need a sub before the call.
		if (CC.second.ExpectedDist % StackAlign)
		Advantage -= 3;
		// Now, for each push, we save ~3 bytes. For small constants, we actually
		// save more (up to 5 bytes), but 3 should be a good approximation.
		Advantage += (CC.second.ExpectedDist / 4) * 3;
		}
}		}

		return (Advantage >= 0);
		}


bool X86CallFrameOptimization::runOnMachineFunction(MachineFunction &MF) {		bool X86CallFrameOptimization::runOnMachineFunction(MachineFunction &MF) {
TII = MF.getSubtarget().getInstrInfo();		TII = MF.getSubtarget().getInstrInfo();
TFL = MF.getSubtarget().getFrameLowering();		TFL = MF.getSubtarget().getFrameLowering();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();

if (!shouldPerformTransformation(MF))		if (!isLegal(MF))
return false;		return false;

int FrameSetupOpcode = TII->getCallFrameSetupOpcode();		int FrameSetupOpcode = TII->getCallFrameSetupOpcode();

bool Changed = false;		bool Changed = false;

DenseMap<MachineInstr *, CallContext> CallSeqMap;		ContextMap CallSeqMap;

for (MachineFunction::iterator BB = MF.begin(), E = MF.end(); BB != E; ++BB)		for (MachineFunction::iterator BB = MF.begin(), E = MF.end(); BB != E; ++BB)
for (MachineBasicBlock::iterator I = BB->begin(); I != BB->end(); ++I)		for (MachineBasicBlock::iterator I = BB->begin(); I != BB->end(); ++I)
if (I->getOpcode() == FrameSetupOpcode) {		if (I->getOpcode() == FrameSetupOpcode) {
CallContext &Context = CallSeqMap[I];		CallContext &Context = CallSeqMap[I];
collectCallInfo(MF, *BB, I, Context);		collectCallInfo(MF, *BB, I, Context);
}		}

		if (!isProfitable(MF, CallSeqMap))
		return false;

for (auto CC : CallSeqMap)		for (auto CC : CallSeqMap)
if (CC.second.UsePush)		if (CC.second.UsePush)
Changed |= adjustCallSequence(MF, CC.first, CC.second);		Changed |= adjustCallSequence(MF, CC.first, CC.second);

return Changed;		return Changed;
}		}

void X86CallFrameOptimization::collectCallInfo(MachineFunction &MF,		void X86CallFrameOptimization::collectCallInfo(MachineFunction &MF,
MachineBasicBlock &MBB,		MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
CallContext &Context) {		CallContext &Context) {
// Check that this particular call sequence is amenable to the		// Check that this particular call sequence is amenable to the
// transformation.		// transformation.
const X86RegisterInfo &RegInfo = *static_cast<const X86RegisterInfo *>(		const X86RegisterInfo &RegInfo = *static_cast<const X86RegisterInfo *>(
MF.getSubtarget().getRegisterInfo());		MF.getSubtarget().getRegisterInfo());
unsigned StackPtr = RegInfo.getStackRegister();		unsigned StackPtr = RegInfo.getStackRegister();
int FrameDestroyOpcode = TII->getCallFrameDestroyOpcode();		int FrameDestroyOpcode = TII->getCallFrameDestroyOpcode();

// We expect to enter this at the beginning of a call sequence		// We expect to enter this at the beginning of a call sequence
assert(I->getOpcode() == TII->getCallFrameSetupOpcode());		assert(I->getOpcode() == TII->getCallFrameSetupOpcode());
MachineBasicBlock::iterator FrameSetup = I++;		MachineBasicBlock::iterator FrameSetup = I++;

		// How much do we adjust the stack? This puts an upper bound on
		// the number of parameters actually passed on it.
		unsigned int MaxAdjust = FrameSetup->getOperand(0).getImm() / 4;

		// A zero adjustment means no stack parameters
		if (!MaxAdjust) {
		Context.NoStackParams = true;
		return;
		}

// For globals in PIC mode, we can have some LEAs here.		// For globals in PIC mode, we can have some LEAs here.
// Ignore them, they don't bother us.		// Ignore them, they don't bother us.
// TODO: Extend this to something that covers more cases.		// TODO: Extend this to something that covers more cases.
while (I->getOpcode() == X86::LEA32r)		while (I->getOpcode() == X86::LEA32r)
++I;		++I;

// We expect a copy instruction here.		// We expect a copy instruction here.
// TODO: The copy instruction is a lowering artifact.		// TODO: The copy instruction is a lowering artifact.
// We should also support a copy-less version, where the stack		// We should also support a copy-less version, where the stack
// pointer is used directly.		// pointer is used directly.
if (!I->isCopy() || !I->getOperand(0).isReg())		if (!I->isCopy() || !I->getOperand(0).isReg())
return;		return;
Context.SPCopy = I++;		Context.SPCopy = I++;
StackPtr = Context.SPCopy->getOperand(0).getReg();		StackPtr = Context.SPCopy->getOperand(0).getReg();

// Scan the call setup sequence for the pattern we're looking for.		// Scan the call setup sequence for the pattern we're looking for.
// We only handle a simple case - a sequence of MOV32mi or MOV32mr		// We only handle a simple case - a sequence of MOV32mi or MOV32mr
// instructions, that push a sequence of 32-bit values onto the stack, with		// instructions, that push a sequence of 32-bit values onto the stack, with
// no gaps between them.		// no gaps between them.
unsigned int MaxAdjust = FrameSetup->getOperand(0).getImm() / 4;
if (MaxAdjust > 4)		if (MaxAdjust > 4)
Context.MovVector.resize(MaxAdjust, nullptr);		Context.MovVector.resize(MaxAdjust, nullptr);

do {		do {
int Opcode = I->getOpcode();		int Opcode = I->getOpcode();
if (Opcode != X86::MOV32mi && Opcode != X86::MOV32mr)		if (Opcode != X86::MOV32mi && Opcode != X86::MOV32mr)
break;		break;

(195 lines not shown)

llvm/trunk/test/CodeGen/X86/movtopush.ll

	; RUN: llc < %s -mtriple=i686-windows | FileCheck %s -check-prefix=NORMAL			; RUN: llc < %s -mtriple=i686-windows | FileCheck %s -check-prefix=NORMAL
	; RUN: llc < %s -mtriple=x86_64-windows | FileCheck %s -check-prefix=X64			; RUN: llc < %s -mtriple=x86_64-windows | FileCheck %s -check-prefix=X64
	; RUN: llc < %s -mtriple=i686-windows -force-align-stack -stack-alignment=32 | FileCheck %s -check-prefix=ALIGNED			; RUN: llc < %s -mtriple=i686-windows -force-align-stack -stack-alignment=32 | FileCheck %s -check-prefix=ALIGNED

	declare void @good(i32 %a, i32 %b, i32 %c, i32 %d)			declare void @good(i32 %a, i32 %b, i32 %c, i32 %d)
	declare void @inreg(i32 %a, i32 inreg %b, i32 %c, i32 %d)			declare void @inreg(i32 %a, i32 inreg %b, i32 %c, i32 %d)
				declare void @oneparam(i32 %a)
				declare void @eightparams(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32 %f, i32 %g, i32 %h)


	; Here, we should have a reserved frame, so we don't expect pushes			; Here, we should have a reserved frame, so we don't expect pushes
	; NORMAL-LABEL: test1:			; NORMAL-LABEL: test1:
	; NORMAL: subl $16, %esp			; NORMAL: subl $16, %esp
	; NORMAL-NEXT: movl $4, 12(%esp)			; NORMAL-NEXT: movl $4, 12(%esp)
	; NORMAL-NEXT: movl $3, 8(%esp)			; NORMAL-NEXT: movl $3, 8(%esp)
	; NORMAL-NEXT: movl $2, 4(%esp)			; NORMAL-NEXT: movl $2, 4(%esp)
	; NORMAL-NEXT: movl $1, (%esp)			; NORMAL-NEXT: movl $1, (%esp)
	(115 lines not shown)
	; ALIGNED-NEXT: call			; ALIGNED-NEXT: call
	define void @test5(i32 %k) {			define void @test5(i32 %k) {
	entry:			entry:
	%a = alloca i32, i32 %k			%a = alloca i32, i32 %k
	call void @good(i32 1, i32 2, i32 3, i32 4)			call void @good(i32 1, i32 2, i32 3, i32 4)
	ret void			ret void
	}			}

				; When the alignment adds up, do the transformation
				; ALIGNED-LABEL: test5b:
				; ALIGNED: pushl $8
				; ALIGNED-NEXT: pushl $7
				; ALIGNED-NEXT: pushl $6
				; ALIGNED-NEXT: pushl $5
				; ALIGNED-NEXT: pushl $4
				; ALIGNED-NEXT: pushl $3
				; ALIGNED-NEXT: pushl $2
				; ALIGNED-NEXT: pushl $1
				; ALIGNED-NEXT: call
				define void @test5b() optsize {
				entry:
				call void @eightparams(i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8)
				ret void
				}

				; When having to compensate for the alignment isn't worth it,
				; don't use pushes.
				; ALIGNED-LABEL: test5c:
				; ALIGNED: movl $1, (%esp)
				; ALIGNED-NEXT: call
				define void @test5c() optsize {
				entry:
				call void @oneparam(i32 1)
				ret void
				}
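The two alignment tests above (test5b and test5c) can be checked by hand against the heuristic's per-site arithmetic. The sketch below covers only a single push-capable call site; `pushSiteAdvantage` is a hypothetical helper, and the 3-byte costs are the approximations from the diff, not measured encodings.

```cpp
#include <cassert>
#include <cstdint>

// Estimated byte gain for one call site whose arguments can all be pushed.
// ArgBytes is the total size of the stack arguments (4 bytes each).
int64_t pushSiteAdvantage(int64_t ArgBytes, unsigned StackAlign) {
  assert(ArgBytes > 0 && ArgBytes % 4 == 0 && StackAlign > 0);
  int64_t Advantage = -3;          // the add after the call
  if (ArgBytes % StackAlign != 0)  // arguments do not fill an aligned slot
    Advantage -= 3;                // extra adjustment before the pushes
  Advantage += (ArgBytes / 4) * 3; // ~3 bytes saved per push
  return Advantage;
}
```

With a 32-byte stack alignment, eight arguments (32 bytes) give -3 + 24 = 21, so pushes are used; a single argument (4 bytes) gives -3 - 3 + 3 = -3, so the mov is kept.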

	; Check that pushing the addresses of globals (Or generally, things that			; Check that pushing the addresses of globals (Or generally, things that
	; aren't exactly immediates) isn't broken.			; aren't exactly immediates) isn't broken.
	; Fixes PR21878.			; Fixes PR21878.
	; NORMAL-LABEL: test6:			; NORMAL-LABEL: test6:
	; NORMAL: pushl $_ext			; NORMAL: pushl $_ext
	; NORMAL-NEXT: call			; NORMAL-NEXT: call
	declare void @f(i8*)			declare void @f(i8*)
	@ext = external constant i8			@ext = external constant i8
	(105 lines not shown)
	; NORMAL-NEXT: addl $16, %esp			; NORMAL-NEXT: addl $16, %esp
	@the_global = external global i32			@the_global = external global i32
	define void @test11() optsize {			define void @test11() optsize {
	%myload = load i32* @the_global			%myload = load i32* @the_global
	store i32 42, i32* @the_global			store i32 42, i32* @the_global
	call void @good(i32 %myload, i32 2, i32 3, i32 4)			call void @good(i32 %myload, i32 2, i32 3, i32 4)
	ret void			ret void
	}			}

				; Converting one mov into a push isn't worth it when
				; doing so forces too much overhead for other calls.
				; NORMAL-LABEL: test12:
				; NORMAL: subl $16, %esp
				; NORMAL-NEXT: movl $4, 8(%esp)
				; NORMAL-NEXT: movl $3, 4(%esp)
				; NORMAL-NEXT: movl $1, (%esp)
				; NORMAL-NEXT: movl $2, %eax
				; NORMAL-NEXT: calll _inreg
				; NORMAL-NEXT: movl $8, 12(%esp)
				; NORMAL-NEXT: movl $7, 8(%esp)
				; NORMAL-NEXT: movl $6, 4(%esp)
				; NORMAL-NEXT: movl $5, (%esp)
				; NORMAL-NEXT: calll _good
				; NORMAL-NEXT: movl $12, 8(%esp)
				; NORMAL-NEXT: movl $11, 4(%esp)
				; NORMAL-NEXT: movl $9, (%esp)
				; NORMAL-NEXT: movl $10, %eax
				; NORMAL-NEXT: calll _inreg
				; NORMAL-NEXT: addl $16, %esp
				define void @test12() optsize {
				entry:
				call void @inreg(i32 1, i32 2, i32 3, i32 4)
				call void @good(i32 5, i32 6, i32 7, i32 8)
				call void @inreg(i32 9, i32 10, i32 11, i32 12)
				ret void
				}

				; But if the gains outweigh the overhead, we should do it
				; NORMAL-LABEL: test12b:
				; NORMAL: pushl $4
				; NORMAL-NEXT: pushl $3
				; NORMAL-NEXT: pushl $2
				; NORMAL-NEXT: pushl $1
				; NORMAL-NEXT: calll _good
				; NORMAL-NEXT: addl $16, %esp
				; NORMAL-NEXT: subl $12, %esp
				; NORMAL-NEXT: movl $8, 8(%esp)
				; NORMAL-NEXT: movl $7, 4(%esp)
				; NORMAL-NEXT: movl $5, (%esp)
				; NORMAL-NEXT: movl $6, %eax
				; NORMAL-NEXT: calll _inreg
				; NORMAL-NEXT: addl $12, %esp
				; NORMAL-NEXT: pushl $12
				; NORMAL-NEXT: pushl $11
				; NORMAL-NEXT: pushl $10
				; NORMAL-NEXT: pushl $9
				; NORMAL-NEXT: calll _good
				; NORMAL-NEXT: addl $16, %esp
				define void @test12b() optsize {
				entry:
				call void @good(i32 1, i32 2, i32 3, i32 4)
				call void @inreg(i32 5, i32 6, i32 7, i32 8)
				call void @good(i32 9, i32 10, i32 11, i32 12)
				ret void
				}