This is an archive of the discontinued LLVM Phabricator instance.

Differential D20003

X86CallFrameOpt: a first step towards optimizing inalloca calls (PR27076)
Abandoned (Public)

Authored by hans on May 5 2016, 4:30 PM.

Details

Reviewers

mkuper
DavidKreitzer
rnk

Summary

Using pushes to move arguments into the stack results in significantly smaller code. We can also remove the _chkstk call, as the pushes probe the stack naturally.

This patch only covers basic cases where there are no complicated instructions in the call sequence. Inalloca calls often have e.g. nested calls or control flow in the call sequence, so in practice this patch doesn't fire a lot, but it's a start.
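To put a rough number on "significantly smaller", the sketch below adds up instruction sizes for the store-based sequence that the inalloca-stdcall.ll test in this patch checks for, and for a push-based equivalent. It is a standalone illustration, not LLVM code: the `Insn` struct and the helper names are made up for this example, and the byte counts assume the shortest standard IA-32 encodings.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical record: one instruction and its encoded size in bytes.
struct Insn {
  const char *Text;
  unsigned Bytes;
};

// Sum of the encoded sizes of a sequence.
inline unsigned totalBytes(const std::vector<Insn> &Seq) {
  return std::accumulate(
      Seq.begin(), Seq.end(), 0u,
      [](unsigned Sum, const Insn &I) { return Sum + I.Bytes; });
}

// Store-based setup of an 8-byte argument block holding { 13, 42 }.
inline std::vector<Insn> storeSequence() {
  return {
      {"movl $8, %eax", 5},     // B8 imm32
      {"calll __chkstk", 5},    // E8 rel32
      {"movl %esp, %eax", 2},   // 89 /r
      {"movl $13, (%eax)", 6},  // C7 /0 imm32
      {"movl $42, 4(%eax)", 7}, // C7 /0 disp8 imm32
  };
}

// Push-based setup of the same block; the higher slot is pushed first.
inline std::vector<Insn> pushSequence() {
  return {
      {"pushl $42", 2}, // 6A imm8
      {"pushl $13", 2}, // 6A imm8
  };
}

// Bytes saved by the push form under the assumptions above.
inline unsigned bytesSaved() {
  unsigned Store = totalBytes(storeSequence());
  unsigned Push = totalBytes(pushSequence());
  assert(Push <= Store && "push sequence expected to be the shorter one");
  return Store - Push;
}
```

Immediates that do not fit in a signed byte would need the 5-byte `pushl $imm32` form, so the saving shrinks for large constants.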

Please take a look.

Event Timeline

hans updated this revision to Diff 56368. May 5 2016, 4:30 PM

hans retitled this revision to X86CallFrameOpt: a first step towards optimizing inalloca calls (PR27076).

hans updated this object.

hans added reviewers: rnk, mkuper, DavidKreitzer.

hans added a subscriber: llvm-commits.

Hi Hans,
I have a few small comments, none of them very important.

Regarding the general direction - are you sure this can be extended to the more complicated cases in a sane way?
The reason I went in this direction for the regular mov -> push transformation was that the number of different patterns we actually generate for call sequences is pretty limited. So it seemed a dumb linear instruction search would be able to cover most cases. There were some exceptions - I think one was a pseudo instruction expansion inside the call sequence that created control flow, which meant the FrameSetup and FrameDestroy ended up in different basic blocks. But this was very rare.
I'm not really familiar with the IR we generate for inalloca, but the nested stuff seems scary.

lib/Target/X86/X86CallFrameOptimization.cpp
363	Is this often or always? It seems like we're only going to match the vreg case, right?
384	We use _alloca() for cygwin/mingw. Handling only _chkstk sounds reasonable, but I'm not sure I'm a fan of having this string hardcoded here. Perhaps factor the code that decides on the symbol name out of X86FrameLowering::emitStackProbeCall() and call that here as well?
458	As long as you're touching this - the comment is no longer correct. Could you please delete it?
658	incalloca -> inalloca
674	Is it possible for ChkstkResultCopy and AmountMov to have more uses? I don't see why that would happen, but I don't see anything completely preventing it either.
test/CodeGen/X86/inalloca.ll
1	Are you sure you want to have these tests run only with -no-x86-call-frame-opt? This way we don't actually cover the "normal" case of x86-call-frame-opt + thiscall, etc.

Hi Hans,

I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:

struct S
{
    S(const S&);
    char a[3000];
};

void f1(S, int);
int f2(S);
void f3(S *s)
{
  f1(*s, f2(*s));
}

We get this from clang:

define void @"\01?f3@@YAXPAUS@@@Z"(%struct.S* %s) #0 {
entry:
  %argmem4 = alloca inalloca <{ %struct.S, i32 }>, align 4
  %inalloca.save1 = tail call i8* @llvm.stacksave()
  %argmem = alloca inalloca <{ %struct.S }>, align 4
  %0 = getelementptr inbounds <{ %struct.S }>, <{ %struct.S }>* %argmem, i32 0, i32 0
  %call = call x86_thiscallcc %struct.S* @"\01??0S@@QAE@ABU0@@Z"(%struct.S* %0, %struct.S* dereferenceable(3000) %s)
  %call2 = call i32 @"\01?f2@@YAHUS@@@Z"(<{ %struct.S }>* inalloca nonnull %argmem)
  call void @llvm.stackrestore(i8* %inalloca.save1)
  %1 = getelementptr inbounds <{ %struct.S, i32 }>, <{ %struct.S, i32 }>* %argmem4, i32 0, i32 0
  %call3 = call x86_thiscallcc %struct.S* @"\01??0S@@QAE@ABU0@@Z"(%struct.S* %1, %struct.S* dereferenceable(3000) %s)
  %2 = getelementptr inbounds <{ %struct.S, i32 }>, <{ %struct.S, i32 }>* %argmem4, i32 0, i32 1
  store i32 %call2, i32* %2, align 4, !tbaa !1
  call void @"\01?f1@@YAXUS@@H@Z"(<{ %struct.S, i32 }>* inalloca nonnull %argmem4)
  ret void
}

See how there are two inalloca calls at the top? They effectively grow the stack by 6000 bytes (which is big enough to require _chkstk even though the separate 3000-byte allocations are not). Neither MSVC nor ICC do it like this. They both allocate space for the call to f2, call f2, and cleanup the stack from calling f2 before allocating space for the call to f1. I think we need to fix clang to do something similar and then David's solution ought to work.
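The arithmetic behind this point can be sketched in a few lines of standalone C++. This is only a model, not LLVM code: `Event`, `peakOutstandingBytes`, the two order helpers, and the 4096-byte page size are assumptions made for illustration. It tracks how many bytes of argument memory are live at once when the allocations are nested, as in the IR above, and when the first block is released before the second is allocated, as described for MSVC and ICC.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: a positive value allocates argument memory, a
// negative value releases it (the call returned and the stack was restored).
using Event = std::int64_t;

// Assumed page size for 32-bit Windows.
constexpr std::int64_t PageSize = 4096;

// Largest number of bytes of argument memory live at any one time.
inline std::int64_t peakOutstandingBytes(const std::vector<Event> &Events) {
  std::int64_t Live = 0, Peak = 0;
  for (Event E : Events) {
    Live += E;
    assert(Live >= 0 && "released more than was allocated");
    Peak = std::max(Peak, Live);
  }
  return Peak;
}

// Nested: the <{ S, i32 }> block (3004 bytes) is allocated first, then the
// <{ S }> block (3000 bytes) on top of it.
inline std::vector<Event> nestedOrder() {
  return {3004, 3000, -3000, -3004};
}

// Sequential: the block for f2 is allocated and released before the block
// for f1 is allocated.
inline std::vector<Event> sequentialOrder() {
  return {3000, -3000, 3004, -3004};
}
```

Under these assumptions the nested order peaks above one page while the sequential order stays below it, which is the difference the comment above is pointing at.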

For the example in pr27076, another thing clang could do is just avoid inalloca altogether. We shouldn't need it for passing objects that use the default copy constructor.

Thank you both for the review comments!

In D20003#423533, @DavidKreitzer wrote:

> I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

> I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:

The connection I saw between removing the _chkstk calls and this transformation is that the pushes will touch the stack in a safe order (starting at %esp, which should be safe, and progressing downwards), whereas the original stores might be in an order that starts by touching an address beyond the allocated stack.
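The ordering argument can be made concrete with a toy guard-page model. Everything here is an assumption for illustration (the 4096-byte page, the function names, and the simplified rule that touching below the guard page is a fault); it is not how LLVM or Windows implements anything, only a way to see why descending accesses are safe and low-address-first accesses may not be.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t PageSize = 4096; // assumed page size

// Hypothetical model of a guard-page stack. CommittedLow is the lowest
// committed address; the page directly below it is the guard page.
// Touching the guard page commits it and moves the guard one page down.
// Touching anything below the guard page is treated as a fault.
inline bool touchesAreSafe(std::uint32_t CommittedLow,
                           const std::vector<std::uint32_t> &Addrs) {
  assert(CommittedLow % PageSize == 0 && "expected a page-aligned boundary");
  for (std::uint32_t A : Addrs) {
    if (A >= CommittedLow)
      continue; // already committed
    if (A >= CommittedLow - PageSize)
      CommittedLow -= PageSize; // hit the guard page: the stack grows
    else
      return false; // skipped past the guard page
  }
  return true;
}

// Addresses written by pushes: SP-4, SP-8, ..., SP-Bytes.
inline std::vector<std::uint32_t> pushOrder(std::uint32_t SP,
                                            std::uint32_t Bytes) {
  std::vector<std::uint32_t> Addrs;
  for (std::uint32_t Off = 4; Off <= Bytes; Off += 4)
    Addrs.push_back(SP - Off);
  return Addrs;
}

// The same slots written lowest address first, as stores into a
// pre-allocated block might be.
inline std::vector<std::uint32_t> lowToHighOrder(std::uint32_t SP,
                                                 std::uint32_t Bytes) {
  std::vector<std::uint32_t> Addrs = pushOrder(SP, Bytes);
  return std::vector<std::uint32_t>(Addrs.rbegin(), Addrs.rend());
}
```

With a stack pointer 16 bytes above the committed boundary and a 6000-byte block, the push order passes the check and the low-to-high order fails on its first access.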

What worries me about the approach in r262370 is that it doesn't take other stack objects into account (and I think that's why it failed). What if our function has a 3 KB fixed array which hasn't been touched yet, can we then remove a 2 KB _chkstk? And IIUC (but I could be wrong here), we can't rely on checking the size of the stack frame at that stage, because it could increase due to register spills.

I figured tying this to the push-conversion would resolve these concerns in a neat way.

I agree with your point, and Michael's, that it's not clear yet if we could make this work for more complicated calls in a reasonable way. I figured this would be a good patch to start with, but maybe I need to experiment a bit further to see if it will work out.

> The connection I saw between removing the _chkstk calls and this transformation is that the pushes will touch the stack in a safe order (starting at %esp, which should be safe, and progressing downwards), whereas the original stores might be in an order that starts by touching an address beyond the allocated stack.

That's a fair point. You do have to worry about the case where the pushes get prefaced by a "sub esp" which is typically done to pad the outgoing argument block for alignment. I do not know how common that is on IA-32 Windows, but it is a potential concern.

> What worries me about the approach in r262370 is that it doesn't take other stack objects into account (and I think that's why it failed). What if our function has a 3 KB fixed array which hasn't been touched yet, can we then remove a 2 KB _chkstk? And IIUC (but I could be wrong here), we can't rely on checking the size of the stack frame at that stage, because it could increase due to register spills.

The situation you describe is no different than for a call to a non-inalloca function. That is, if the local frame has a 3k object, and the outgoing parameter block size is 2k, we need to call _chkstk. And it is the frame finalization pass that figures this out. IMO, inalloca functions should be handled in the same way. (I admit that isn't quite the same as the approach taken in r262370.) In addition to eliminating the unnecessary _chkstk calls, a fix in frame finalization will let us avoid forcing a frame pointer in routines with inalloca calls. At any rate, I think we need a comprehensive fix. This patch seems like a small band-aid for a gaping wound.

I think special casing inalloca allocas in X86TargetLowering::LowerDYNAMIC_STACKALLOC will be a better start for this.

In D20003#423533, @DavidKreitzer wrote:

> I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

I agree, we should avoid the chkstk at a higher level. Eventually, though, we wanted to do general conversion of inalloca to a sequence of subs, leas, and pushes. I figured that would live here.

> I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:
> ... snip
> See how there are two inalloca calls at the top? They effectively grow the stack by 6000 bytes (which is big enough to require _chkstk even though the separate 3000-byte allocations are not). Neither MSVC nor ICC do it like this. They both allocate space for the call to f2, call f2, and cleanup the stack from calling f2 before allocating space for the call to f1. I think we need to fix clang to do something similar and then David's solution ought to work.

We don't want to try to follow ICC or MSVC here. We want to do the copy elision instead, since it's actually required in C++17. IMO we should detect these large argument allocation cases and probe the stack.

> For the example in pr27076, another thing clang could do is just avoid inalloca altogether. We shouldn't need it for passing objects that use the default copy constructor.

inalloca is used whenever a type is not trivially copyable, meaning it has a non-trivial copy constructor or destructor. My understanding is that we aren't allowed to introduce extra copies of non-trivially copyable objects, so we have to use inalloca here, at least in C++17.
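The distinction can be checked at compile time with the standard type trait. This is a sketch, not a statement of the exact rule clang applies: `std::is_trivially_copyable` is used as a stand-in for the ABI check, and `Pod` and `WithDtor` are hypothetical types added for contrast with the `S` from the example earlier in the thread.

```cpp
#include <type_traits>

// Same shape as struct S from the example earlier in the thread:
// a user-declared copy constructor makes the type non-trivially copyable.
struct S {
  S(const S &);
  char a[3000];
};

// Hypothetical: implicit copy constructor and implicit destructor.
struct Pod {
  char a[3000];
};

// Hypothetical: defaulted copy constructor, but a user-provided destructor.
struct WithDtor {
  WithDtor(const WithDtor &) = default;
  ~WithDtor() {}
  char a[3000];
};

static_assert(!std::is_trivially_copyable<S>::value,
              "user-provided copy constructor");
static_assert(std::is_trivially_copyable<Pod>::value,
              "nothing user-provided");
static_assert(!std::is_trivially_copyable<WithDtor>::value,
              "non-trivial destructor");
```

The `WithDtor` case shows why "uses the default copy constructor" is not sufficient on its own: a non-trivial destructor has the same effect.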

lib/Target/X86/X86CallFrameOptimization.cpp
56	This makes me feel like we should model argument allocation as a single operation instead of five or so. What do you think about changing X86TargetLowering::LowerDYNAMIC_STACKALLOC to emit a new DAG node which selects to a single MI?
349	Yeah, this feels like too much pattern matching.

This revision now requires changes to proceed. May 9 2016, 12:03 PM

Thanks for the input everyone!

I'll start looking into introducing a pseudo-instruction for the dynamic alloca so we can lower it at a point where we have all the information needed to elide the _chkstk call.

lib/Target/X86/X86CallFrameOptimization.cpp
56	X86TargetLowering::LowerDYNAMIC_STACKALLOC already emits a X86ISD::WIN_ALLOCA, but instead of expanding that in X86TargetLowering::EmitLoweredWinAlloca, we could have a pseudo-instruction that gets expanded later -- when we know the size of the stack frame and can elide it.
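One way to picture the deferred expansion is the toy model below. It is a sketch under stated assumptions, not the design of any actual LLVM pass: the `Lowering` enum, `expandAllocas`, the 4096-byte probe size, and the conservative rule that freshly allocated bytes count as untouched are all invented for this illustration. Each pending allocation is lowered to a plain stack adjustment when the untouched region stays under one page, and to a probe call otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t ProbeSize = 4096; // assumed probe granularity

// Hypothetical lowering choices for one dynamic allocation.
enum class Lowering { PlainSub, ProbeCall };

// FrameBytes is the final size of the fixed frame, assumed (worst case) to
// be entirely untouched. AllocaBytes lists the allocation sizes in order.
inline std::vector<Lowering>
expandAllocas(std::uint64_t FrameBytes,
              const std::vector<std::uint64_t> &AllocaBytes) {
  std::vector<Lowering> Result;
  std::uint64_t Untouched = FrameBytes;
  for (std::uint64_t Bytes : AllocaBytes) {
    if (Untouched + Bytes >= ProbeSize) {
      Result.push_back(Lowering::ProbeCall);
      Untouched = 0; // the probe touched everything down to the new SP
    } else {
      Result.push_back(Lowering::PlainSub);
      Untouched += Bytes; // conservatively treat the new bytes as untouched
    }
  }
  assert(Result.size() == AllocaBytes.size());
  return Result;
}
```

In this model the 3 KB array plus 2 KB allocation case raised earlier in the thread still gets a probe, while small argument blocks in a small frame do not.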

hans mentioned this in D20263: X86: Avoid using _chkstk when lowering WIN_ALLOCA instructions. May 16 2016, 4:39 PM

Revision Contents

Path	Size
lib/Target/X86/X86CallFrameOptimization.cpp	189 lines
test/CodeGen/X86/inalloca-callframeopt.ll	96 lines
test/CodeGen/X86/inalloca-stdcall.ll	6 lines
test/CodeGen/X86/inalloca.ll	2 lines

Diff 56368

lib/Target/X86/X86CallFrameOptimization.cpp

[14 lines not shown]
// the transformation is performed pre-reg-alloc, it can help relieve		// the transformation is performed pre-reg-alloc, it can help relieve
// register pressure.		// register pressure.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include <algorithm>		#include <algorithm>

#include "X86.h"		#include "X86.h"
		#include "X86InstrBuilder.h"
#include "X86InstrInfo.h"		#include "X86InstrInfo.h"
#include "X86MachineFunctionInfo.h"		#include "X86MachineFunctionInfo.h"
#include "X86Subtarget.h"		#include "X86Subtarget.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineModuleInfo.h"		#include "llvm/CodeGen/MachineModuleInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
[15 lines not shown]
namespace {		namespace {
class X86CallFrameOptimization : public MachineFunctionPass {		class X86CallFrameOptimization : public MachineFunctionPass {
public:		public:
X86CallFrameOptimization() : MachineFunctionPass(ID) {}		X86CallFrameOptimization() : MachineFunctionPass(ID) {}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

private:		private:
// Information we know about a particular call site		// Information about the setup for an inalloca call.
		struct InAllocaInfo {
		[Inline comment] rnk: This makes me feel like we should model argument allocation as a single operation instead of five or so. What do you think about changing X86TargetLowering::LowerDYNAMIC_STACKALLOC to emit a new DAG node which selects to a single MI?
		[Inline comment] hans: X86TargetLowering::LowerDYNAMIC_STACKALLOC already emits a X86ISD::WIN_ALLOCA, but instead of expanding that in X86TargetLowering::EmitLoweredWinAlloca, we could have a pseudo-instruction that gets expanded later -- when we know the size of the stack frame and can elide it.
		// Frame setup for the _chkstk call.
		MachineBasicBlock::iterator FrameSetup;

		// Move of _chkstk amount into virtual register.
		MachineBasicBlock::iterator AmountInstr;

		// Move of virtual register with _chkstk amount into %eax.
		MachineBasicBlock::iterator AmountMov;

		// Call to _chkstk.
		MachineBasicBlock::iterator ChkstkCall;

		// Copy of _chkstk result into virtual register.
		MachineBasicBlock::iterator ChkstkResultCopy;

		// Frame destroy for the _chkstk call.
		MachineBasicBlock::iterator FrameDestroy;
		};

		// Information we know about a particular call site.
struct CallContext {		struct CallContext {
CallContext()		CallContext()
: FrameSetup(nullptr), Call(nullptr), SPCopy(nullptr), ExpectedDist(0),		: FrameSetup(nullptr), IsInAlloca(false), Call(nullptr), SPCopy(nullptr),
MovVector(4, nullptr), NoStackParams(false), UsePush(false) {}		ExpectedDist(0), MovVector(4, nullptr), NoStackParams(false),
		UsePush(false) {}

// Iterator referring to the frame setup instruction		// Iterator referring to the frame setup instruction.
MachineBasicBlock::iterator FrameSetup;		MachineBasicBlock::iterator FrameSetup;

// Actual call instruction		// Whether this is an inalloca call.
		bool IsInAlloca;

		// If this is an inalloca call, information about the setup for that.
		InAllocaInfo InAllocaSetup;

		// Actual call instruction.
MachineInstr *Call;		MachineInstr *Call;

// A copy of the stack pointer		// A copy of the stack pointer.
MachineInstr *SPCopy;		MachineInstr *SPCopy;

// The total displacement of all passed parameters		// The total displacement of all passed parameters.
int64_t ExpectedDist;		int64_t ExpectedDist;

// The sequence of movs used to pass the parameters		// The sequence of movs used to pass the parameters.
SmallVector<MachineInstr *, 4> MovVector;		SmallVector<MachineInstr *, 4> MovVector;

// True if this call site has no stack parameters		// True if this call site has no stack parameters.
bool NoStackParams;		bool NoStackParams;

// True if this call site can use push instructions		// True if this call site can use push instructions.
bool UsePush;		bool UsePush;
};		};

typedef SmallVector<CallContext, 8> ContextVector;		typedef SmallVector<CallContext, 8> ContextVector;

bool isLegal(MachineFunction &MF);		bool isLegal(MachineFunction &MF);

bool isProfitable(MachineFunction &MF, ContextVector &CallSeqMap);		bool isProfitable(MachineFunction &MF, ContextVector &CallSeqMap);

		bool matchInAlloca(MachineBasicBlock::iterator &I, InAllocaInfo &Info,
		unsigned int &MaxAdjust);

void collectCallInfo(MachineFunction &MF, MachineBasicBlock &MBB,		void collectCallInfo(MachineFunction &MF, MachineBasicBlock &MBB,
MachineBasicBlock::iterator I, CallContext &Context);		MachineBasicBlock::iterator I, CallContext &Context);

void adjustCallSequence(MachineFunction &MF, const CallContext &Context);		void adjustCallSequence(MachineFunction &MF, const CallContext &Context);

MachineInstr *canFoldIntoRegPush(MachineBasicBlock::iterator FrameSetup,		MachineInstr *canFoldIntoRegPush(MachineBasicBlock::iterator FrameSetup,
unsigned Reg);		unsigned Reg);

[213 lines not shown]
if (MO.isDef()) {		if (MO.isDef()) {
if (RegInfo.regsOverlap(Reg, U))		if (RegInfo.regsOverlap(Reg, U))
return Exit;		return Exit;
}		}
}		}

return Skip;		return Skip;
}		}

		bool X86CallFrameOptimization::matchInAlloca(MachineBasicBlock::iterator &I,
		[Inline comment] rnk: Yeah, this feels like too much pattern matching.
		InAllocaInfo &Info,
		unsigned int &MaxAdjust) {
		// inalloca is only expected to occur in 32-bit code.
		if (STI->is64Bit())
		return false;

		assert(I->getOpcode() == TII->getCallFrameSetupOpcode());

		// FrameSetup for the _chkstk call.
		if (I->getOperand(0).getImm() != 0 || I->getOperand(1).getImm() != 0)
		return false;
		Info.FrameSetup = I++;

		// Often, there's an instruction here that moves the _chkstk amount into a
		[Inline comment] mkuper: Is this often or always? It seems like we're only going to match the vreg case, right?
		// virtual register. Ignore it for now; we'll look into that below.
		if (I->getOpcode() == X86::MOV32ri)
		I++;

		// Match move of virtual register to %eax, the _chkstk argument.
		if (!I->isCopy() || !I->getOperand(0).isReg() || !I->getOperand(1).isReg() ||
		I->getOperand(0).getReg() != X86::EAX)
		return false;
		Info.AmountMov = I++;

		// Get the definition of that virtual register.
		unsigned ChkstkAmountVreg = Info.AmountMov->getOperand(1).getReg();
		MachineInstr *Def = MRI->getUniqueVRegDef(ChkstkAmountVreg);
		if (!Def || Def->getOpcode() != X86::MOV32ri || !Def->getOperand(1).isImm())
		return false;
		Info.AmountInstr = Def;
		MaxAdjust = Def->getOperand(1).getImm() >> Log2SlotSize;

		// Match call to chkstk.
		if (!I->isCall() || !I->getOperand(0).isSymbol() ||
		StringRef(I->getOperand(0).getSymbolName()) != "_chkstk")
		[Inline comment] mkuper: We use _alloca() for cygwin/mingw. Handling only _chkstk sounds reasonable, but I'm not sure I'm a fan of having this string hardcoded here. Perhaps factor the code that decides on the symbol name out of X86FrameLowering::emitStackProbeCall() and call that here as well?
		return false;
		Info.ChkstkCall = I++;

		// Match copy of %esp (the result of _chkstk) to a register.
		if (!I->isCopy() || !I->getOperand(0).isReg() || !I->getOperand(1).isReg() ||
		I->getOperand(1).getReg() != X86::ESP)
		return false;
		Info.ChkstkResultCopy = I++;

		// Match FrameDestroy for _chkstk call.
		if (I->getOpcode() != TII->getCallFrameDestroyOpcode() ||
		I->getOperand(0).getImm() != 0 \|\| I->getOperand(1).getImm() != 0)
		return false;
		Info.FrameDestroy = I++;

		return true;
		}

void X86CallFrameOptimization::collectCallInfo(MachineFunction &MF,		void X86CallFrameOptimization::collectCallInfo(MachineFunction &MF,
MachineBasicBlock &MBB,		MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
CallContext &Context) {		CallContext &Context) {
// Check that this particular call sequence is amenable to the		// Check that this particular call sequence is amenable to the
// transformation.		// transformation.
const X86RegisterInfo &RegInfo =		const X86RegisterInfo &RegInfo =
*static_cast<const X86RegisterInfo *>(STI->getRegisterInfo());		*static_cast<const X86RegisterInfo *>(STI->getRegisterInfo());
		unsigned FrameSetupOpcode = TII->getCallFrameSetupOpcode();
unsigned FrameDestroyOpcode = TII->getCallFrameDestroyOpcode();		unsigned FrameDestroyOpcode = TII->getCallFrameDestroyOpcode();

// We expect to enter this at the beginning of a call sequence		// We expect to enter this at the beginning of a call sequence
assert(I->getOpcode() == TII->getCallFrameSetupOpcode());		assert(I->getOpcode() == FrameSetupOpcode);
MachineBasicBlock::iterator FrameSetup = I++;		MachineBasicBlock::iterator FrameSetup = I++;
Context.FrameSetup = FrameSetup;		Context.FrameSetup = FrameSetup;

// How much do we adjust the stack? This puts an upper bound on		// How much do we adjust the stack? This puts an upper bound on
// the number of parameters actually passed on it.		// the number of parameters actually passed on it.
unsigned int MaxAdjust =		unsigned int MaxAdjust =
FrameSetup->getOperand(0).getImm() >> Log2SlotSize;		FrameSetup->getOperand(0).getImm() >> Log2SlotSize;

// A zero adjustment means no stack parameters
if (!MaxAdjust) {		if (!MaxAdjust) {
		// It might be an inalloca call. Back up and check for that.
		--I;
		if (matchInAlloca(I, Context.InAllocaSetup, MaxAdjust)) {
		Context.IsInAlloca = true;
		} else {
		// Otherwise, a zero adjustment means no stack parameters.
Context.NoStackParams = true;		Context.NoStackParams = true;
return;		return;
}		}
		}

// For globals in PIC mode, we can have some LEAs here.		// For globals in PIC mode, we can have some LEAs here.
// Ignore them, they don't bother us.		// Ignore them, they don't bother us.
// TODO: Extend this to something that covers more cases.		// TODO: Extend this to something that covers more cases.
while (I->getOpcode() == X86::LEA32r)		while (I->getOpcode() == X86::LEA32r)
++I;		++I;

		unsigned StackPtr;

		if (Context.IsInAlloca) {
		StackPtr = Context.InAllocaSetup.ChkstkResultCopy->getOperand(0).getReg();
		} else {
// We expect a copy instruction here.		// We expect a copy instruction here.
// TODO: The copy instruction is a lowering artifact.		// TODO: The copy instruction is a lowering artifact.
// We should also support a copy-less version, where the stack		// We should also support a copy-less version, where the stack
// pointer is used directly.		// pointer is used directly.
if (!I->isCopy() || !I->getOperand(0).isReg())		if (!I->isCopy() || !I->getOperand(0).isReg())
return;		return;
Context.SPCopy = I++;		Context.SPCopy = I++;
		StackPtr = Context.SPCopy->getOperand(0).getReg();
unsigned StackPtr = Context.SPCopy->getOperand(0).getReg();		}

// Scan the call setup sequence for the pattern we're looking for.		// Scan the call setup sequence for the pattern we're looking for.
// We only handle a simple case - a sequence of store instructions that		// We only handle a simple case - a sequence of store instructions that
		[Inline comment] mkuper: As long as you're touching this - the comment is no longer correct. Could you please delete it?
// push a sequence of stack-slot-aligned values onto the stack, with		// push a sequence of stack-slot-aligned values onto the stack, with
// no gaps between them.		// no gaps between them.
if (MaxAdjust > 4)		if (MaxAdjust > 4)
Context.MovVector.resize(MaxAdjust, nullptr);		Context.MovVector.resize(MaxAdjust, nullptr);

InstClassification Classification;		InstClassification Classification;
DenseSet<unsigned int> UsedRegs;		DenseSet<unsigned int> UsedRegs;

[45 lines not shown]
for (const MachineOperand &MO : I->uses()) {		for (const MachineOperand &MO : I->uses()) {
unsigned int Reg = MO.getReg();		unsigned int Reg = MO.getReg();
if (RegInfo.isPhysicalRegister(Reg))		if (RegInfo.isPhysicalRegister(Reg))
UsedRegs.insert(Reg);		UsedRegs.insert(Reg);
}		}

++I;		++I;
}		}

		// For an inalloca call, the FrameSetup instruction for the call is here.
		if (Context.IsInAlloca) {
		if (I == MBB.end() || I->getOpcode() != FrameSetupOpcode)
		return;
		Context.FrameSetup = I++;
		}

// We now expect the end of the sequence. If we stopped early,		// We now expect the end of the sequence. If we stopped early,
// or reached the end of the block without finding a call, bail.		// or reached the end of the block without finding a call, bail.
if (I == MBB.end() || !I->isCall())		if (I == MBB.end() || !I->isCall())
return;		return;

Context.Call = I;		Context.Call = I;
if ((++I)->getOpcode() != FrameDestroyOpcode)		if ((++I)->getOpcode() != FrameDestroyOpcode)
return;		return;
[15 lines not shown]
for (; MMI != MME; ++MMI)		for (; MMI != MME; ++MMI)
if (*MMI != nullptr)		if (*MMI != nullptr)
return;		return;

Context.UsePush = true;		Context.UsePush = true;
}		}

void X86CallFrameOptimization::adjustCallSequence(MachineFunction &MF,		void X86CallFrameOptimization::adjustCallSequence(MachineFunction &MF,
const CallContext &Context) {		const CallContext &Context) {
		if (Context.IsInAlloca) {
		// Move the FrameSetup instruction for the call to before the moves.
		assert(Context.MovVector.size() > 0 && "No moves?");
		auto *MBB = Context.MovVector[0]->getParent();
		MBB->insert(Context.MovVector[0], Context.FrameSetup->removeFromParent());
		}

// Ok, we can in fact do the transformation for this call.		// Ok, we can in fact do the transformation for this call.
// Do not remove the FrameSetup instruction, but adjust the parameters.		// Do not remove the FrameSetup instruction, but adjust the parameters.
// PEI will end up finalizing the handling of this.		// PEI will end up finalizing the handling of this.
MachineBasicBlock::iterator FrameSetup = Context.FrameSetup;		MachineBasicBlock::iterator FrameSetup = Context.FrameSetup;
MachineBasicBlock &MBB = *(FrameSetup->getParent());		MachineBasicBlock &MBB = *(FrameSetup->getParent());
FrameSetup->getOperand(1).setImm(Context.ExpectedDist);		FrameSetup->getOperand(1).setImm(Context.ExpectedDist);

DebugLoc DL = FrameSetup->getDebugLoc();		DebugLoc DL = FrameSetup->getDebugLoc();
[71 lines not shown]
for (int Idx = (Context.ExpectedDist >> Log2SlotSize) - 1; Idx >= 0; --Idx) {		for (int Idx = (Context.ExpectedDist >> Log2SlotSize) - 1; Idx >= 0; --Idx) {
if (!TFL->hasFP(MF))		if (!TFL->hasFP(MF))
TFL->BuildCFI(		TFL->BuildCFI(
MBB, std::next(Push), DL,		MBB, std::next(Push), DL,
MCCFIInstruction::createAdjustCfaOffset(nullptr, SlotSize));		MCCFIInstruction::createAdjustCfaOffset(nullptr, SlotSize));

MBB.erase(MOV);		MBB.erase(MOV);
}		}

		if (!Context.IsInAlloca) {
// The stack-pointer copy is no longer used in the call sequences.		// The stack-pointer copy is no longer used in the call sequences.
// There should not be any other users, but we can't commit to that, so:		// There should not be any other users, but we can't commit to that, so:
if (MRI->use_empty(Context.SPCopy->getOperand(0).getReg()))		if (MRI->use_empty(Context.SPCopy->getOperand(0).getReg()))
Context.SPCopy->eraseFromParent();		Context.SPCopy->eraseFromParent();
		} else {
		// Remove the incalloca call setup (in reverse to delete uses before defs).
		[Inline comment] mkuper: incalloca -> inalloca
		Context.InAllocaSetup.FrameDestroy->eraseFromParent();

		unsigned ChkstkRes =
		Context.InAllocaSetup.ChkstkResultCopy->getOperand(0).getReg();
		if (!MRI->use_empty(ChkstkRes)) {
		// Something is using the result of _chkstk. Provide a replacement.
		unsigned NewReg = MRI->createVirtualRegister(&X86::GR32RegClass);
		addRegOffset(BuildMI(MBB, Context.InAllocaSetup.ChkstkResultCopy, DL,
		TII->get(X86::LEA32r), NewReg),
		X86::ESP, false, -Context.ExpectedDist);
		MRI->replaceRegWith(ChkstkRes, NewReg);
		}
		Context.InAllocaSetup.ChkstkResultCopy->eraseFromParent();

		Context.InAllocaSetup.ChkstkCall->eraseFromParent();
		Context.InAllocaSetup.AmountMov->eraseFromParent();
		[Inline comment] mkuper: Is it possible for ChkstkResultCopy and AmountMov to have more uses? I don't see why that would happen, but I don't see anything completely preventing it either.
		if (MRI->use_empty(
		Context.InAllocaSetup.AmountInstr->getOperand(0).getReg())) {
		Context.InAllocaSetup.AmountInstr->eraseFromParent();
		}
		Context.InAllocaSetup.FrameSetup->eraseFromParent();
		}

// Once we've done this, we need to make sure PEI doesn't assume a reserved		// Once we've done this, we need to make sure PEI doesn't assume a reserved
// frame.		// frame.
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();		X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
FuncInfo->setHasPushSequences(true);		FuncInfo->setHasPushSequences(true);
}		}

MachineInstr *X86CallFrameOptimization::canFoldIntoRegPush(		MachineInstr *X86CallFrameOptimization::canFoldIntoRegPush(
[35 lines not shown]

test/CodeGen/X86/inalloca-callframeopt.ll

This file was added.

				; RUN: llc < %s -mtriple=i686-pc-win32 | FileCheck %s

				%struct.S = type { i32 }
				declare void @f(<{ %struct.S }>* inalloca)

				define void @basic() {
				entry:
				%argmem = alloca inalloca <{ %struct.S }>, align 4
				%x = getelementptr inbounds <{ %struct.S }>, <{ %struct.S }>* %argmem, i32 0, i32 0, i32 0
				store i32 42, i32* %x, align 4
				call void @f(<{ %struct.S }>* inalloca %argmem)
				ret void

				; CHECK-LABEL: basic:
				; TODO: We've removed the dynamic alloca; make frame pointer omission possible.
				; CHECK: pushl %ebp
				; CHECK-NEXT: movl %esp, %ebp
				; CHECK-NOT: calll __chkstk
				; CHECK-NOT: movl $42
				; CHECK-NEXT: pushl $42
				; CHECK-NEXT: calll _f
				; CHECK-NEXT: movl %ebp, %esp
				; CHECK-NEXT: popl %ebp
				; CHECK-NEXT: retl
				}


				%struct.T = type { i32*, [1 x i32] }
				declare void @g(<{ %struct.T }>* inalloca)

				define void @stack_reference_arg() {
				entry:
				%argmem = alloca inalloca <{ %struct.T }>, align 4
				%arrayinit.begin.i = getelementptr inbounds <{ %struct.T }>, <{ %struct.T }>* %argmem, i32 0, i32 0, i32 1, i32 0
				store i32 1, i32* %arrayinit.begin.i, align 4
				%p.i = getelementptr inbounds <{ %struct.T }>, <{ %struct.T }>* %argmem, i32 0, i32 0, i32 0
				store i32* %arrayinit.begin.i, i32** %p.i, align 4
				call void @g(<{ %struct.T }>* inalloca %argmem)
				ret void

				; One of the arguments is a pointer into the call frame.
				; FIXME: It would be cool if we could fold away the add instruction.
				; CHECK-LABEL: stack_reference_arg:
				; CHECK: pushl %ebp
				; CHECK-NEXT: movl %esp, %ebp
				; CHECK-NEXT: leal -8(%esp), %eax
				; CHECK-NEXT: addl $4, %eax
				; CHECK-NEXT: pushl $1
				; CHECK-NEXT: pushl %eax
				; CHECK-NEXT: calll _g
				; CHECK-NEXT: movl %ebp, %esp
				; CHECK-NEXT: popl %ebp
				; CHECK-NEXT: retl
				}


				define void @two_calls() {
				entry:
				%inalloca.save = tail call i8* @llvm.stacksave()
				%argmem = alloca inalloca <{ %struct.S }>, align 4
				%x = getelementptr inbounds <{ %struct.S }>, <{ %struct.S }>* %argmem, i32 0, i32 0, i32 0
				store i32 42, i32* %x, align 4
				call void @f(<{ %struct.S }>* inalloca nonnull %argmem)
				call void @llvm.stackrestore(i8* %inalloca.save)
				%argmem3 = alloca inalloca <{ %struct.S }>, align 4
				%x2 = getelementptr inbounds <{ %struct.S }>, <{ %struct.S }>* %argmem3, i32 0, i32 0, i32 0
				store i32 42, i32* %x2, align 4
				call void @f(<{ %struct.S }>* inalloca nonnull %argmem3)
				ret void

				; Two inalloca calls after eachother. The vreg used for the _chkstk argument
				; is shared between them.
				; FIXME: Clang puts stacksave/restore around the first call, which are
				; redundant since the stack is adjusted back after the call anyway. If
				; we wanted to be aggressive, we could even skip adjusting the stack
				; back between the calls.
				; CHECK-LABEL: two_calls
				; CHECK: pushl %ebp
				; CHECK-NEXT: movl %esp, %ebp
				; CHECK-NEXT: pushl %esi
				; CHECK-NEXT: movl %esp, %esi
				; CHECK-NEXT: pushl $42
				; CHECK-NEXT: calll _f
				; CHECK-NEXT: addl $4, %esp
				; CHECK-NEXT: movl %esi, %esp
				; CHECK-NEXT: pushl $42
				; CHECK-NEXT: calll _f
				; CHECK-NEXT: leal -4(%ebp), %esp
				; CHECK-NEXT: popl %esi
				; CHECK-NEXT: popl %ebp
				; CHECK-NEXT: retl
				}


				declare i8* @llvm.stacksave()
				declare void @llvm.stackrestore(i8*)

test/CodeGen/X86/inalloca-stdcall.ll

	; RUN: llc < %s -mtriple=i686-pc-win32 | FileCheck %s			; RUN: llc < %s -mtriple=i686-pc-win32 -no-x86-call-frame-opt | FileCheck %s

	%Foo = type { i32, i32 }			%Foo = type { i32, i32 }

	declare x86_stdcallcc void @f(%Foo* inalloca %a)			declare x86_stdcallcc void @f(%Foo* inalloca %a)
	declare x86_stdcallcc void @i(i32 %a)			declare x86_stdcallcc void @i(i32 %a)

	define void @g() {			define void @g() {
	; CHECK-LABEL: _g:			; CHECK-LABEL: _g:
	%b = alloca inalloca %Foo			%b = alloca inalloca %Foo
	; CHECK: movl $8, %eax			; CHECK: movl $8, %eax
	; CHECK: calll __chkstk			; CHECK: calll __chkstk
	%f1 = getelementptr %Foo, %Foo* %b, i32 0, i32 0			%f1 = getelementptr %Foo, %Foo* %b, i32 0, i32 0
	%f2 = getelementptr %Foo, %Foo* %b, i32 0, i32 1			%f2 = getelementptr %Foo, %Foo* %b, i32 0, i32 1
	store i32 13, i32* %f1			store i32 13, i32* %f1
	store i32 42, i32* %f2			store i32 42, i32* %f2
	; CHECK: movl %esp, %eax			; CHECK: movl %esp, %eax
	; CHECK: movl $13, (%eax)			; CHECK: movl $13, (%eax)
	; CHECK: movl $42, 4(%eax)			; CHECK: movl $42, 4(%eax)
	call x86_stdcallcc void @f(%Foo* inalloca %b)			call x86_stdcallcc void @f(%Foo* inalloca %b)
	; CHECK: calll _f@8			; CHECK: calll _f@8
	; CHECK-NOT: %esp			; CHECK: subl $4, %esp
	; CHECK: pushl			; CHECK: movl $0, (%esp)
	; CHECK: calll _i@4			; CHECK: calll _i@4
	call x86_stdcallcc void @i(i32 0)			call x86_stdcallcc void @i(i32 0)
	ret void			ret void
	}			}

test/CodeGen/X86/inalloca.ll

	; RUN: llc < %s -mtriple=i686-pc-win32 | FileCheck %s			; RUN: llc < %s -mtriple=i686-pc-win32 -no-x86-call-frame-opt | FileCheck %s
				[Inline comment] mkuper: Are you sure you want to have these tests run only with -no-x86-call-frame-opt? This way we don't actually cover the "normal" case of x86-call-frame-opt + thiscall, etc.

	%Foo = type { i32, i32 }			%Foo = type { i32, i32 }

	declare void @f(%Foo* inalloca %b)			declare void @f(%Foo* inalloca %b)

	define void @a() {			define void @a() {
	; CHECK-LABEL: _a:			; CHECK-LABEL: _a:
	entry:			entry:
	[56 lines not shown]