Page MenuHomePhabricator

X86CallFrameOpt: a first step towards optimizing inalloca calls (PR27076)
AbandonedPublic

Authored by hans on May 5 2016, 4:30 PM.

Details

Summary

Using pushes to move arguments into the stack results in significantly smaller code. We can also remove the _chkstk call, as the pushes probe the stack naturally.

This patch only covers basic cases where there are no complicated instructions in the call sequence. Inalloca calls often have e.g. nested calls or control flow in the call sequence, so in practice this patch doesn't fire a lot, but it's a start.

Please take a look.

Diff Detail

Event Timeline

hans updated this revision to Diff 56368.May 5 2016, 4:30 PM
hans retitled this revision from to X86CallFrameOpt: a first step towards optimizing inalloca calls (PR27076).
hans updated this object.
hans added reviewers: rnk, mkuper, DavidKreitzer.
hans added a subscriber: llvm-commits.
mkuper edited edge metadata.May 6 2016, 12:12 AM

Hi Hans,
I have a few small comments, none of them very important.

Regarding the general direction - are you sure this can be extended to the more complicated cases in a sane way?
The reason I went in this direction for the regular mov -> push transformation was that the amount of different patterns we actually generate for call sequences is pretty limited. So it seemed a dumb linear instruction search will be able to cover most cases. There were some exceptions - I think one was a pseudo instruction expansion inside the call sequence that created control flow, which meant the FrameSetup and FrameDestroy ended up in different basic blocks. But this was very rare.
I'm not really familiar with the IR we generate for inalloca, but the nested stuff seems scary.

lib/Target/X86/X86CallFrameOptimization.cpp
363

Is this often or always? It seems like we're only going to match the vreg case, right?

384

We use _alloca() for cygwin/mingw.
Handling only _chkstk sounds reasonable, but I'm not sure I'm a fan of having this string hardcoded here. Perhaps factor the code that decides on the symbol name out of X86FrameLowering::emitStackProbeCall() and call that here as well?

458

As long as you're touching this - the comment is no longer correct. Could you please delete it?

658

incalloca -> inalloca

674

Is it possible for ChkstkResultCopy and AmountMov to have more uses?
I don't see why that would happen, but I don't see anything completely preventing it either.

test/CodeGen/X86/inalloca.ll
1

Are you sure you want to have these tests run only with -no-x86-call-frame-opt?
This way we don't actually cover the "normal" case of x86-call-frame-opt + thiscall, etc.

DavidKreitzer edited edge metadata.May 6 2016, 8:34 AM

Hi Hans,

I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:

struct S
{
    S(const S&);
    char a[3000];
};

void f1(S, int);
int f2(S);
void f3(S *s)
{
  f1(*s, f2(*s));
}

We get this from clang:

define void @"\01?f3@@YAXPAUS@@@Z"(%struct.S* %s) #0 {
entry:
  %argmem4 = alloca inalloca <{ %struct.S, i32 }>, align 4
  %inalloca.save1 = tail call i8* @llvm.stacksave()
  %argmem = alloca inalloca <{ %struct.S }>, align 4
  %0 = getelementptr inbounds <{ %struct.S }>, <{ %struct.S }>* %argmem, i32 0, 
i32 0
  %call = call x86_thiscallcc %struct.S* @"\01??0S@@QAE@ABU0@@Z"(%struct.S* %0, 
%struct.S* dereferenceable(3000) %s)
  %call2 = call i32 @"\01?f2@@YAHUS@@@Z"(<{ %struct.S }>* inalloca nonnull %argm
em)
  call void @llvm.stackrestore(i8* %inalloca.save1)
  %1 = getelementptr inbounds <{ %struct.S, i32 }>, <{ %struct.S, i32 }>* %argme
m4, i32 0, i32 0
  %call3 = call x86_thiscallcc %struct.S* @"\01??0S@@QAE@ABU0@@Z"(%struct.S* %1,
 %struct.S* dereferenceable(3000) %s)
  %2 = getelementptr inbounds <{ %struct.S, i32 }>, <{ %struct.S, i32 }>* %argme
m4, i32 0, i32 1
  store i32 %call2, i32* %2, align 4, !tbaa !1
  call void @"\01?f1@@YAXUS@@H@Z"(<{ %struct.S, i32 }>* inalloca nonnull %argmem
4)
  ret void
}

See how there are two inalloca calls at the top? They effectively grow the stack by 6000 bytes (which is big enough to require _chkstk even though the separate 3000-byte allocations are not). Neither MSVC nor ICC do it like this. They both allocate space for the call to f2, call f2, and cleanup the stack from calling f2 before allocating space for the call to f1. I think we need to fix clang to do something similar and then David's solution ought to work.

For the example in pr27076, another thing clang could do is just avoid inalloca altogether. We shouldn't need it for passing objects that use the default copy constructor.

hans added a comment.May 6 2016, 11:09 AM

Thank you both for the review comments!

I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:

The connection I saw between removing the _chkstk calls and this transformation is that the pushes will touch the stack in a safe order (starting at %esp, which should be safe, and progressing downwards), whereas the original stores might be in an order that starts by touching an address beyond the allocated stack.

What worries me about the approach in r262370 is that it doesn't take other stack objects into account (and I think that's why it failed). What if our function has a 3 KB fixed array which hasn't been touched yet, can we then remove a 2 KB _chkstk? And IIUC (but I could be wrong here), we can't rely on checking the size of the stack frame at that stage, because it could increase due to register spills.

I figured tying this to the push-conversion would resolve these concerns in a neat way.

I agree with your point, and Michael's, that it's not clear yet if we could make this work for more complicated calls in a reasonable way. I figured this would be a good patch to start with, but maybe I need to experiment a bit further to see if it will work out.

The connection I saw between removing the _chkstk calls and this transformation is that the pushes will touch the stack in a safe order (starting at %esp, which should be safe, and progressing downwards), whereas the original stores might be in an order that starts by touching an address beyond the allocated stack.

That's a fair point. You do have to worry about the case where the pushes get prefaced by a "sub esp" which is typically done to pad the outgoing argument block for alignment. I do not know how common that is on IA-32 Windows, but it is a potential concern.

What worries me about the approach in r262370 is that it doesn't take other stack objects into account (and I think that's why it failed). What if our function has a 3 KB fixed array which hasn't been touched yet, can we then remove a 2 KB _chkstk? And IIUC (but I could be wrong here), we can't rely on checking the size of the stack frame at that stage, because it could increase due to register spills.

The situation you describe is no different than for a call to a non-inalloca function. That is, if the local frame has a 3k object, and the outgoing parameter block size is 2k, we need to call _chkstk. And it is the frame finalization pass that figures this out. IMO, inalloca functions should be handled in the same way. (I admit that isn't quite the same as the approach taken in r262370.) In addition to eliminating the unnecessary _chkstk calls, a fix in frame finalization will let us avoid forcing a frame pointer in routines with inalloca calls. At any rate, I think we need a comprehensive fix. This patch seems like a small band-aid for a gaping wound.

rnk requested changes to this revision.May 9 2016, 12:03 PM
rnk edited edge metadata.

I think special casing inalloca allocas in X86TargetLowering::LowerDYNAMIC_STACKALLOC will be a better start for this.


I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.

I agree, we should avoid the chkstk at a higher level. Eventually, though, we wanted to do general conversion of inalloca to a sequence of subs, leas, and pushes. I figured that would live here.

I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:
... snip
See how there are two inalloca calls at the top? They effectively grow the stack by 6000 bytes (which is big enough to require _chkstk even though the separate 3000-byte allocations are not). Neither MSVC nor ICC do it like this. They both allocate space for the call to f2, call f2, and cleanup the stack from calling f2 before allocating space for the call to f1. I think we need to fix clang to do something similar and then David's solution ought to work.

We don't want to try to follow ICC or MSVC here. We want to do the copy elision instead, since it's actually required in C++17. IMO we should detect these large argument allocation cases and probe the stack.

For the example in pr27076, another thing clang could do is just avoid inalloca altogether. We shouldn't need it for passing objects that use the default copy constructor.

inalloca is used whenever a type is not trivially copyable, meaning it has a non-trivial copy constructor or destructor. My understanding is that we aren't allowed to introduce extra copies of non-trivially copyable objects, so we have to use inalloca here, at least in C++17.

lib/Target/X86/X86CallFrameOptimization.cpp
56

This makes me feel like we should model argument allocation as a single operation instead of five or so. What do you think about changing X86TargetLowering::LowerDYNAMIC_STACKALLOC to emit a new DAG node which selects to a single MI?

349

Yeah, this feels like too much pattern matching.

This revision now requires changes to proceed.May 9 2016, 12:03 PM
hans abandoned this revision.May 10 2016, 5:35 PM

Thanks for the input everyone!

I'll start looking into introducing a pseudo-instruction for the dynamic alloca so we can lower it at a point where we have all the information needed to elide the _chkstk call.

lib/Target/X86/X86CallFrameOptimization.cpp
56

X86TargetLowering::LowerDYNAMIC_STACKALLOC already emits a X86ISD::WIN_ALLOCA, but instead of expanding that in X86TargetLowering::EmitLoweredWinAlloca, we could have a pseudo-instruction that gets expanded later -- when we know the size of the stack frame and can elide it.