This is an archive of the discontinued LLVM Phabricator instance.

X86: Avoid using _chkstk when lowering WIN_ALLOCA instructions
ClosedPublic

Authored by hans on May 13 2016, 4:40 PM.

Details

Summary

This patch moves the expansion of WIN_ALLOCA pseudo-instructions into a separate pass that walks the CFG and lowers the instructions based on a conservative estimate of the offset between the stack pointer and the lowest accessed stack address.

The goal is to reduce binary size and run-time costs by removing calls to _chkstk.

Diff Detail

Event Timeline

hans updated this revision to Diff 57268.May 13 2016, 4:40 PM
hans retitled this revision from to X86: Avoid using _chkstk when lowering WIN_ALLOCA instructions.
hans updated this object.
hans added reviewers: rnk, DavidKreitzer, mkuper.
hans added a subscriber: llvm-commits.
mkuper added inline comments.May 13 2016, 5:11 PM
lib/Target/X86/X86.h
62

I think this needs a description. :-)

lib/Target/X86/X86ISelLowering.cpp
16614

Do we still need a CopyToReg? The previous code had it because it needed specifically a copy to EAX glued to the future-chkstk. Can you use Size directly?

lib/Target/X86/X86WinAllocaExpander.cpp
121

We have a ReversePostOrderTraversal<> in ADT. Can you use that, or is it inappropriate?

167

Are you sure this is conservative enough?
Perhaps blacklist anything that defs the stack pointer, except for a specific whitelist (calls, pushes, pops...) that touches the tip? Or would we get a huge whitelist?

264

StackProbeSize = MF.... ?
(Perhaps add a test for this?)

rnk added inline comments.May 13 2016, 5:12 PM
include/llvm/CodeGen/MachineFrameInfo.h
282 ↗(On Diff #57268)

I think WIN_ALLOCA is x86-specific, so a better place for this would be lib/Target/X86/X86MachineFunctionInfo.h

lib/Target/X86/X86.h
62

write the comment :)

lib/Target/X86/X86ISelLowering.cpp
16614

I wonder if we could do something clever to avoid creating a copy of an immediate. Might be too much work.

lib/Target/X86/X86WinAllocaExpander.cpp
117

s/lowerst/lowest/

120

This can be:

ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
169

Should this be INT32_MAX instead? It seems like we could do alloca, save+restore, alloca and run over the guard page this way. Maybe a good test case?

198

Maybe cache Is64Bit ? 8 : 4 as a SlotSize member variable. Also RAX as... some variable. RegA?

mkuper added inline comments.May 13 2016, 5:24 PM
lib/Target/X86/X86WinAllocaExpander.cpp
198

We can probably just query MRI for getSlotSize instead.

DavidKreitzer edited edge metadata.May 16 2016, 3:39 PM

Hi Hans,

I made a few detailed comments, but I have higher level concerns about the approach. This new pass will eliminate most of the _chkstk calls, which is great. But we'll still be leaving some performance on the table. We'd really like calls that take inalloca arguments to be optimized in all the same ways as other calls. In particular, we'd like to be able to reserve the outgoing argument block & do the store-to-push optimization for outgoing arguments.

I think the "inalloca" name is unfortunate, because these "WIN_ALLOCA" stack allocations have more in common with ADJCALLSTACKDOWN than they do with alloca. For one thing, they always allocate a fixed amount of space. Additionally, there is an implicit requirement that the stack pointer value immediately following the WIN_ALLOCA matches the stack pointer value at the time of the call to the inalloca function. This requirement is not currently enforced correctly in some cases. You'll find that this program doesn't work with clang on Win32, because the stack space for the _alloca call is allocated *after* the stack space for the "WIN_ALLOCA":

#include <stdio.h>
#include <malloc.h>
struct X
{
    int x;
    X() {x = 42;}
    X(const X&v) { x = v.x; }
};

X x;

void f2(X v, void*p) { printf("v.x = %d (should be 42)\n", v.x); }
void f(int n)
{
    f2(x, _alloca(n));
}

int main()
{
    f(1000);
    return 0;
}

Fixing that bug may be orthogonal to how we model & implement inalloca calls in the back end. But that illustrates why I think a more accurate modeling would be to move the ADJCALLSTACKDOWN instruction that precedes the inalloca call (i.e. the ADJCALLSTACKDOWN instruction that is hacked to always use an allocation size of 0) to the current location of the WIN_ALLOCA, use an accurate allocation size rather than 0, and get rid of the WIN_ALLOCA altogether. That would have the effect of introducing nested ADJCALLSTACKDOWN/ADJCALLSTACKUP structures, which I'm sure is going to cause some implementation problems. But I think it will allow us to more naturally handle & optimize these calls during FrameLowering. It's kind of mess, but I think it is just a natural consequence of Microsoft's choice to construct inalloca arguments in the outgoing argument block rather than in an arbitrary location on the stack and passing a pointer to that location as is done on Linux.

I'll try to put together a prototype. That might make it easier to see what I have in mind.

Meanwhile, I have no objection to using this new pass as a shorter term performance fix. :)

Thanks,
-Dave

BTW, any thoughts on how to fix the above bug? This is another thing that I think ought to be handled by clang.

lib/Target/X86/X86WinAllocaExpander.cpp
51

Do you really want to return a LoweringMap? This will cause the entire container to be copied. Maybe pass in a LoweringMap& and fill it in instead?

100

The indentation looks off here.

102

To be extra safe, you could assert that AllocaAmount is >= -1.

110

This assertion seems unnecessary given line 101.

127

"Out" isn't very descriptive. Can you name this differently, e.g. OutOffset?

153

You should probably add a default: notreached().

mkuper added inline comments.May 16 2016, 3:50 PM
lib/Target/X86/X86WinAllocaExpander.cpp
51

Won't this be ok with RVO?

hans marked 18 inline comments as done.May 16 2016, 4:39 PM

Thanks for the great comments everyone!

David, I agree this only addresses a small part of the problem and leaves a lot of performance on the table. The way I think about it is as an incremental improvement, easier than tackling the full inalloca problem all at once, and also helpful for dynamic allocas in general.

Having the WIN_ALLOCA pseudos expanded late might also help getting better code for these calls eventually, as it should remove the need for some of the pattern matching I tried in D20003 (unless we manage to fix them with some higher-level approach).

include/llvm/CodeGen/MachineFrameInfo.h
282 ↗(On Diff #57268)

Done.

lib/Target/X86/X86.h
62

Done.

lib/Target/X86/X86ISelLowering.cpp
16614

We can avoid creating the copy here, as Michael pointed out, but since the MI instruction expects a register (because the amount might not be a constant), there will spill be a copy at the MI level.

We could create a variant of the pseudo-instruction that takes an immediate, but I'm not sure that would end up being a simplification.

lib/Target/X86/X86WinAllocaExpander.cpp
51

I figured the call would be inlined anyway, and the compiler could avoid the copy.

Anyway, I've changed the call to return via a reference parameter instead, as that seems more idiomatic.

100

Done.

102

Hmm, I'll just change it to check AllocaAmount < 0 in the condition instead.

110

Removed.

117

Done.

120

Done.

127

Done.

153

I left it out intentionally, relying on -Wswitch to detect any missing cases: http://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations

167

I couldn't come up with any example code that would break this in practice. But you're right, a white-list would be better, and I don't think it will be big.

169

Oops, -1 was a leftover from an earlier version of the patch. I've added a test.

198

Done.

198

Done.

264

It's passed as an argument to .getAsInteger below. Or are you saying there's some shorter way to get at it?

Added a test.

hans updated this revision to Diff 57417.May 16 2016, 4:39 PM
hans edited edge metadata.
hans marked 13 inline comments as done.

Addressing comments and adding another test.

mkuper added inline comments.May 16 2016, 5:02 PM
lib/Target/X86/X86WinAllocaExpander.cpp
181

This looks a bit scary to me. It can cause an offset to be temporary negative, and negative offsets are special-cased now, right? (Well, we'll probably never hit "-1" due to alignment anyway, but...)

I think I can live with this, though, especially since, as Dave's example shows, allocations within this region are probably broken as is, so it's not like this would make anything worse. But if we keep it as is, then I think we need at least a FIXME.

264

Never mind, I misread, sorry for the noise.

hans added a comment.May 17 2016, 11:12 AM

Any more comments, or are you all OK with this going in?

lib/Target/X86/X86WinAllocaExpander.cpp
181

Thinking more about this, I don't think a negative offset would break anything. The interesting code is on line 107, which I think will do the right thing (if we ever ended up in this situation).

Negative-offsets are not special-cased (a negative WinAlloca amount means "unknown" though.)

rnk edited edge metadata.May 17 2016, 11:45 AM

BTW, any thoughts on how to fix the above bug? This is another thing that I think ought to be handled by clang.

We did notice this, it's https://llvm.org/bugs/show_bug.cgi?id=26776. We can leverage the fact that argument evaluation order is currently unsequenced, and evaluate the argument containing the alloca call first.

However, if C++17 defines argument evaluation order, then it will become impossible to satisfy the copy elision requirement and the evaluation order requirement in this old and busted 32-bit ABI. =/


I think we can transform WIN_ALLOCA to ADJCALLSTACK today in certain limited cases, and if we can, it would definitely be a win. I doubt we can tolerate nested ADJCALLSTACK frames, though, and inalloca typically happens precisely when there is a nested call sequence. Obviously, stack frame reservation conflicts with nested call stack adjustments, but we should already be protected from that by the hasVarSizedObjects flag, which is set earlier in FunctionLoweringInfo.

Do you intend to pursue store-to-push conversion for calls where an argument is still address-taken by the time we reach codegen? Consider this kind of code:

struct S { S(const S &); ~S(); int s; };
S makeS(); // cannot inline away
void takeS(int, S, int);
void passS() { takeS(1, makeS(), 3); }

MSVC can make this into "push 1, sub 4, call makeS, push 3", but by the time we reach LLVM codegen, we have no ability to reason about what memory makeS writes. I'm happy if LLVM can turn this into "sub, call makeS, store 1, store 3, call takeS, add". This is what we used to get before push conversion, and I think we can achieve it with our current approach. If we want to do full push conversion, then we need to revisit the IR.

I just have a couple other minor suggestions. Otherwise, LGTM.

lib/Target/X86/X86WinAllocaExpander.cpp
116

Isn't the formatting convention to line up "case" and "switch"?

153

Ah, cool. That's good to know, thanks!

217

Since SlotSize is always a power of 2, a minor compile-time efficiency improvement would be to change the assert condition to (Amount & (SlotSize - 1)) == 0.

mkuper accepted this revision.May 17 2016, 12:40 PM
mkuper edited edge metadata.

LGTM

lib/Target/X86/X86WinAllocaExpander.cpp
181

Ok, that sounds reasonable.

This revision is now accepted and ready to land.May 17 2016, 12:40 PM

Thanks for the pointer to https://llvm.org/bugs/show_bug.cgi?id=26776, Reid.

I doubt we can tolerate nested ADJCALLSTACK frames, though, and inalloca typically happens precisely when there is a nested call sequence.

I'm sure you are right. I'd still like to give it a try and see how badly things break.

Obviously, stack frame reservation conflicts with nested call stack adjustments, but we should already be protected from that by the hasVarSizedObjects flag, which is set earlier in FunctionLoweringInfo.

We can still do stack frame reservation with nested call stack adjustments. It's just that the reserved call frame is only available to the top-level calls. Nested calls would have to do their own stack adjusts. I'm sure making that distinction would create some implementation challenges, and it isn't likely to be high impact, but it seems theoretically possible.

This revision was automatically updated to reflect the committed changes.
hans marked an inline comment as done.