
[X86] Convert esp-relative movs of function arguments to pushes, step 2

Authored by mkuper on Dec 28 2014, 6:11 AM.



This is a first stab at the next step of the mov-to-push transformation.

It moves the transformation earlier in the pass order so that it can do load-folding, and prepares the required infrastructure.
It is still enabled only in cases where it should be a clear win - when we don't expect to have a reserved call frame, or when optimizing for size.
The next step will be a heuristic that makes a smarter decision on when this should be enabled.

As a side note - I've done some internal testing for effects on the code size, but I'd like to do some testing for things other people care about as well. So, if you have an x86-32 code base where you care about code size, and that is publicly available, let me know.

Diff Detail


Event Timeline

mkuper updated this revision to Diff 17656.Dec 28 2014, 6:11 AM
mkuper retitled this revision from to [X86] Convert esp-relative movs of function arguments to pushes, step 2.
mkuper updated this object.
mkuper edited the test plan for this revision. (Show Details)
mkuper added reviewers: nadav, rnk, delena.
mkuper added a subscriber: Unknown Object (MLST).
mkuper updated this revision to Diff 17788.Jan 5 2015, 2:00 AM

Removed a horrible hack that was, in addition to being horrible, completely wrong, and added a test-case to cover the issue.

Also, ping?

delena edited edge metadata.Jan 5 2015, 5:31 AM

I suggest also checking varargs and stdcall functions, where the callee clears the stack.

17 ↗(On Diff #17788)

Can you add this code to X86FrameLowering.cpp ?

70 ↗(On Diff #17788)

I suggest to choose another name, something like optimizeCallFrameForSize

39 ↗(On Diff #17788)

I don't think that we really need this knob.

113 ↗(On Diff #17788)

If you change instructions inside bb, your iterator may be broken.

212 ↗(On Diff #17788)

It should be immediate, right? Can we have a relocation here?

224 ↗(On Diff #17788)

SlowPush should be a property of the target, like slowLea

226 ↗(On Diff #17788)

The comment is missing here.

mkuper added a comment.Jan 5 2015, 9:07 AM

Thanks, Elena!
Will upload a new version.

70 ↗(On Diff #17788)

I wasn't happy with the name either, but didn't have any good ideas at the time.
Will do.

39 ↗(On Diff #17788)

I'd rather keep this knob, it's fairly useful for debugging.
Of course, it's internal only, not exposed to clang.

113 ↗(On Diff #17788)

As far as I know, MBB iterators aren't invalidated by removing other instructions, and we don't remove the FrameSetup itself.
But it's probably better to keep going from the FrameDestroy instead of the next instruction.
Will change that.

212 ↗(On Diff #17788)

It can be a relocation, but in that case, isImm() will fail.
Will document that more clearly.

224 ↗(On Diff #17788)

I agree. Unfortunately, I've run out of bits.
The Subtarget features are a 64-bit bitfield, and the bits are all taken.

508 ↗(On Diff #17788)

And, apparently, this is still wrong, because eliminateCallFramePseudoInstr() may actually adjust the SP by a different amount than what PEI passes as the SPAdj, e.g. due to stack alignment concerns.

mkuper added inline comments.Jan 6 2015, 3:19 AM
285 ↗(On Diff #17788)

Argh. This is nonsense. Commented one thing, coded another...
(mayStore() is extremely far from being a strong enough condition to allow this.)

rnk added inline comments.Jan 12 2015, 4:02 PM
39 ↗(On Diff #17788)

I would also like this as a temporary testing knob so that I can evaluate this across a large codebase.

mkuper updated this revision to Diff 18084.Jan 13 2015, 6:33 AM
mkuper edited edge metadata.

So, this version should actually work (e.g. it can self-host and pass check-llvm). That's without the stackalign restriction, of course, since that restriction currently makes it a no-op except on Windows.
Unfortunately, it has several big warts, so I'm not planning to commit it as is. This is more of a request for ideas on how to improve the code.

So, any ideas on how to make this sane, especially X86InstrInfo::getSPAdjust(), are welcome.

rnk edited edge metadata.Jan 13 2015, 9:04 AM

I haven't finished reviewing yet, but I've got to run and handle something personal.

At a high level, is there any reason we shouldn't commit to push/pop earlier to allow for better ISel, rather than trying to transform call sequences later? Specifically, I'm thinking about adding an X86ISD::PUSH DAG node and changing X86TargetLowering::LowerCall() to use it.

855–856 ↗(On Diff #18084)

This seems like an x86-specific quirk, right? Given "push [esp + 8]", x86 chips will load [esp + 8] before adjusting esp, and I think this code motion accomplishes that.

I'm OK with that motion so long as there are no other upstream LLVM backends with CISC-y instructions like "push [SP-mem]". :)

11 ↗(On Diff #18084)


82–83 ↗(On Diff #18084)

I think it's important to at least support __thiscall eventually, since that's a very common convention with one regparm.

84–86 ↗(On Diff #18084)

I guess I would justify this more in terms of reducing the extra CFI that we would have to emit to describe the SP adjustments. Converting a few movs to pushes isn't worth the complexity.

143 ↗(On Diff #18084)

Can you explain why this is unprofitable? I guess if we get here we are in dynamic alloca plus stack realignment land, i.e. the worst thing that could possibly happen. Is this about extra code for preserving the outgoing stack alignment then? Like on Linux, where we provide 16 byte stack alignment?

Thanks, Reid!

Waiting for the second part; you didn't get to the really horrible stuff yet...

Regarding the high level, two reasons:

  1. It seemed like it was going to be simpler. I'm not so sure anymore, but I still think it is. (Note that we'll still need to fix all of the code that tracks SP adjustment, that's not going away in either case).
  2. The main problem is that next step after this is going to be a function-scope heuristic. To use this transformation for even one call-site, I have to disable the reserved frame for the whole function. So, I need to try to approximate the impact on the whole function (which contains some calls that will be converted to use pushes, and some calls that won't be). I don't see how this can be done on the DAG level.
855–856 ↗(On Diff #18084)

This call to SPAdjust() always returns 0 right now (barring the code in this patch), it was added as part of my refactoring in D6863, and I added it in the wrong place.

The motivation here wasn't a push, actually, since I try to never generate push [esp + 8], that's filtered out by the code in the optimization. Although I can probably start generating them - I was trying to filter them out precisely because I didn't want all of this complexity at the first stage, but apparently it's necessary.

The problem is that once we don't have a reserved call frame (regardless of the push transformation), you can have things like
CALL32r <fi#1>, where the call is callee-pop.
So you need to resolve the indirect call using the stack-pointer from before the call.

82–83 ↗(On Diff #18084)

Yes, and maybe even for _fastcall (It looks like gcc will do this for fastcall, icc won't).
But I am still trying to do this gradually, to the extent that I can. :-)

84–86 ↗(On Diff #18084)

You're right, that too.

143 ↗(On Diff #18084)

If we get here, we're in opt-for-size + stack-realignment land.

And, yes, that's exactly what it is about.
If you are passing only one parameter, the original code would be:

mov %eax, 128(%esp)
call $foo

Without re-alignment, you have

push %eax
call $foo
add $4, %esp

which is still a win in terms of code size.

With re-alignment, you get:

sub $16, %esp
push %eax
call $foo
add $12, %esp

Which is... questionable. The code size for the sequence is the same (in this case, 7 bytes for both, not including the call), but if you have other call sites which you didn't convert, you may actually lose. And, of course, you lose performance (3 instructions instead of 1) without anything to show for it.

Once there is a heuristic that tries to estimate the overhead, we can address this on a case-by-case basis (e.g. if we have 16-byte stack re-alignment, but most call-sites have a lot of parameters, then it's still worth it.)

rnk added inline comments.Jan 13 2015, 3:58 PM
128–130 ↗(On Diff #18084)

I think I misinterpreted this on the first pass. We always expect this to be profitable if we know we *can't* reserve space for the call frame. Maybe rename the bool to CannotReserveFrame to match the sense?

143 ↗(On Diff #18084)

Based on my misinterpretation, I think I understand why you get this code. SP is assumed to be aligned coming into the sequence. We realign SP after dynamic allocas. The sequence is probably more like:

sub $12, %esp
push %eax
call $foo
add $16, %esp

I can see why this is less profitable.

208 ↗(On Diff #18084)

std::map is really malloc heavy. This can probably be a SmallVector<MachineInstr*, 8> or something, mapping slot index to the MI that fills it. The frame setup opcode should tell you how much stack space to allocate up front, and you can index into the vector by StackOffset / 4.

220–222 ↗(On Diff #18084)

This seems worth tackling, given that you had to handle the call <fi> case. :)

364–368 ↗(On Diff #18084)

It's not clear to me that same BB is sufficient, consider this potential BB:

movl (%edi), %eax
movl $42, (%edi)
<call setup>
movl %eax, (%esp)
calll foo
<call end>

We can't move the load if there is a potentially aliasing store in the way. There might be a utility to help with the aliasing query, or you can assume that any stores other than arg stores might alias it and bail on that.

1717–1718 ↗(On Diff #18084)

This is the best thing I can think of at the moment. =/

206 ↗(On Diff #18084)

Test case suggestions:

; Where the callee is indirect via the stack, `call <fi>`
define void @test10() optsize {
  %stack_fptr = alloca void (i32, i32, i32, i32)*
  store void (i32, i32, i32, i32)* @good, void (i32, i32, i32, i32)** %stack_fptr
  %good_ptr = load void (i32, i32, i32, i32)** %stack_fptr
  call void (i32, i32, i32, i32)* %good_ptr(i32 1, i32 2, i32 3, i32 4)
  ret void
}

; We can't fold the load into the push here, skipping the store.
@the_global = global i32 0
define void @test11() optsize {
  %myload = load i32* @the_global
  store i32 42, i32* @the_global
  call void @good(i32 %myload, i32 2, i32 3, i32 4)
  ret void
}

Thanks, Reid!

128–130 ↗(On Diff #18084)

Err, yes, you're right, sorry about that... I got distracted while naming the variable, I guess, and meant the opposite.

143 ↗(On Diff #18084)

Yes, that sequence. :-)

It doesn't depend on dynamic allocas, though.
If you don't have a reserved frame (for whatever reason - for x86 after this patch, it's either dynamic allocas, or because we forced it not to reserve by using pushes), then you need this re-alignment.

208 ↗(On Diff #18084)

That can work. Thanks, I'll try.

220–222 ↗(On Diff #18084)

Yes, definitely. :-)
It may even work out of the box now. But I think I still want to split it into a separate commit.

364–368 ↗(On Diff #18084)

Right now I'm way more conservative than even that - I'm checking below that everything between this mov and the call setup is a MOV32rm. The "same basic block" check here is just a way to short-circuit the obviously wrong cases.

This catches some common cases like the one in the comment above, but of course misses other opportunities.
I could check for a mayStore() instead, but I'm not sure that's safe enough. I'd like to relax the condition - but again, I think that ought to be a separate commit.

1717–1718 ↗(On Diff #18084)

Too bad. :-\

So you think I should commit with this code as is?
This shouldn't be a huge problem in terms of compile-time (since I'm looking only until the next call, it can't go quadratic), but it's insanely ugly.

mkuper updated this revision to Diff 18222.Jan 15 2015, 4:51 AM
mkuper edited edge metadata.
  • Applied review comments
  • Fixed another bug in the way PEI was handling push sequences (argh) - this required adding a target query.
  • Made the tests check a bit more (which would have exposed the bug above earlier).
rnk added inline comments.Jan 15 2015, 10:19 AM
1717–1718 ↗(On Diff #18084)

Yeah, if we go with this MI pass approach to mov -> push conversion, then we'll have to keep this ADJCALLSTACKUP scan. We aren't going to move the callee cleanup stack adjustment onto the CALL instr without major changes.

1745 ↗(On Diff #18084)

I wonder if it's possible for __readeflags() (pushf ; pop %reg) or others to get folded into a call sequence. Probably not.

mkuper added inline comments.Jan 16 2015, 5:57 AM
1717–1718 ↗(On Diff #18084)

This will have to happen regardless of the MI pass vs. DAG approach.

I mean, I still think doing it on the DAG is unfeasible, but even if we could do that, it wouldn't help.
This code is used for the case where fi resolution needs to handle a sequence where there is a fi reference between the call and the adjcallstackup, with callee cleanup for the call.
This is just a side effect of making canSimplifyCallFramePseudos return false.

1745 ↗(On Diff #18084)

I don't see how it could happen.
In any case, we won't match either the pushf or the pop, so it should be ok.

rnk accepted this revision.Jan 22 2015, 12:57 PM
rnk edited edge metadata.


I still think forming pushes prior to isel is the way to go long term. It's a lot easier to convert pushes to 'load, SP adjust, store' than it is to go the other way.

196 ↗(On Diff #18222)

"- Do" uppercase

100 ↗(On Diff #18222)

Can this be for (MachineBasicBlock &BB : *MF) {?

102 ↗(On Diff #18222)

Ditto, for (MachineInstr &MI : BB) { ?

1717–1718 ↗(On Diff #18084)

I was imagining in the DAG LowerCall implementation we emit FrameIndex operands with some kind of SP offset to indicate the current stack level. We'd end up with MI looking like this:

ADJCALLSTACKDOWN32 <N> ; N is <size-of-args> % <stack-alignment>, which is usually zero
PUSH32rmm <fi> <sp offset, N>
PUSH32rmm <fi> <sp offset, N + 4>
PUSH32rmm <fi> <sp offset, N + 8>
CALL32rm <fi> <sp offset, N + 12>

The main thing is that if we commit to pushes instead of movs at DAG time, it's impossible for the push conversion to fail for hard to diagnose reasons.

It looks like the frame index MachineOperand type has an unused offset field.

1713–1725 ↗(On Diff #18222)

I would shorten this to just something like "look for the ADJCALLSTACKUP instr that follows the call".

This revision is now accepted and ready to land.Jan 22 2015, 12:57 PM

Hi Chandler,

This is something that Reid and I talked about on IRC, but I don’t think we came to a conclusion both of us were happy with (hence Reid’s “lgtm with reservations”, I guess :-) )

First, I don’t think the decision on whether to use movs or pushes belongs in the DAG.
The decision on whether a call-site should use movs or pushes needs to be aware of its context, because having even one call-site use pushes means we will not have a reserved call frame, which affects the way all other call sites are treated as well. This patch makes the decision based on global attributes only (opt for size vs. speed, stack alignment), but the next step will be to make it based on an analysis of the call-sites – e.g. even with stack alignment of 16, it can still often be a win, depending on just how many of the function calls we can actually transform, and how many memory arguments each call has.

So the way I envision the next step is that the pass will:

a) Collect the necessary information from all call sites in the function.

b) Make a judgment on whether the transformation is worth it – in terms of size for Os/Oz, in terms of performance for other opt levels.

c) Perform the transformation.
I don’t see how we can do this on the DAG.

If I understand Reid’s last suggestion, he proposed to flip the default – that is, emit pushes in the DAG, and have an MI pass that does the opposite (push -> mov) transformation if necessary.

I don’t believe that removes a lot of complexity or would improve performance.
The code in PEI, InstrInfo and FrameLowering is just a side effect on not being able to rely on a 0 SPAdj in PEI anymore (that is, canSimplifyCallFramePseudos() can now return false), and is needed regardless of how the transformation is performed. And we will still need the heuristic decision.
Some of the logic in looking for sequences where the conversion is possible will disappear, but I think a lot of it will remain as conditions on the incoming operand DAG nodes. And since we don’t want to transform each push into an “adjust esp, mov” but rather want to group all the esp adjustments back into the ADJCALLSTACKs, we will still need code in the pass that makes sure this is safe w.r.t. the final sequence.
The main benefit I see is that we will no longer need to have the folding code – rather, we will have to unfold PUSH32rmm, which is simpler. However, I hope I can eventually get rid of the folding here by teaching PeepholeOptimizer to be smarter about this.

On the other hand, X86TargetLowering::LowerCall() is already, IMHO, a fairly complex piece of code, and I’d rather avoid making it even more complex.
Conceptually, I’d prefer that LowerCall() did standard mov-based lowering in all cases like it does now (we aren’t always going to lower to pushes anyway - it doesn’t really make sense for x86-64) and treat pushes as an optimization where available.

What do you think?


From: Chandler Carruth []
Sent: Thursday, January 22, 2015 23:07
Cc: Kuperstein, Michael M; Nadav Rotem; Demikhovsky, Elena; Commit Messages and Patches for LLVM
Subject: Re: [PATCH] [X86] Convert esp-relative movs of function arguments to pushes, step 2

This revision was automatically updated to reflect the committed changes.