This is a first stab at the next step of the mov-to-push transformation.
It moves the transformation earlier in the pass order so that it can do load-folding, and prepares the required infrastructure.
It is still enabled only in cases where it should be a clear win: when we don't expect to have a reserved call frame, or when optimizing for size.
The next step will be a heuristic that makes a smarter decision on when this should be enabled.
As a side note: I've done some internal testing for effects on code size, but I'd like to test on things other people care about as well. So, if you have an x86-32 codebase where code size matters and that is publicly available, let me know.
This seems like an x86-specific quirk, right? Given "push [esp + 8]", x86 chips will load [esp + 8] before adjusting esp, and I think this code motion accomplishes that.
I'm OK with that motion so long as there are no other upstream LLVM backends with CISC-y instructions like "push [SP-mem]". :)
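To make the ordering concrete, here is a toy C++ model of the behavior described above. It is a sketch, not LLVM code: `ToyStack`, `pushMemX86`, `pushMemNaive`, and `makeStack` are hypothetical names made up for illustration, and memory is modeled as a small array of 32-bit words. It contrasts the x86 semantics (source address computed with the old esp) with a naive expansion that adjusts esp first and reuses the same displacement.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of a 32-bit stack (hypothetical, for illustration only).
struct ToyStack {
    std::array<std::uint32_t, 16> mem{};  // one slot per 4-byte word
    std::uint32_t esp = 0;                // byte offset into mem

    std::uint32_t load(std::uint32_t addr) const { return mem.at(addr / 4); }
    void store(std::uint32_t addr, std::uint32_t v) { mem.at(addr / 4) = v; }

    // Models "push [esp + disp]" with x86 ordering: the source operand is
    // read using the value of esp from before the decrement.
    void pushMemX86(std::uint32_t disp) {
        std::uint32_t value = load(esp + disp);  // address uses old esp
        esp -= 4;
        store(esp, value);
    }

    // Models a naive expansion that adjusts esp first and then reads
    // [esp + disp] without fixing up disp, so it reads the wrong slot.
    void pushMemNaive(std::uint32_t disp) {
        esp -= 4;
        std::uint32_t value = load(esp + disp);  // address uses new esp
        store(esp, value);
    }
};

// Builds a stack with esp = 32, [esp + 4] = 0xAAAA, [esp + 8] = 0xBBBB.
ToyStack makeStack() {
    ToyStack s;
    s.esp = 32;
    s.store(32 + 4, 0xAAAA);
    s.store(32 + 8, 0xBBBB);
    return s;
}
```

Starting from `makeStack()`, `pushMemX86(8)` leaves 0xBBBB on top of the stack, while `pushMemNaive(8)` leaves 0xAAAA. That gap is why folding an esp-relative load into a push has to account for where the load sits relative to the esp adjustment.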