This is a first stab at the next step of the mov-to-push transformation.
It moves the transformation earlier in the pass order so that it can do load-folding, and prepares the required infrastructure.
It is still enabled only in cases where it should be a clear win: when we don't expect to have a reserved call frame, or when optimizing for size.
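To make the current enabling condition concrete, here is a minimal sketch of it. The struct, field, and function names are hypothetical and chosen for illustration; they are not the identifiers used in the patch. The assembly in the comments is only a rough picture of what the transformation does at a call site.

```cpp
// Hypothetical sketch of the enabling condition described above.
// None of these names come from the patch or from LLVM's API.
//
// Rough picture of the transformation at a call site (AT&T syntax):
//   movl 8(%ebp), %eax        becomes       pushl 8(%ebp)
//   movl %eax, (%esp)
//   calll foo                                calll foo
// The load is folded into the push, which is why the transformation
// has to run earlier in the pass order.

struct CallSiteContext {
  bool HasReservedCallFrame; // assumed: outgoing-argument area is reserved up front
  bool OptimizeForSize;      // assumed: the function is being optimized for size
};

// Returns true in the two "clear win" cases named in the description:
// no reserved call frame is expected, or we are optimizing for size.
bool shouldConvertMovsToPushes(const CallSiteContext &Ctx) {
  if (!Ctx.HasReservedCallFrame)
    return true;
  return Ctx.OptimizeForSize;
}
```

A smarter heuristic, as planned for the next step, would replace the body of this predicate rather than its call sites.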
The next step will be a heuristic that makes a smarter decision on when this should be enabled.
As a side note: I've done some internal testing for effects on code size, but I'd like to test on things other people care about as well. So, if you have a publicly available x86-32 code base where you care about code size, let me know.