Using pushes to move arguments onto the stack results in significantly smaller code. It also lets us remove the _chkstk call, since the pushes probe the stack naturally.
This patch only covers the basic cases where the call sequence contains no complicated instructions. Calls that use inalloca often have, e.g., nested calls or control flow in the call sequence, so in practice this patch doesn't fire often, but it's a start.
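For context, here is a rough source-level sketch of the two shapes of call sequence described above. It assumes a 32-bit MSVC-ABI target, where Clang uses inalloca for by-value arguments with a non-trivial copy constructor. The names (Tracked, sum, make_int, simple_case, nested_case) are made up for illustration, and whether the push conversion actually fires on either call depends on inlining and on what ends up between the stack adjustment and the call, so treat the comments as intent rather than a guarantee.

```cpp
// Hypothetical example type: the user-provided copy constructor is what
// makes by-value passing use inalloca on 32-bit MSVC-ABI targets.
struct Tracked {
    int value;
    static int copies;  // counts copy-constructor calls, for illustration
    explicit Tracked(int v) : value(v) {}
    Tracked(const Tracked &other) : value(other.value) { ++copies; }
};
int Tracked::copies = 0;

// Callee taking two non-trivially-copyable arguments by value.
int sum(Tracked a, int b, Tracked c) { return a.value + b + c.value; }

int make_int() { return 2; }

// Simpler shape: the arguments are copies plus a constant. If the copy
// constructors are inlined down to plain stores, the call sequence has no
// nested calls or control flow in it.
int simple_case(const Tracked &x, const Tracked &y) {
    return sum(x, 2, y);
}

// Harder shape: evaluating an argument involves another call, which may
// land inside the call sequence of the outer call.
int nested_case(const Tracked &x, const Tracked &y) {
    return sum(x, make_int(), y);
}
```

Both functions compute the same result on any target; only the generated argument setup differs.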
Please take a look.
This makes me feel like we should model argument allocation as a single operation instead of five or so. What do you think about changing X86TargetLowering::LowerDYNAMIC_STACKALLOC to emit a new DAG node which selects to a single MI?