This makes the prologue match the Windows canonical layout for cases without a frame pointer.
How badly do we really want to match the canonical packed prologue? Is it really worth generating less efficient instructions to reduce the size of the unwind data? (I guess it's not a lot less efficient, but still.)
Well, the space savings are quite notable, and without this in place we seldom end up matching the canonical forms that allow the packed format. So I'd hold off pushing D88677 until this one is settled (because there's little point in bending over backwards with the register order if we don't hit the packed forms regularly).
> Is it really worth generating less efficient instructions to reduce the size of the unwind data? (I guess it's not a lot less efficient, but still.)
I guess it's marginally less efficient, but in most cases the number of instructions produced should at least be the same. (In some of the testcase updates, it may look like we're getting more instructions, but that's in tests with sparse CHECK lines that don't check every instruction with CHECK-NEXT.)
AFAIK in most cases, this patch should amount to changing this:

    sub sp, sp, #48
    stp x19, x20, [sp, #16]
    stp x21, x30, [sp, #32]

into this:

    stp x19, x20, [sp, #-32]!
    stp x21, x30, [sp, #16]
    sub sp, sp, #16
So the same number of instructions, but sp is updated twice instead of once - that's the only inefficiency I can think of.
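To double-check that claim, here is a rough Python sketch that models the two prologue sequences quoted above. The `simulate` helper and the tuple encoding of the instructions are made up for illustration and are not part of the patch or of LLVM. It only tracks sp and where each register lands, assuming 8-byte slots and a starting sp of 0.

```python
def simulate(prologue, sp=0):
    """Model a prologue; return (final sp, {reg: offset from final sp}, sp writes)."""
    slots = {}       # register -> absolute address of its save slot
    sp_writes = 0    # number of instructions that modify sp
    for op in prologue:
        kind = op[0]
        if kind == "sub":        # ("sub", imm) models: sub sp, sp, #imm
            sp -= op[1]
            sp_writes += 1
        elif kind == "stp":      # ("stp", r1, r2, off) models: stp r1, r2, [sp, #off]
            _, r1, r2, off = op
            slots[r1] = sp + off
            slots[r2] = sp + off + 8
        elif kind == "stp_pre":  # ("stp_pre", r1, r2, off) models: stp r1, r2, [sp, #off]!
            _, r1, r2, off = op
            sp += off            # pre-index: sp is updated before the store
            sp_writes += 1
            slots[r1] = sp
            slots[r2] = sp + 8
        else:
            raise ValueError(f"unknown op {kind!r}")
    # Express every slot relative to the final sp so layouts can be compared.
    return sp, {reg: addr - sp for reg, addr in slots.items()}, sp_writes


# Prologue before the patch: one sp update, then two plain stores.
before = [("sub", 48), ("stp", "x19", "x20", 16), ("stp", "x21", "x30", 32)]
# Prologue after the patch: pre-indexed store, plain store, then the remaining sub.
after = [("stp_pre", "x19", "x20", -32), ("stp", "x21", "x30", 16), ("sub", 16)]

sp_before, layout_before, writes_before = simulate(before)
sp_after, layout_after, writes_after = simulate(after)
```

In this model both sequences end with the same sp and the same register offsets relative to it; only the number of sp updates differs (one versus two). It says nothing about actual latency on any particular core.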
Yes, the dependency chain of sp is one instruction longer, and there's one extra arithmetic op on some cores. Those double if you count the epilogue that has the same issue.
Not sure how significant that is in practice, but I suspect it's observable in code with a lot of small functions.
Maybe it makes sense to distinguish between -O2 vs. -Os here?