[AArch64] Prefer prologues with sp adjustments merged into stp/ldp for WinCFI, if optimizing for size
This makes the prologue match the Windows canonical layout, for
cases without a frame pointer.
This can potentially be slower (a longer dependency chain on the
sp register, and potentially one more arithmetic operation on some
cores), but gives notable size improvements.
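As a rough illustration (the registers and frame sizes below are
assumed for the example, not taken from a real function), a function
that saves x19/x20 and has 16 bytes of locals would get a prologue and
epilogue of roughly one of these two shapes:

```
// Combined stack bump: a single sp adjustment, plain stp/ldp.
sub  sp, sp, #32
stp  x19, x20, [sp, #16]
// ... function body ...
ldp  x19, x20, [sp, #16]
add  sp, sp, #32
ret

// sp adjustment merged into stp/ldp (pre-decrement/post-increment),
// with a separate adjustment for the locals.
stp  x19, x20, [sp, #-16]!
sub  sp, sp, #16
// ... function body ...
add  sp, sp, #16
ldp  x19, x20, [sp], #16
ret
```

The second form updates sp twice on entry and on exit, which is where
the longer dependency chain comes from.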
The previous two commits shrink a 166 KB xdata section by 49 KB,
and if the change from this commit is enabled, it shrinks the xdata
section by another 25 KB.
In total, since the start of the recent arm64 unwind info cleanups
and optimizations (since before commit 37ef743cbf3), the xdata+pdata
sections of the same test DLL have shrunk from 407 KB in total
originally, to 163 KB now.
Differential Revision: https://reviews.llvm.org/D88701