This new implementation emits a single probe instruction such as
  movb $0, -4096(%rsp)
which is both faster and smaller than a pair such as
  sub $4096, %rsp
  movq $0, (%rsp)
This implementation also trivially preserves the precision of the
unwind tables (uwtable) during the preamble, because it does not modify
the stack pointer in the first place.
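
As a rough illustration of the probe sequence described above, here is a
minimal Python sketch that lists the probes for a given frame size. It is
not the code from this patch: the helper name probe_instructions and the
4096-byte interval are assumptions made for the example, and it ignores
how a frame size that is not a multiple of the interval would be handled.

```python
PROBE_INTERVAL = 4096  # assumed probe interval, taken from the -4096 offset above


def probe_instructions(frame_size, interval=PROBE_INTERVAL):
    """Return one AT&T-syntax probe per full interval of the frame.

    Hypothetical helper for illustration only. Each probe stores a zero
    byte at a negative offset from %rsp, so %rsp itself is never modified
    while probing.
    """
    return [
        f"movb $0, -{offset}(%rsp)"
        for offset in range(interval, frame_size + 1, interval)
    ]


# Print the probes for the 0x4000-byte frame used in the measurements.
for line in probe_instructions(0x4000):
    print(line)
```

For a 0x4000-byte frame this yields four probes, at offsets 4096, 8192,
12288 and 16384 below %rsp.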
For a stack frame of 0x4000 bytes (4 probes), llvm-mca reports the
following for the generated code (over 100 iterations):
test case | mcpu    | cycles | IPC  | RThroughput |
old       | znver2  | 603    | 1.66 | 2.5         |
new       | znver2  | 204    | 2.94 | 1.5         |
old       | skylake | 603    | 1.66 | 4.0         |
new       | skylake | 403    | 1.49 | 4.0         |
old       | bdver1  | 603    | 1.66 | 6.0         |
new       | bdver1  | 403    | 1.49 | 4.0         |
old       | haswell | 603    | 1.66 | 4.0         |
new       | haswell | 403    | 1.49 | 4.0         |
So overall, reciprocal throughput is either unchanged (skylake, haswell)
or improved (znver2, bdver1), and the cycle count drops on every CPU
tested.
Depends on D98909