The idea here is to use the frame pointer (FP) as a general-purpose
register (GPR) when it isn't strictly required. The observation is that
the FP is only needed at call sites; at any other point in the function
it can be used as a GPR. This complicates some things, but it is another
valid way to keep the FP while regaining some of the performance lost by
reserving a register.
Largely, we want the register allocator (RA) to work normally, as if FP
were not reserved, and to use it freely. We add the constraint that FP
is restored at every call site. In limited testing this seems to work as
intended.
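
As a rough illustration of the kind of function this targets, here is a
minimal sketch in C. The function and callee names are hypothetical and
are not taken from the patch, and whether the allocator actually picks
FP in the loop depends on the target, the optimization level, and
register pressure. The comments only mark where FP could be treated as a
GPR and where it would have to hold a valid frame address again.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callee: stands in for any call that an unwinder or
   profiler might walk through. */
static uint64_t sink_total;

static void record(uint64_t v) {
    sink_total += v;
}

/* Hypothetical example function, not part of the patch. */
uint64_t mix_and_record(const uint64_t *p, size_t n) {
    uint64_t a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;

    /* Call-free region: several values are live at once. Under the
       scheme described above, FP would be available to the register
       allocator here like any other GPR. */
    for (size_t i = 0; i < n; ++i) {
        a += p[i];
        b ^= a;
        c += b;
        d ^= c;
        e += d;
        f ^= e;
    }

    uint64_t r = a ^ b ^ c ^ d ^ e ^ f;

    /* Call site: FP would have to be restored before this call so that
       the frame chain is intact while the callee runs. */
    record(r);
    return r;
}
```

The longer the call-free region and the higher the register pressure
inside it, the more an extra GPR could plausibly help; a function that
calls out on every iteration would see FP restored too often to gain
much.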
This prototype only focuses on X86, since it's a bit more register
starved than other architectures, like ARM or RISC-V. This type of
optimization may also not be worth it over just
-momit-leaf-frame-pointer.
This is entirely experimental at this point. Internal testing showed
that this is an improvement over reserving the FP, but not as
significant as we had hoped. Further analysis shows that the cost of
keeping the FP, compared with omitting it altogether, is often in the
noise. Benchmarking exhibits a lot of variance, so it is hard to
estimate the impact. Linux perf is also influenced by the use of FP, so
getting good results is challenging.