For tail calls identified during DAG generation, the target address is
loaded into a register via the constant pool.
If R3 is used for argument passing, the target address is forced
into hard register R12 to work around limitations of the Thumb1 register
allocator with respect to the upper registers.
During epilogue generation, spill registers are restored within
the emit epilogue function. Four different cases are to be distinguished:
- 1) If LR is not pushed on the stack, simply a BX is generated.
- 2) If LR is pushed on the stack and R3 is available as scratch, LR is restored after the pop { ... } for the remaining callee-saved regs.
- 3) If R3 is not available for the LR restore, LR is restored before the pop { ... } and the stack pointer is re-adjusted afterwards.
- 4) If all regs R0...R3 are used for function call parameters, the target address is copied to R12.
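The case distinction above can be sketched as a small decision function. This is only an illustration, not the code from the patch: the names `EpilogueCase`, `TailCallInfo` and `classify` are hypothetical, and the precedence between the cases (in particular that the "all of R0...R3 used" case is checked before the R3 scratch cases) is an assumption, since the description does not spell it out.

```cpp
// Hypothetical sketch; names and case precedence are assumptions,
// not taken from the actual patch.
enum class EpilogueCase {
  BranchOnly,         // case 1: LR was never pushed, a plain BX suffices
  RestoreLRAfterPop,  // case 2: R3 is free, LR is restored after pop { ... }
  RestoreLRBeforePop, // case 3: R3 is busy, LR restored first, SP fixed up after
  TargetInR12         // case 4: R0...R3 all carry arguments, target goes to R12
};

struct TailCallInfo {
  bool lrPushed;       // did the prologue push LR?
  bool r3FreeAsScratch; // can R3 be clobbered in the epilogue?
  bool allArgRegsUsed; // do R0...R3 all carry call arguments?
};

// Assumed precedence: "LR not pushed" wins, then "all argument registers
// used", then the two R3-dependent cases.
constexpr EpilogueCase classify(const TailCallInfo &info) {
  if (!info.lrPushed)
    return EpilogueCase::BranchOnly;
  if (info.allArgRegsUsed)
    return EpilogueCase::TargetInR12;
  return info.r3FreeAsScratch ? EpilogueCase::RestoreLRAfterPop
                              : EpilogueCase::RestoreLRBeforePop;
}
```

Keeping the decision in one pure function like this would make it easy to check each case against the corresponding test case, but the real emit epilogue function may well interleave the checks with code emission.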
On a Cortex-M0, I counted that sequence 2 takes one cycle longer than a version based on BL / pop { ..., pc } without a tail call. Option 3 is 2 cycles slower than a version without a tail call (additional SP += 4), and option 4 is 3 cycles slower. Options 2, 3 and 4 also generate slightly larger code (have a look at the test cases).
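For quick reference, the cycle counts quoted above can be written down as a lookup. The function name `extraCyclesM0` is hypothetical; the numbers are only the ones stated in this description, and no figure is given for option 1, so the sketch returns an empty value there rather than guessing.

```cpp
#include <optional>

// Extra Cortex-M0 cycles of a tail call relative to a BL / pop { ..., pc }
// sequence, keyed by the option number used in the description.
// Hypothetical helper; values are the author's counts, not measurements here.
constexpr std::optional<int> extraCyclesM0(int option) {
  switch (option) {
  case 2:
    return 1; // LR restored after the pop
  case 3:
    return 2; // LR restored before the pop, plus the additional SP += 4
  case 4:
    return 3; // target address copied to R12
  default:
    return std::nullopt; // option 1 or unknown: no figure stated
  }
}
```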
In discussions on llvm-dev, some argued that for this reason tail call optimization should not be part of the default options. In my view, the precious stack memory saved is well worth it.
Unnecessary braces