This is a follow-up to D20119 (although not dependent on it).
unwind_phase1 and unwind_phase2 each allocate their own unw_cursor_t buffer on the stack. This can blow up the unwinder's stack usage, depending on how these two functions get inlined into _Unwind_RaiseException. Clang seems to inline unwind_phase1 into _Unwind_RaiseException but not unwind_phase2, thus creating two unw_cursor_t buffers on the stack.
One way to work around this problem is to mark both unwind_phase1 and unwind_phase2 as noinline. I chose the less macro-dependent approach: allocate one unw_cursor_t buffer explicitly and pass it into both unwind_phase1 and unwind_phase2.
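For illustration only, here is a minimal sketch of the shape of this change. It is not the actual patch: sketch_cursor_t, sketch_phase1, sketch_phase2 and sketch_raise_exception are hypothetical stand-ins for unw_cursor_t, unwind_phase1, unwind_phase2 and _Unwind_RaiseException, and the buffer size is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for unw_cursor_t: a large opaque buffer.
 * The size is an assumption chosen only to make the cost visible. */
typedef struct {
    unsigned long data[128];
} sketch_cursor_t;

/* Each phase takes the caller's cursor instead of declaring a local one,
 * so its own frame no longer carries a cursor-sized buffer. */
static int sketch_phase1(sketch_cursor_t *cursor) {
    assert(cursor != NULL);
    memset(cursor, 0, sizeof *cursor); /* reset state for the search phase */
    cursor->data[0] = 1;               /* marker: phase 1 used the buffer */
    return 0;
}

static int sketch_phase2(sketch_cursor_t *cursor) {
    assert(cursor != NULL);
    memset(cursor, 0, sizeof *cursor); /* reset state for the cleanup phase */
    cursor->data[0] = 2;               /* marker: phase 2 used the buffer */
    return 0;
}

/* The entry point owns the only cursor; both phases reuse it, so the
 * stack cost no longer depends on which phase the compiler inlines. */
int sketch_raise_exception(void) {
    sketch_cursor_t cursor;
    int rc = sketch_phase1(&cursor);
    if (rc != 0)
        return rc;
    return sketch_phase2(&cursor);
}
```

With this layout only one cursor-sized buffer is live in the entry point's frame regardless of inlining decisions, which is the property the patch relies on.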
This shaves off about 1000 bytes of stack usage on ARM targets.
Together with D20119, the current patch shaves off roughly 2KB of stack usage.