The current unwinding logic uses global locks, which leads to
terrible exception performance in multi-threaded programs.
For example, on an AMD EPYC 7713 with 128 cores, the run time of a
test program degrades from 36ms single-threaded to 6424ms with 128
threads when all threads throw exceptions concurrently.
See https://isocpp.org/files/papers/P2544R0.html for a detailed
discussion.
This commit fixes this problem by maintaining the exception tables
in a b-tree using optimistic lock coupling, which allows for
lock-free reads. Writers still lock, but changes to exception tables
are rare: they are usually triggered only by dlopen/dlclose calls or
by JITed code. During unwinding, the lookup does not require any
atomic writes, which lets the test program mentioned above run in
59ms instead of over 6s.
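
The read path described above can be illustrated with a minimal
sketch of an optimistic version lock guarding a single entry. This is
not the code in this commit: the names VersionLock and Slot are
hypothetical, and a real b-tree would apply the same check to every
node on the path from the root to the leaf.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical version lock: bit 0 is the writer lock bit, the
// remaining bits form a version counter.
struct VersionLock {
    std::atomic<std::uint64_t> word{0};

    // Reader: take a snapshot of the version. Fails if a writer is active.
    bool read_begin(std::uint64_t& v) const {
        v = word.load(std::memory_order_acquire);
        return (v & 1) == 0;
    }
    // Reader: the data read since read_begin is valid only if the
    // version is unchanged. Readers never write to shared memory.
    bool read_validate(std::uint64_t v) const {
        std::atomic_thread_fence(std::memory_order_acquire);
        return word.load(std::memory_order_relaxed) == v;
    }
    // Writer: spin until the lock bit can be set.
    void write_lock() {
        std::uint64_t v = word.load(std::memory_order_relaxed);
        for (;;) {
            if (v & 1) { v = word.load(std::memory_order_relaxed); continue; }
            if (word.compare_exchange_weak(v, v | 1,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
                break;
        }
        std::atomic_thread_fence(std::memory_order_release);
    }
    // Writer: clear the lock bit and advance the version in one step
    // (odd value + 1 yields the next even value).
    void write_unlock() { word.fetch_add(1, std::memory_order_release); }
};

// Hypothetical entry: one address range protected by a version lock.
struct Slot {
    VersionLock lock;
    std::atomic<std::uintptr_t> base{0};
    std::atomic<std::uintptr_t> size{0};

    void update(std::uintptr_t b, std::uintptr_t s) {
        lock.write_lock();
        base.store(b, std::memory_order_relaxed);
        size.store(s, std::memory_order_relaxed);
        lock.write_unlock();
    }
    // Lock-free read: retry until a consistent snapshot was observed.
    bool contains(std::uintptr_t pc) const {
        for (;;) {
            std::uint64_t v;
            if (!lock.read_begin(v)) continue;
            std::uintptr_t b = base.load(std::memory_order_relaxed);
            std::uintptr_t s = size.load(std::memory_order_relaxed);
            if (lock.read_validate(v)) return pc - b < s;
        }
    }
};
```

Readers only load from shared memory, so concurrent lookups do not
contend on a cache line; a reader that races with a writer simply
retries.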
Unfortunately, the mechanism requires cooperation from the
application, as glibc has no mechanism to notify us when a shared
library is removed. Thus, it is double opt-in: first, libunwind itself
has to be compiled with -DLIBUNWIND_USE_BTREE=On. Then, at runtime,
the application has to enable the mechanism by calling
    void __libunwind_btreelookup_sync();
That sync function must be called once at startup and then again
after every dlopen and every dlclose call. Failing to call sync
after dlopen is safe; it only leads to suboptimal performance by
falling back to the existing mechanism. But failing to call sync
between a dlclose and a subsequent dlopen is unsafe, as the new
library might remain associated with the old unwinder logic.
But if the application knows what it is doing, and guarantees to
call sync between dlclose and dlopen, it gets dramatically better
performance in exception unwinding.
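
One way an application could honor this contract is to route all
library loading through small wrappers, sketched below. The wrapper
names (sync_unwinder, synced_dlopen, synced_dlclose) and the counter
are hypothetical. The weak declaration is a GCC/Clang extension and is
used only so that the sketch also links against an unwinder built
without the b-tree lookup.

```cpp
#include <atomic>
#include <dlfcn.h>

// Provided by a libunwind built with -DLIBUNWIND_USE_BTREE=On.
// Declared weak (assumption: GCC or Clang) so the address is null
// when the symbol is absent.
extern "C" void __libunwind_btreelookup_sync() __attribute__((weak));

// Hypothetical counter, for illustration only.
inline std::atomic<unsigned> sync_requests{0};

// Call once at startup and after every dlopen/dlclose.
inline void sync_unwinder() {
    sync_requests.fetch_add(1, std::memory_order_relaxed);
    if (__libunwind_btreelookup_sync)
        __libunwind_btreelookup_sync();
}

// dlopen followed by sync, so the new library is indexed.
inline void* synced_dlopen(const char* path, int flags) {
    void* handle = dlopen(path, flags);
    sync_unwinder();
    return handle;
}

// dlclose followed by sync, so no later dlopen can run before the
// removed library has been dropped from the index.
inline int synced_dlclose(void* handle) {
    int rc = dlclose(handle);
    sync_unwinder();
    return rc;
}
```

The application would call sync_unwinder() once early in main and use
the two wrappers instead of calling dlopen and dlclose directly.
Libraries loaded behind the application's back (for example by a
third-party component) are not covered by this sketch.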
Note that the current implementation is quite conservative, as it
leaves the ehframe tables as they are and just indexes the tables
themselves with a b-tree. This means that the b-tree will usually
have a few entries per shared library. An alternative design
would be to store the content of the ehframe tables in the b-tree,
just as with dynamic frames from JITed code. This would make unwinding
even faster, but would consume more memory, as we would have one entry
per function instead of one entry per library.
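
To illustrate the per-library granularity, here is a minimal sketch of
an index that maps a program counter to the table covering it. It uses
std::map as a stand-in for the b-tree and is single-threaded; the
names TableRef and LibraryIndex are hypothetical and do not appear in
this commit.

```cpp
#include <cstdint>
#include <map>

// Hypothetical reference to one unwind table: the end of the address
// range it covers and an opaque pointer to the table itself.
struct TableRef {
    std::uintptr_t range_end;
    const void* table;
};

// Hypothetical index keyed by range start; ranges must not overlap.
class LibraryIndex {
    std::map<std::uintptr_t, TableRef> by_start_;

public:
    void add(std::uintptr_t begin, std::uintptr_t end, const void* table) {
        by_start_[begin] = TableRef{end, table};
    }
    void remove(std::uintptr_t begin) { by_start_.erase(begin); }

    // Find the table whose range [begin, end) contains pc, or nullptr.
    const void* find(std::uintptr_t pc) const {
        auto it = by_start_.upper_bound(pc);  // first range starting after pc
        if (it == by_start_.begin()) return nullptr;
        --it;                                 // last range starting at or before pc
        return pc < it->second.range_end ? it->second.table : nullptr;
    }
};
```

With one entry per table, the index stays small and the per-function
search still happens inside the table that find returns. The
alternative design would instead insert one range per function, which
removes that second search at the cost of a much larger index.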