This is a follow-on to D49217 which simplifies and optimises the
implementation of the segmented array. In this patch we co-locate the
book-keeping for segments in the __xray::Array<T> with the data it's
managing. We take the chance in this patch to actually rename Chunk to
Segment to better align with the high-level description of the
segmented array.
With measurements using benchmarks landed in D48879, we've identified
that calls to pthread_getspecific started dominating the cycles, which
led us to revert the change made in D49217 to use C++ thread_local
initialisation instead (it reduces the cost by a huge margin, since we
save one PLT-based call to pthread functions in the hot path). In
particular, this is in __xray::getThreadLocalData().
We also took the opportunity to remove the least-common-multiple based
calculation and instead pack as much data into segments of the array.
This greatly simplifies the API of the container which hides as much of
the implementation details as possible. For instance, we calculate the
number of elements we need for the each segment internally in the Array
instead of making it part of the type.
With the changes here, we're able to get a measurable improvement on the
performance of profiling mode on top of what D48879 already provides.
Depends on D48879.