[XRay][compiler-rt] Segmented Array: Simplify and Optimise
ClosedPublic

Authored by dberris on Jul 16 2018, 12:26 AM.

Details

Summary

This is a follow-on to D49217 that simplifies and optimises the
implementation of the segmented array. In this patch we co-locate the
book-keeping for segments in the __xray::Array<T> with the data it's
managing. We take the chance in this patch to rename Chunk to
Segment, to better align with the high-level description of the
segmented array.
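
As an illustration of what co-locating the book-keeping with the data can look like, here is a minimal sketch. It is not the code from this patch: SegmentHeader, SegmentBytes, dataOffset, segmentData and newSegment are hypothetical names, and the 4096-byte segment size is an assumption chosen only for the example. The point is that the links between segments sit at the front of the same allocation that stores the elements, so a single allocation serves both.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical header: the links between segments live at the front of the
// same block of memory that stores the elements.
struct SegmentHeader {
  SegmentHeader *Prev = nullptr;
  SegmentHeader *Next = nullptr;
};

// Assumed total size of one segment allocation (header plus element storage).
constexpr std::size_t SegmentBytes = 4096;

// Offset of the first element: the header size rounded up to T's alignment.
template <class T> constexpr std::size_t dataOffset() {
  return (sizeof(SegmentHeader) + alignof(T) - 1) / alignof(T) * alignof(T);
}

// Pointer to the element storage that follows the header in the same block.
template <class T> T *segmentData(SegmentHeader *S) {
  return reinterpret_cast<T *>(reinterpret_cast<unsigned char *>(S) +
                               dataOffset<T>());
}

// One allocation covers both the book-keeping and the data.
// Returns nullptr if the allocation fails.
inline SegmentHeader *newSegment() {
  void *Raw = std::malloc(SegmentBytes);
  return Raw ? new (Raw) SegmentHeader() : nullptr;
}
```

A caller would obtain a segment with newSegment(), reach its elements through segmentData<T>(S), and release the whole block with std::free(S).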

Measurements using the benchmarks landed in D48879 showed that calls to
pthread_getspecific had started to dominate the cycles. This led us to
revert the change made in D49217 and use C++ thread_local
initialisation instead, which reduces the cost by a huge margin since we
save one PLT-based call to pthread functions in the hot path. In
particular, this is in __xray::getThreadLocalData().
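
To make the trade-off concrete, the sketch below contrasts the two ways of reaching per-thread state. It is only an illustration under stated assumptions: ThreadData, getViaPthread and getViaThreadLocal are hypothetical names and do not reflect the actual contents of __xray::getThreadLocalData(). The pthread variant performs a library call on every access, while the thread_local variant lets the compiler address the object directly, usually without a call to the pthread functions.

```cpp
#include <pthread.h>

#include <cassert>
#include <cstdint>

// Hypothetical per-thread state; the real XRay structure is different.
struct ThreadData {
  std::uint64_t Counter = 0;
};

// Variant A: POSIX thread-specific storage. Each access calls
// pthread_getspecific, which is an out-of-line library function.
static pthread_key_t Key;
static pthread_once_t KeyOnce = PTHREAD_ONCE_INIT;

static void destroyThreadData(void *P) { delete static_cast<ThreadData *>(P); }
static void makeKey() { pthread_key_create(&Key, destroyThreadData); }

inline ThreadData &getViaPthread() {
  pthread_once(&KeyOnce, makeKey);
  void *P = pthread_getspecific(Key);
  if (P == nullptr) {
    // First access on this thread: allocate and register the state.
    P = new ThreadData();
    pthread_setspecific(Key, P);
  }
  return *static_cast<ThreadData *>(P);
}

// Variant B: C++11 thread_local. The object is created on first use in each
// thread, and later accesses do not go through the pthread API.
inline ThreadData &getViaThreadLocal() {
  thread_local ThreadData TLD;
  return TLD;
}
```

Both functions hand back a reference to state that is private to the calling thread, so call sites look the same; only the cost of the lookup differs.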

We also took the opportunity to remove the least-common-multiple based
calculation and instead pack as much data as possible into each segment
of the array. This greatly simplifies the API of the container, which now
hides as many of the implementation details as possible. For instance, we
calculate the number of elements we need for each segment internally in
the Array instead of making it part of the type.
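
A minimal sketch of computing the per-segment element count internally follows. The fixed SegmentBytes budget, the HeaderBytes figure, and the function names are assumptions made for this example, not the constants or formula used by the patch. Given a byte budget per segment, the count is whatever fits after the header, so callers never need to spell it out as a template parameter.

```cpp
#include <cassert>
#include <cstddef>

// Assumed byte budget for one segment, header included.
constexpr std::size_t SegmentBytes = 4096;

// Assumed header size: two pointers linking neighbouring segments.
constexpr std::size_t HeaderBytes = 2 * sizeof(void *);

// Offset of the first element: the header size rounded up to T's alignment.
template <class T> constexpr std::size_t dataOffset() {
  return (HeaderBytes + alignof(T) - 1) / alignof(T) * alignof(T);
}

// Number of T objects that fit in the space left after the header. This is
// derived from T alone, so it does not have to appear in the container type.
template <class T> constexpr std::size_t elementsPerSegment() {
  return (SegmentBytes - dataOffset<T>()) / sizeof(T);
}

// A segment that cannot hold even one element would be useless.
static_assert(elementsPerSegment<int>() > 0,
              "segment too small for a single int");
```

With this shape a container only needs T as its template parameter, and elementsPerSegment<T>() stays an internal detail.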

With the changes here, we're able to get a measurable improvement on the
performance of profiling mode on top of what D48879 already provides.

Depends on D48879.

Event Timeline

dberris created this revision. Jul 16 2018, 12:26 AM

For reference, here's one run of this:

Run on (48 X 3500 MHz CPU s)
2018-07-16 16:52:42
---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_XRayProfilingDeepCallStack/1/real_time/threads:1          210 ns        210 ns    3308489
BM_XRayProfilingDeepCallStack/1/real_time/threads:2          178 ns        356 ns    5002350
BM_XRayProfilingDeepCallStack/1/real_time/threads:4          145 ns        580 ns    6354504
BM_XRayProfilingDeepCallStack/1/real_time/threads:8          104 ns        830 ns    9115392
BM_XRayProfilingDeepCallStack/1/real_time/threads:16         185 ns       2953 ns   11361168
BM_XRayProfilingDeepCallStack/1/real_time/threads:32         131 ns       4174 ns    3908320
BM_XRayProfilingDeepCallStack/2/real_time/threads:1          300 ns        300 ns    2346595
BM_XRayProfilingDeepCallStack/2/real_time/threads:2          214 ns        428 ns    3257884
BM_XRayProfilingDeepCallStack/2/real_time/threads:4          158 ns        634 ns    4000000
BM_XRayProfilingDeepCallStack/2/real_time/threads:8          111 ns        887 ns    8249424
BM_XRayProfilingDeepCallStack/2/real_time/threads:16         106 ns       1693 ns    9066288
BM_XRayProfilingDeepCallStack/2/real_time/threads:32         152 ns       4849 ns    5675424
BM_XRayProfilingDeepCallStack/4/real_time/threads:1          493 ns        493 ns    1364467
BM_XRayProfilingDeepCallStack/4/real_time/threads:2          319 ns        638 ns    2303518
BM_XRayProfilingDeepCallStack/4/real_time/threads:4          248 ns        990 ns    2819940
BM_XRayProfilingDeepCallStack/4/real_time/threads:8          205 ns       1641 ns    5950176
BM_XRayProfilingDeepCallStack/4/real_time/threads:16         136 ns       2173 ns    7368064
BM_XRayProfilingDeepCallStack/4/real_time/threads:32         164 ns       5236 ns    4473248
BM_XRayProfilingDeepCallStack/8/real_time/threads:1          844 ns        844 ns     801574
BM_XRayProfilingDeepCallStack/8/real_time/threads:2          499 ns        998 ns    1443832
BM_XRayProfilingDeepCallStack/8/real_time/threads:4          347 ns       1389 ns    2580568
BM_XRayProfilingDeepCallStack/8/real_time/threads:8          222 ns       1779 ns    4014088
BM_XRayProfilingDeepCallStack/8/real_time/threads:16         180 ns       2878 ns    5293888
BM_XRayProfilingDeepCallStack/8/real_time/threads:32         199 ns       6356 ns    3462080
BM_XRayProfilingDeepCallStack/16/real_time/threads:1        1616 ns       1616 ns     436321
BM_XRayProfilingDeepCallStack/16/real_time/threads:2         901 ns       1802 ns     785188
BM_XRayProfilingDeepCallStack/16/real_time/threads:4         537 ns       2146 ns    1450452
BM_XRayProfilingDeepCallStack/16/real_time/threads:8         324 ns       2589 ns    2484176
BM_XRayProfilingDeepCallStack/16/real_time/threads:16        265 ns       4239 ns    2503008
BM_XRayProfilingDeepCallStack/16/real_time/threads:32        232 ns       7386 ns    3167712
BM_XRayProfilingDeepCallStack/32/real_time/threads:1        3168 ns       3168 ns     222286
BM_XRayProfilingDeepCallStack/32/real_time/threads:2        1671 ns       3341 ns     422232
BM_XRayProfilingDeepCallStack/32/real_time/threads:4         941 ns       3764 ns     722448
BM_XRayProfilingDeepCallStack/32/real_time/threads:8         549 ns       4393 ns    1367408
BM_XRayProfilingDeepCallStack/32/real_time/threads:16        389 ns       6229 ns    1802800
BM_XRayProfilingDeepCallStack/32/real_time/threads:32        314 ns      10027 ns    2194848
BM_XRayProfilingDeepCallStack/64/real_time/threads:1        6237 ns       6236 ns     111448
BM_XRayProfilingDeepCallStack/64/real_time/threads:2        3251 ns       6501 ns     213690
BM_XRayProfilingDeepCallStack/64/real_time/threads:4        1752 ns       7009 ns     392108
BM_XRayProfilingDeepCallStack/64/real_time/threads:8         989 ns       7914 ns     742304
BM_XRayProfilingDeepCallStack/64/real_time/threads:16        653 ns      10454 ns    1098144
BM_XRayProfilingDeepCallStack/64/real_time/threads:32        451 ns      14443 ns    1472192

Comparing against a run without this change but with D49217:

Run on (48 X 3500 MHz CPU s)                                                                
2018-07-12 17:01:35                                                                         
---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_XRayProfilingDeepCallStack/1/real_time/threads:1          202 ns        202 ns    3477313
BM_XRayProfilingDeepCallStack/1/real_time/threads:2          179 ns        357 ns    4581178
BM_XRayProfilingDeepCallStack/1/real_time/threads:4          144 ns        577 ns    5875828
BM_XRayProfilingDeepCallStack/1/real_time/threads:8          125 ns       1003 ns    8311456                               
BM_XRayProfilingDeepCallStack/1/real_time/threads:16         174 ns       2792 ns    9522368
BM_XRayProfilingDeepCallStack/1/real_time/threads:32         146 ns       4687 ns    6358400
BM_XRayProfilingDeepCallStack/2/real_time/threads:1          295 ns        295 ns    2359216
BM_XRayProfilingDeepCallStack/2/real_time/threads:2          239 ns        478 ns    2888910
BM_XRayProfilingDeepCallStack/2/real_time/threads:4          160 ns        638 ns    5410336
BM_XRayProfilingDeepCallStack/2/real_time/threads:8          125 ns        999 ns    7721696
BM_XRayProfilingDeepCallStack/2/real_time/threads:16         103 ns       1647 ns    5126384
BM_XRayProfilingDeepCallStack/2/real_time/threads:32         131 ns       4153 ns    5427136
BM_XRayProfilingDeepCallStack/4/real_time/threads:1          490 ns        490 ns    1326060
BM_XRayProfilingDeepCallStack/4/real_time/threads:2          367 ns        734 ns    2276550                                          
BM_XRayProfilingDeepCallStack/4/real_time/threads:4          249 ns        994 ns    3981604                   
BM_XRayProfilingDeepCallStack/4/real_time/threads:8          174 ns       1394 ns    5467368
BM_XRayProfilingDeepCallStack/4/real_time/threads:16         129 ns       2057 ns    4399568
BM_XRayProfilingDeepCallStack/4/real_time/threads:32         148 ns       4718 ns    4695104
BM_XRayProfilingDeepCallStack/8/real_time/threads:1          873 ns        873 ns     788744
BM_XRayProfilingDeepCallStack/8/real_time/threads:2          535 ns       1071 ns    1177912
BM_XRayProfilingDeepCallStack/8/real_time/threads:4          339 ns       1354 ns    2235540
BM_XRayProfilingDeepCallStack/8/real_time/threads:8          256 ns       2051 ns    3818424                             
BM_XRayProfilingDeepCallStack/8/real_time/threads:16         208 ns       3323 ns    4687040
BM_XRayProfilingDeepCallStack/8/real_time/threads:32         211 ns       6751 ns    3579136
BM_XRayProfilingDeepCallStack/16/real_time/threads:1        1652 ns       1652 ns     414737
BM_XRayProfilingDeepCallStack/16/real_time/threads:2         975 ns       1950 ns     785698
BM_XRayProfilingDeepCallStack/16/real_time/threads:4         601 ns       2402 ns    1400136
BM_XRayProfilingDeepCallStack/16/real_time/threads:8         365 ns       2918 ns    2308440
BM_XRayProfilingDeepCallStack/16/real_time/threads:16        313 ns       5003 ns    1600000
BM_XRayProfilingDeepCallStack/16/real_time/threads:32        256 ns       8177 ns    3033056
BM_XRayProfilingDeepCallStack/32/real_time/threads:1        3419 ns       3418 ns     209959
BM_XRayProfilingDeepCallStack/32/real_time/threads:2        1858 ns       3716 ns     405304
BM_XRayProfilingDeepCallStack/32/real_time/threads:4        1051 ns       4204 ns     690604
BM_XRayProfilingDeepCallStack/32/real_time/threads:8         611 ns       4890 ns    1233168
BM_XRayProfilingDeepCallStack/32/real_time/threads:16        425 ns       6798 ns    1634992
BM_XRayProfilingDeepCallStack/32/real_time/threads:32        336 ns      10737 ns    1958368
BM_XRayProfilingDeepCallStack/64/real_time/threads:1        6438 ns       6438 ns     105337                    
BM_XRayProfilingDeepCallStack/64/real_time/threads:2        3432 ns       6864 ns     197488
BM_XRayProfilingDeepCallStack/64/real_time/threads:4        2477 ns       9906 ns     376460
BM_XRayProfilingDeepCallStack/64/real_time/threads:8        1069 ns       8547 ns     578224
BM_XRayProfilingDeepCallStack/64/real_time/threads:16        684 ns      10949 ns    1079040
BM_XRayProfilingDeepCallStack/64/real_time/threads:32        482 ns      15417 ns    1298176

grandinj added inline comments.
compiler-rt/lib/xray/xray_segmented_array.h
34 ↗(On Diff #155623)

Parameter N seems to be dead?

dberris updated this revision to Diff 155802. Jul 16 2018, 6:24 PM
  • fixup: remove static assert on size and remove outdated comment
dberris marked an inline comment as done. Jul 16 2018, 6:25 PM
dberris added inline comments.
compiler-rt/lib/xray/xray_segmented_array.h
34 ↗(On Diff #155623)

Good catch, thanks!

This revision was not accepted when it landed; it landed in state Needs Review. Jul 17 2018, 7:13 PM
This revision was automatically updated to reflect the committed changes.
dberris marked an inline comment as done.
Herald added a subscriber: Restricted Project. Jul 17 2018, 7:13 PM