This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Improve the scalability of the shared TSD model
ClosedPublic

Authored by cryptoad on May 23 2018, 3:05 PM.

Details

Summary

The shared TSD model in its current form doesn't scale. Here is an example of
rpc2-benchmark (with default parameters, which is threading-heavy) on a 72-core
machine (defaulting to a CompactSizeClassMap and no Quarantine):

  • with tcmalloc: 337K reqs/sec, peak RSS of 338MB;
  • with scudo (exclusive): 321K reqs/sec, peak RSS of 637MB;
  • with scudo (shared): 241K reqs/sec, peak RSS of 324MB.

This isn't great, since the exclusive model uses a lot of memory, while the
shared model doesn't even come close to being competitive.

This is mostly because we always scan the TSD pool starting at index 0 for an
available TSD, which can result in a lot of failed lock attempts and in touching
memory that need not be touched.

This CL attempts to make things better in most situations:

  • first, use a thread-local variable on Linux (instead of pthread APIs) to store the current TSD in the shared model;
  • move the locking boolean out of the TSD: this allows the compiler to use a register and potentially optimize out a branch instead of reading it from the TSD every time (we also save a tiny bit of memory per TSD);
  • 64-bit atomic operations on 32-bit ARM platforms happen to be expensive, so store the Precedence in a uptr instead of a u64. We lose some nanoseconds of precision and we'll wrap around at some point, but the benefit is worth it;
  • change a CHECK to a DCHECK: this should never happen, but if something is ever terribly wrong, we'll crash on a near null AV if the TSD happens to be null;
  • based on an idea by dvyukov@, we are implementing a bounded random scan for an available TSD (see the sketch after this list). This requires computing the coprimes for the number of TSDs, and attempting to lock up to 4 TSDs in a random order before falling back to the current one. This is obviously slightly more expensive when we have just 2 TSDs (barely noticeable) but is otherwise beneficial. The Precedence still basically corresponds to the moment of the first contention on a TSD. To seed the random choice, we use the precedence of the current TSD since it is very likely to be non-zero (as we are in the slow path after a failed tryLock).
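
For illustration, here is a heavily simplified, self-contained sketch of this kind of bounded random scan. This is not the code from the CL: the TSD structure, the pool size, the PRNG, and the helper names are all made up for the example; scudo's actual TSD holds an allocator cache and uses its own primitives.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical TSD: a mutex guarding a per-thread cache, plus a uptr-sized
// Precedence recording the moment of the first failed tryLock on this TSD.
struct TSD {
  std::mutex Mutex;
  std::atomic<uintptr_t> Precedence{0};

  static uintptr_t monotonicTime() {
    return static_cast<uintptr_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
  }
  bool tryLock() {
    if (Mutex.try_lock())
      return true;
    if (Precedence.load(std::memory_order_relaxed) == 0)
      Precedence.store(monotonicTime(), std::memory_order_relaxed);
    return false;
  }
  void lock() {
    Mutex.lock();
    Precedence.store(0, std::memory_order_relaxed);
  }
};

constexpr uint32_t NumTSDs = 32;  // pool size, chosen arbitrarily for the example
TSD TSDs[NumTSDs];
uint32_t CoPrimes[NumTSDs];       // numbers coprime with NumTSDs, filled at init
uint32_t NumCoPrimes = 0;

// On Linux, the current TSD is cached in an ELF TLS variable rather than
// retrieved through pthread_getspecific.
thread_local TSD *CurrentTSD = nullptr;

void initCoPrimes() {             // must run once before any slow-path call
  for (uint32_t I = 1; I <= NumTSDs; I++) {
    uint32_t A = I, B = NumTSDs;
    while (B) { const uint32_t T = A % B; A = B; B = T; }
    if (A == 1)                   // gcd(I, NumTSDs) == 1
      CoPrimes[NumCoPrimes++] = I;
  }
}

uint32_t rand32(uint32_t *State) {  // stand-in xorshift PRNG
  *State ^= *State << 13; *State ^= *State >> 17; *State ^= *State << 5;
  return *State;
}

// Slow path, entered after tryLock on CurrentTSD failed on the fast path.
TSD *getTSDAndLockSlow() {
  // Seed with the current TSD's Precedence: since we just failed a tryLock,
  // it is very likely to be non-zero already.
  uint32_t State = static_cast<uint32_t>(
      CurrentTSD->Precedence.load(std::memory_order_relaxed));
  if (State == 0) State = 1;
  const uint32_t R = rand32(&State);
  uint32_t Index = R % NumTSDs;                    // random starting point...
  const uint32_t Inc = CoPrimes[R % NumCoPrimes];  // ...and a coprime stride
  // Try up to 4 TSDs in pseudo-random order; the coprime stride guarantees
  // successive indices visit distinct slots before wrapping around.
  for (uint32_t I = 0; I < 4; I++) {
    TSD *Candidate = &TSDs[Index];
    if (Candidate->tryLock()) {
      CurrentTSD = Candidate;
      return Candidate;
    }
    Index = (Index + Inc) % NumTSDs;
  }
  // All attempts failed: block on the current TSD.
  CurrentTSD->lock();
  return CurrentTSD;
}
```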

With those modifications, the benchmark yields:

  • with scudo (shared): 330K reqs/sec, peak RSS of 327MB.

So the shared model for this specific situation not only becomes competitive but
outperforms the exclusive model. I experimented with some values greater than 4
for the number of TSDs to attempt to lock and it yielded a decrease in QPS. Just
sticking with the current TSD is also a tad slower. Numbers on platforms with
fewer cores (eg: Android) remain similar.

Diff Detail

Event Timeline

cryptoad created this revision.May 23 2018, 3:05 PM
alekseyshl accepted this revision.May 25 2018, 3:26 PM
This revision is now accepted and ready to land.May 25 2018, 3:26 PM
dvyukov added inline comments.May 26 2018, 1:28 AM
lib/scudo/scudo_tsd_shared.cpp
70–83

The comment from the change description as to why we use precedence as rand state should be duplicated as a comment here. This is confusing.

72

Do you see getTSDAndLockSlow in profiles? Here we have 2 divisions. Divisions are expensive. We could replace them with multiplication+shift if they show up in profiles.
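
(For reference, the multiplication+shift trick mentioned here can look like the following. This is a generic sketch, not code from the CL, for a divisor that is only known at init time, such as the number of TSDs; for compile-time-constant divisors the compiler already performs this strength reduction on its own.)

```cpp
#include <cstdint>

// Precompute once at init, when the divisor N (e.g. the number of TSDs) is known.
struct FastMod {
  uint32_t N;
  uint64_t M;  // floor(2^32 / N)
  explicit FastMod(uint32_t Divisor)
      : N(Divisor), M((uint64_t(1) << 32) / Divisor) {}
  uint32_t mod(uint32_t X) const {
    // Estimate the quotient with a multiply and a shift instead of a divide.
    const uint32_t Q = static_cast<uint32_t>((uint64_t(X) * M) >> 32);
    uint32_t R = X - Q * N;
    // The estimate can be low by one, so a single correction is enough.
    if (R >= N)
      R -= N;
    return R;
  }
};
```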

77

Have you tried dropping precedence entirely and then just going over all TSDs pseudo-randomly once (or maybe twice) and tryLock'ing them? If all tryLock's fail then we could lock the old one, or maybe a random one.
That would shave a few cycles from the malloc/free fast paths. The benefit of the precedence scheme is still unproven, right?
Also, if we have a hundred cores/caches, then giving up after just 4 can lead to unnecessary blocking (there can still be dozens of available caches).

cryptoad planned changes to this revision.May 28 2018, 6:11 PM

I will experiment with Dmitry's suggestions and report back with results.

Here are some answers to Dmitry's requests:

  • Regarding getTSDAndLockSlow and the division: pprof shows no significant time spent in the function outside of the tryLock & lock so I think we are good here;
  • Regarding the precedence: I tested a version where I dropped it entirely, results are mixed:
    • For Android's "improved" memory_replay: it is faster in all cases, but we only have 2 caches for that specific platform (due to memory constraints compared to the default allocator);
    • For rpc2-benchmark: mostly similar numbers;
    • For t-test1: the version with precedence shows better performance in almost all situations; this benchmark also demonstrates a slowdown as the number of TSDs scanned in the slow path grows, eg: scanning 4 and slow locking if they all failed to tryLock performs better overall than scanning 32. And this can be a significant slowdown, for example with t-test1 800 40 800000 100000, it's 900s spent in allocation functions vs 1150s. The argument here is that this benchmark only does {de}allocations (& memset) and as such isn't very representative of "real" programs, but it's exercising the most contention on the caches.

I can't seem to get a definitive answer overall, as both the with-precedence and without-precedence versions have win & lose situations.
The only sure thing so far is that both are better than the current version.

I am open to suggestions or potential improvements; otherwise I'd keep the current version of the CL (and will address the review comments).

Exactly what versions did you test? Current vs removing precedence and tryLocking all caches? If so, it's unclear which factor affects performance, as we change 2 things at the same time. So I would also try:

  1. Remove precedence, but scan only 4 random caches, then lock the current cache.
  2. Remove precedence, but scan only 4 random caches, then lock a random cache.
  3. Keep precedence but scan all caches.

I would not expect 3 to be faster, but who knows.
Looking at your results I think it mostly needs to be tested on t-test1, because it should not affect android with 2/4 caches.

Here are more detailed numbers for t-test1.

The machine has 72 cores. We are using the shared TSD version with 32 caches (to exercise some contention).
The numbers are the total time (averaged and rounded over 3 consecutive runs) spent in allocation functions only with 40, then 80 concurrent threads:

  • current upstream: 960s, 3315s
  • with precedence, max 4 caches scanned, lock current: 810s, 3200s (current CL proposed)
  • with precedence, max 4 caches scanned, lock random: 815s, 3125s
  • with precedence, all caches scanned, lock current: 880s, 3940s
  • with precedence, all caches scanned, lock random: 890s, 3755s
  • no precedence, max 4 caches scanned, lock current: 900s, 3365s
  • no precedence, max 4 caches scanned, lock random: 840s, 3300s
  • no precedence, all caches scanned, lock current: 1025s, 3600s
  • no precedence, all caches scanned, lock random: 890s, 3785s

Locking a random cache seems to be beneficial under heavier contention, but not necessarily under lighter contention.
Since I am more interested in striking a middle ground than in targeting heavily contended applications, it looks like the precedence matters, as does not scanning all the caches but limiting ourselves to 4.

dvyukov accepted this revision.Jun 7 2018, 10:47 PM

So your current version is actually the best one.
Thanks for bearing with me.

cryptoad updated this revision to Diff 150521.Jun 8 2018, 8:22 AM
cryptoad marked 3 inline comments as done.

Adding a comment to explain why we use the precedence as the random seed.
Pass the current TSD to getTSDAndLockSlow to avoid an extra getCurrentTSD
call.

This revision is now accepted and ready to land.Jun 8 2018, 8:22 AM
cryptoad updated this revision to Diff 150541.Jun 8 2018, 10:48 AM

Remove the continue clause from the loop when the TSD is the current TSD.
This is greatly beneficial to scenarios (Android for example) where we only
have a small number of TSDs. This doesn't impact larger pools of TSDs.

cryptoad requested review of this revision.Jun 8 2018, 10:50 AM

Thanks for the review Aleksey & Dmitry.
If I could get a final LGTM after the last couple of changes that would be great.

This revision is now accepted and ready to land.Jun 8 2018, 12:23 PM
This revision was automatically updated to reflect the committed changes.