This is an archive of the discontinued LLVM Phabricator instance.

tsan: optimize sync clock memory consumption
ClosedPublic

Authored by dvyukov on Jul 12 2017, 12:55 PM.

Details

Reviewers
kcc
alekseyshl
Summary

This change implements two optimizations of sync clocks that reduce memory consumption:

  1. Use previously unused first level block space to store clock elements.

Currently a clock for 100 threads consumes three 512-byte blocks:

  • 2 second level blocks storing the 64-bit clock elements themselves
  • 1 first level block storing 32-bit indices into the second level blocks

Only 8 bytes of the first level block are actually used (two 32-bit indices).
With this change such a clock consumes only 2 blocks.

  2. Share similar clocks differing only by a single clock entry for the current thread.

When a thread does several release operations on fresh sync objects without
intervening acquire operations (e.g. initialization of several fields in a ctor),
the resulting clocks differ only in the entry for the current thread.
This change reuses a single clock for such release operations. The current
thread's time (which is different for each clock) is stored in dirty entries.
Both optimizations are sketched below.
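
A simplified C++ sketch of the two ideas (names, field widths, and layout
are illustrative only, not the actual tsan data structures):

  #include <cstdint>

  typedef uint32_t u32;
  typedef uint64_t u64;

  const uintptr_t kBlockSize = 512;

  struct ClockElem {
    u64 epoch;  // one thread's position in the vector clock
  };

  // Optimization 1: the first level block is a union. The table of
  // 32-bit indices to second level blocks occupies only a few entries,
  // so the rest of the same 512-byte block can store clock elements
  // directly, saving a block for small and medium-sized clocks.
  struct ClockBlock {
    static const uintptr_t kTableSize = kBlockSize / sizeof(u32);        // 128
    static const uintptr_t kClockCount = kBlockSize / sizeof(ClockElem); // 64
    union {
      u32 table[kTableSize];         // indices of second level blocks
      ClockElem clock[kClockCount];  // inline clock elements
    };
  };

  // Optimization 2: several sync objects released in a row without
  // intervening acquires can share one refcounted clock. The only part
  // that differs between them, the releasing thread's own time, is kept
  // in small per-object "dirty" entries instead of a full clock copy.
  struct Dirty {
    u64 tid;    // thread whose entry is overridden
    u64 epoch;  // that thread's time for this sync object
  };

  struct SyncClock {
    u32 tab_idx;     // index of the shared, refcounted first level block
    Dirty dirty[2];  // overrides layered on top of the shared clock
  };

In this scheme a reader consults the dirty entries first and falls back
to the shared clock for every other thread.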

We are experiencing issues with a large program that eats all 64M clock blocks
(32GB of non-flushable memory) and crashes with a dense allocator overflow.
The maximum number of threads in the program is ~170, which is currently a
particularly unfortunate count (each clock consumes 4 blocks). At the moment
the program crashes after consuming 60+ GB of memory.
The first optimization brings clock block consumption down to ~40M blocks and
allows the program to work. The second optimization further reduces block
consumption to a "modest" 16M blocks (~8GB of RAM) and reduces overall RAM
consumption to ~30GB.

Measurements on another real-world C++ RPC benchmark show an RSS reduction
from 3.491GB to 3.186GB and a modest speedup of ~5%.

A Go parallel client/server HTTP benchmark:
https://github.com/golang/benchmarks/blob/master/http/http.go
shows an RSS reduction from 320MB to 240MB and a few percent speedup.

Event Timeline

dvyukov created this revision. Jul 12 2017, 12:55 PM
alekseyshl accepted this revision. Jul 13 2017, 1:30 PM
alekseyshl added inline comments.
lib/tsan/rtl/tsan_clock.cc
247

but, for simplicity, we currently

263

For my education, why does increasing the refcount need a less strict memory order than decreasing it?

341

threads -> thread

445–514

dirst -> dirty

lib/tsan/rtl/tsan_clock.h
34

Maybe name them get_test_only and get_clean_test_only then?

115

Cachable -> Cacheable

lib/tsan/rtl/tsan_defs.h
102

Curious why the invalid tid has more bits than kTidBits but fewer bits than the tid field in the Dirty struct (22)?

lib/tsan/tests/unit/tsan_clock_test.cc
59

ARRAY_SIZE

This revision is now accepted and ready to land. Jul 13 2017, 1:30 PM
dvyukov updated this revision to Diff 106618. Jul 14 2017, 4:28 AM
dvyukov marked 6 inline comments as done.
dvyukov added inline comments.
lib/tsan/rtl/tsan_clock.cc
263

Because with traditional reference counting a thread can acquire a reference to an object only if it already owns one. So the increment does not actually acquire anything new, and acquire is not a synchronization operation there. The release operation, in contrast, does release the current thread's ownership of the object and is a synchronization operation.
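
A minimal sketch of this classic pattern (illustrative only, not the
actual tsan_clock.cc code):

  #include <atomic>

  struct RefCounted {
    std::atomic<int> refs{1};

    void Ref() {
      // The caller already owns a reference, so the increment does not
      // need to synchronize with anything: relaxed is enough.
      refs.fetch_add(1, std::memory_order_relaxed);
    }

    void Unref() {
      // The decrement gives up this thread's ownership, so it must be
      // a release. The thread that drops the last reference must also
      // acquire, to see all writes made by the other former owners
      // before it destroys the (heap-allocated) object.
      if (refs.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
      }
    }
  };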

lib/tsan/rtl/tsan_defs.h
102

The current version does not work because we now put it into a bitfield with 22 bits (64 - kClkBits), so:

  1. we get warnings about constant truncation;
  2. the constant gets truncated and then no longer compares equal to kInvalidTid after extraction (the field is unsigned, so it is not sign-extended).

0xffff solves both problems.
I changed it to kMaxTid + 1; it's less magical this way.
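
A minimal example of the truncation problem (identifiers and the old
constant value are illustrative assumptions, not quoted from the patch):

  #include <cstdio>

  // Suppose the invalid tid used to be (unsigned)-1 = 0xffffffff,
  // which does not fit in a 22-bit field.
  const unsigned kInvalidTid = static_cast<unsigned>(-1);

  struct Dirty {
    unsigned tid : 22;  // 64 - kClkBits in the real code
  };

  int main() {
    Dirty d;
    d.tid = kInvalidTid;  // warning: constant truncated to 0x3fffff
    // The field is unsigned, so reading it back yields 0x003fffff with
    // no sign extension, and the comparison with 0xffffffff fails.
    printf("%d\n", d.tid == kInvalidTid);  // prints 0
    return 0;
  }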

dvyukov closed this revision. Jul 14 2017, 4:30 AM

Submitted as 308018.