This is an archive of the discontinued LLVM Phabricator instance.

tsan: speed up race deduplication
Closed, Public

Authored by dvyukov on Sep 2 2015, 6:37 AM.

Details

Reviewers
kcc
Summary

Race deduplication code has proved to be a performance bottleneck in the past when suppressions/annotations are used, or when some races are simply left unaddressed. And we still get user complaints about this:
https://groups.google.com/forum/#!topic/thread-sanitizer/hB0WyiTI4e4
ReportRace already has several layers of caching for racy pcs/addresses to make deduplication faster. However, it still takes global mutexes (ThreadRegistry and ReportMutex) during deduplication and also calls mmap/munmap (which take a process-wide semaphore in the kernel); this makes deduplication non-scalable.

This patch moves race deduplication outside of global mutexes and also removes all mmap/munmap calls.
As a result, race_stress.cc with 100 threads and 10000 iterations becomes 30x faster:

before:
real 0m21.673s
user 0m5.932s
sys 0m34.885s

after:
real 0m0.720s
user 0m23.646s
sys 0m1.254s
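The shape of the change can be sketched as follows. This is a minimal illustration with standard mutexes and atomics, not the actual tsan code; ReportRaceSketch, IsDuplicate, and the fixed-size pc cache are hypothetical stand-ins:

  #include <atomic>
  #include <mutex>

  // A small cache of already-reported PCs lets threads filter duplicates
  // with plain atomics before touching any global mutex.
  static std::atomic<unsigned long> reported_pcs[1024];

  static bool IsDuplicate(unsigned long pc) {
    for (auto &slot : reported_pcs) {
      unsigned long v = slot.load(std::memory_order_relaxed);
      if (v == 0 &&
          slot.compare_exchange_strong(v, pc, std::memory_order_relaxed))
        return false;  // claimed a free slot: first report for this pc
      if (v == pc)
        return true;   // already reported (by us or a concurrent thread)
    }
    return false;      // cache full: report rather than silently drop
  }

  static std::mutex report_mtx;  // global, now taken only on the cold path

  void ReportRaceSketch(unsigned long pc) {
    if (IsDuplicate(pc)) return;  // lock-free fast path for duplicate races
    std::lock_guard<std::mutex> lock(report_mtx);
    // ... symbolize and print the report under the mutex ...
  }

With this structure, threads that keep hitting an already-reported race never contend on the global mutex, which is what makes the stress test scale.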

Diff Detail

Event Timeline

dvyukov updated this revision to Diff 33802. Sep 2 2015, 6:37 AM
dvyukov updated this revision to Diff 33803.
dvyukov retitled this revision to "tsan: speed up race deduplication".
dvyukov updated this object.
dvyukov added a reviewer: kcc.
dvyukov added subscribers: eugenis, samsonov, glider, llvm-commits.
kcc added inline comments. Sep 2 2015, 10:33 AM
lib/lsan/lsan_common.cc
601–602

May I ask you to introduce atomic_{load,store}_relaxed to make the code less verbose?
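Such helpers are thin wrappers that bake in the memory order. A minimal sketch of their shape, modeled with std::atomic rather than the sanitizer_common atomic types:

  #include <atomic>

  // The gain is purely brevity: relaxed counter accesses no longer spell
  // out the memory order at every call site.
  template <typename T>
  inline T atomic_load_relaxed(const std::atomic<T> *a) {
    return a->load(std::memory_order_relaxed);
  }

  template <typename T>
  inline void atomic_store_relaxed(std::atomic<T> *a, T v) {
    a->store(v, std::memory_order_relaxed);
  }

  // Usage: a statistics counter that tolerates races.
  std::atomic<unsigned> hit_count;
  void Touch() {
    // Note: a load followed by a store, not an atomic increment; fine for
    // counters where an occasional lost update is acceptable.
    atomic_store_relaxed(&hit_count, atomic_load_relaxed(&hit_count) + 1);
  }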

test/tsan/race_stress.cc
2

Do we still need %deflake?
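For reference, a stress test of this shape plausibly drives the numbers in the summary (a hypothetical reconstruction, not the checked-in race_stress.cc):

  #include <pthread.h>

  // Many threads race on one location, so nearly every report is a
  // duplicate that the deduplication path must absorb.
  const int kThreads = 100;
  const int kIters = 10000;
  int racy = 0;  // written by all threads without synchronization

  void *Thread(void *arg) {
    for (int i = 0; i < kIters; i++)
      racy++;
    return 0;
  }

  int main() {
    pthread_t th[kThreads];
    for (int i = 0; i < kThreads; i++)
      pthread_create(&th[i], 0, Thread, 0);
    for (int i = 0; i < kThreads; i++)
      pthread_join(th[i], 0);
  }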

samsonov added inline comments. Sep 2 2015, 11:10 AM
lib/sanitizer_common/sanitizer_suppressions.h
25

Why is it not atomic_u32?
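The field in question is the per-suppression hit counter. A sketch of the record's shape after the fix, with standard types standing in for the runtime's own atomic_uint32_t and uptr:

  #include <atomic>
  #include <cstdint>

  struct Suppression {
    const char *type = nullptr;               // suppression kind, e.g. "race"
    char *templ = nullptr;                    // match template from the file
    std::atomic<std::uint32_t> hit_count{0};  // bumped without holding a lock
    std::uintptr_t weight = 0;
  };

Making the counter atomic is what allows matches to be recorded outside the global report mutex.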

dvyukov updated this revision to Diff 33848. Sep 2 2015, 12:44 PM

address comments

ptal

lib/lsan/lsan_common.cc
601–602

done

lib/sanitizer_common/sanitizer_suppressions.h
25

done

test/tsan/race_stress.cc
2

done

kcc accepted this revision. Sep 2 2015, 12:46 PM
kcc edited edge metadata.

LGTM

This revision is now accepted and ready to land. Sep 2 2015, 12:46 PM
samsonov added inline comments. Sep 2 2015, 1:43 PM
lib/tsan/rtl/tsan_interface_ann.cc
66–67

I don't see why you're not using atomic_uint32_t here, but leaving this to you.

163–164

Any reason for not hoisting all the loads from race->*counter here?
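Hoisting here means loading each counter into a local exactly once and reusing the locals. A minimal sketch with simplified, hypothetical names (ExpectRace with hitcount/addcount stands in for the annotation bookkeeping in this file):

  #include <atomic>

  struct ExpectRace {
    std::atomic<unsigned> hitcount{0};  // times the expected race fired
    std::atomic<unsigned> addcount{0};  // times the annotation was added
  };

  void CheckMissedRace(ExpectRace *race) {
    // Load each counter up front; the locals are stable snapshots for the
    // rest of the function instead of re-reading the atomics at every use.
    unsigned hits = race->hitcount.load(std::memory_order_relaxed);
    unsigned adds = race->addcount.load(std::memory_order_relaxed);
    if (hits == 0 && adds != 0) {
      // ... report the annotation that never fired, using hits/adds ...
    }
  }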

lib/tsan/rtl/tsan_rtl_report.cc
384–385

Why do you need this change? If you want to improve performance and get rid of mmaps, consider changing __tsan::VarSizeStackTrace to somehow take ownership of the buffer passed into it. Or just make it use Vector internally.

dvyukov updated this revision to Diff 33927. Sep 3 2015, 4:19 AM
dvyukov edited edge metadata.
dvyukov added inline comments.
lib/tsan/rtl/tsan_interface_ann.cc
66–67

Because there is no reason to use atomic_uint32_t.

163–164

done

lib/tsan/rtl/tsan_rtl_report.cc
384–385

__tsan::VarSizeStackTrace is OK. It uses internal_alloc, which is cheap for small blocks.
The issue is exactly with this call: it allocates a huge chunk of memory for the worst case, and internal_alloc just calls mmap for huge blocks. To fix this, we need to not allocate a huge block. That's why I use a growable Vector.
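A minimal sketch of the growable-buffer idea, with malloc/realloc standing in for internal_alloc and the runtime's Vector:

  #include <cstdlib>

  // Growing geometrically on demand means the common case never requests a
  // huge allocation up front, so the mmap fallback for huge blocks is never
  // hit. Assumes trivially copyable elements (as stack-trace entries are);
  // allocation-failure handling is elided for brevity.
  template <typename T>
  class GrowableVector {
   public:
    ~GrowableVector() { free(data_); }
    void PushBack(const T &v) {
      if (size_ == cap_) {
        cap_ = cap_ ? cap_ * 2 : 16;  // geometric growth: amortized O(1) append
        data_ = static_cast<T *>(realloc(data_, cap_ * sizeof(T)));
      }
      data_[size_++] = v;
    }
    unsigned long Size() const { return size_; }
    const T &operator[](unsigned long i) const { return data_[i]; }

   private:
    T *data_ = nullptr;
    unsigned long size_ = 0, cap_ = 0;
  };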

Committed as rev 246758.