This is an archive of the discontinued LLVM Phabricator instance.

tsan: lock internal allocator around fork
ClosedPublic

Authored by dvyukov on Nov 24 2021, 5:47 AM.

Details

Summary

There is a small chance that the internal allocator is locked
at the moment of fork. The new process is then created with the
internal allocator locked, and any attempt to use it will deadlock.
For example, this can happen if we detect a suppressed race in the
parent during fork and then another suppressed race in the child
after the fork.
This becomes much more likely with the new tsan runtime,
as it uses the internal allocator for more things.
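
The general technique described above can be sketched roughly as follows. This is not the code from this revision: it is a minimal illustration that assumes a plain pthread mutex stands in for the internal allocator lock, and every name in it (`g_internal_alloc_mtx`, `InternalAllocDemo`, `ForkWithAllocatorLocked`, `RunForkDemo`) is hypothetical. The idea is that the forking thread takes the lock before calling fork, so no other thread can be holding it when the address space is copied, and then both the parent and the child release their copy.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the runtime's internal allocator lock.
static pthread_mutex_t g_internal_alloc_mtx = PTHREAD_MUTEX_INITIALIZER;
static int g_alloc_count = 0;

// Hypothetical "allocation" that needs the lock, as a real allocator would.
int InternalAllocDemo() {
  int rc = pthread_mutex_lock(&g_internal_alloc_mtx);
  assert(rc == 0);
  (void)rc;
  int n = ++g_alloc_count;
  pthread_mutex_unlock(&g_internal_alloc_mtx);
  return n;
}

// Fork while holding the lock. Because the forking thread owns the lock,
// no other thread can be in the middle of an allocation when the child's
// copy of the mutex is created.
pid_t ForkWithAllocatorLocked() {
  int rc = pthread_mutex_lock(&g_internal_alloc_mtx);
  assert(rc == 0);
  (void)rc;
  pid_t pid = fork();
  // Reached in the parent, in the child, and on fork failure:
  // each process releases its own copy of the lock.
  pthread_mutex_unlock(&g_internal_alloc_mtx);
  return pid;
}

// Returns 0 if the child was able to use the allocator after fork.
int RunForkDemo() {
  InternalAllocDemo();
  pid_t pid = ForkWithAllocatorLocked();
  if (pid < 0) return -1;
  if (pid == 0) {
    // Child: this call would hang if the inherited mutex were still locked.
    InternalAllocDemo();
    _exit(0);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Calling `RunForkDemo()` from a test harness should return 0 when the child can allocate after the fork. A real runtime would hook fork through its own interceptors and use its own mutex type rather than a pthread mutex.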

Event Timeline

dvyukov requested review of this revision. Nov 24 2021, 5:47 AM
dvyukov created this revision.
Herald added a project: Restricted Project. Nov 24 2021, 5:47 AM
Herald added a subscriber: Restricted Project.
melver accepted this revision. Nov 24 2021, 6:15 AM
This revision is now accepted and ready to land. Nov 24 2021, 6:15 AM
This revision was automatically updated to reflect the committed changes.

TSan tests on the buildbot are timing out after this change and https://reviews.llvm.org/D114532. Not sure if there's a deadlock or if it just slowed things down too much.

[31/32] Running ThreadSanitizer tests
-- Testing: 439 tests, 80 workers --
command timed out: 1200 seconds without output running [b'python', b'../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1967.457687

https://lab.llvm.org/buildbot/#/builders/70/builds/14576

I looked at this failure earlier but couldn't tell whether it was a real one or a flake (only one bot failed).
Do we know what's special about this particular bot? Does it use some older OS version or something? I can't find this info anywhere, and the output does not contain much useful information (not even which tests hung)...

Maybe https://reviews.llvm.org/D114597 will somehow fix it...

The bot is green again.