This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Fix concurrent hash table modification in the instrumentation runtime
ClosedPublic

Authored by michoecho on Jul 4 2022, 9:20 AM.

Details

Summary

__bolt_instr_data_dump() does not lock the hash tables when iterating
over them, so the iteration can happen concurrently with a modification
done in another thread, while the table is in an inconsistent state. This
has also been observed in practice, where it caused a segmentation fault.

We fix this by locking the hash tables during iteration. This is done by
taking the lock in forEachElement().
The only other site of iteration, resetCounters(), was already locking the
table correctly before this patch. This patch removes its Lock because the
lock is now taken in the inner forEachElement().

Diff Detail

Event Timeline

michoecho created this revision.Jul 4 2022, 9:20 AM
michoecho requested review of this revision.Jul 4 2022, 9:20 AM
michoecho edited the summary of this revision. (Show Details)Jul 4 2022, 9:21 AM

Why can't we use standard std::lock_guard here?

Note: these bugs were found when applying BOLT's instrumentation to ScyllaDB. After the fix, I haven't observed any further issues.

yota9 added a comment.Jul 4 2022, 9:38 AM

@ayermolo The runtime libs such as the instrumentation runtime must not depend on external libraries like libstdc++. But in our (currently internal) implementation we use __atomic_test_and_set; I think it is the better option here, since it eliminates the inline asm and is architecture-independent.

> Why can't we use standard std::lock_guard here?

The comments in bolt/runtime/instr.cpp say:

// Since this code is intended to be inserted into any executable, we decided to
// make it standalone and do not depend on any external libraries (i.e. language
// support libraries, such as glibc or stdc++).

> @ayermolo The runtime libs such as the instrumentation runtime must not depend on external libraries like libstdc++. But in our (currently internal) implementation we use __atomic_test_and_set; I think it is the better option here, since it eliminates the inline asm and is architecture-independent.

Ah I see. I am all for eliminating inline assembly, but I leave this to @maksfb

yota9 added a subscriber: Elvina.Jul 5 2022, 10:11 AM

@maksfb If __atomic_test_and_set is a good enough variant, @Elvina may prepare this small change for review :)

maksfb accepted this revision.Jul 5 2022, 11:31 AM

@michoecho, thank you for the investigation and the patch. Both changes look good to me.

@yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

This revision is now accepted and ready to land.Jul 5 2022, 11:31 AM
Elvina added a comment.Jul 5 2022, 2:44 PM

> @michoecho, thank you for the investigation and the patch. Both changes look good to me.
>
> @yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

Sure, here it is https://reviews.llvm.org/D129162

michoecho updated this revision to Diff 442531.Jul 6 2022, 5:36 AM
michoecho retitled this revision from [BOLT] Fix concurrency bugs in the instrumentation runtime to [BOLT] Fix concurrent hash table modification in the instrumentation runtime.
michoecho edited the summary of this revision. (Show Details)

I have removed the Mutex fix from the diff, since the alternative fix (https://reviews.llvm.org/D129162) has been merged.
The __bolt_instr_data_dump() fix remains.

maksfb accepted this revision.Jul 6 2022, 11:23 AM
yota9 accepted this revision.Jul 6 2022, 11:27 AM

Thanks!

I don't have commit access. Could you commit the change, please?

yota9 added a comment.Jul 7 2022, 12:42 AM

@michoecho Sure, but we need your git name and email to specify in the patch; since the diff does not contain this information, we need to set it manually.

> Sure, but we need your git name and email to specify in the patch; since the diff does not contain this information, we need to set it manually.

Oh.
Name: Michał Chojnowski
E-mail: michal.chojnowski@scylladb.com
Thanks!

yota9 added a comment.Dec 13 2022, 7:58 AM

For some reason, after this patch I observe random deadlocks when running multiple tests. At the beginning of __bolt_instr_data_dump() I observe that one of the GlobalIndCallCounters may already be locked, and when execution reaches writeIndirectCallProfile -> forEachElement, it deadlocks. Currently I'm unable to say what causes this behaviour; if the patch is reverted, everything works fine. Also, it looks like, since visitIndCallCounter is called by forEachElement with the mutex locked, CallFlowTable->get() would never be able to take the mutex (with TryLock), or would always get stuck in a deadlock if the __bolt_instr_conservative option is used.

> Also, it looks like, since visitIndCallCounter is called by forEachElement with the mutex locked, CallFlowTable->get() would never be able to take the mutex

But these are two different mutexes, aren't they? The lock taken by the forEachElement() surrounding visitIndCallCounter is in GlobalIndCallCounters[I]. The lock in CallFlowTable->get() sits in Ctx.CallFlowTable. Unless you mean some other locks?

> I observe that one of the GlobalIndCallCounters may already be locked, and when execution reaches writeIndirectCallProfile -> forEachElement, it deadlocks.

It seems that GlobalIndCallCounters are only locked in __bolt_instr_data_dump --> writeIndirectCallProfile --> forEachElement, in __bolt_instr_clear_counters --> resetCounters, and in __bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get. So they shouldn't lock each other out. Unless... __bolt_instr_data_dump is called from a signal handler.
I see that Lock takes care to mask signals while it's locked, but TryLock doesn't. So if a signal comes while a TryLock is held in __bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get, and the signal handler calls __bolt_instr_data_dump, then __bolt_instr_data_dump will get deadlocked, just as you describe.

So, are your deadlocks happening in a signal handler? If yes, then case closed. (But TryLock should be fixed regardless of the answer.)

yota9 added a comment.Jan 7 2023, 10:26 AM

@michoecho
Oh, you are totally right about the locks. I had only briefly checked that part and made a wrong assumption, thank you!
As for TryLock, I totally agree with you that it should basically have the same behaviour as Lock. But I don't see how __bolt_instr_data_dump could be called from a signal handler. Basically, we have two situations in which this function is called: when DT_FINI runs, or (in my situation) when a forked process is waiting for the parent to die. DT_FINI (probably) never gets called from a signal handler, and our fork doesn't have signal handlers.
Plus, when the deadlock occurs, the parent process has already exited normally, so there should be no deadlock at that point... So I'm still confused about why it sometimes happens.