Page MenuHomePhabricator

[BOLT] Fix concurrent hash table modification in the instrumentation runtime
ClosedPublic

Authored by michoecho on Jul 4 2022, 9:20 AM.

Details

Summary

__bolt_instr_data_dump() does not lock the hash tables when iterating
over them, so the iteration can happen concurrently with a modification
done in another thread, when the table is in an inconsistent state. This
also has been observed in practice, when it caused a segmentation fault.

We fix this by locking hash tables during iteration. This is done by taking
the lock in forEachElement().
The only other site of iteration, resetCounters(), has been correctly
locking the table even before this patch. This patch removes its Lock
because the lock is now taken in the inner forEachElement().

Diff Detail

Event Timeline

michoecho created this revision.Jul 4 2022, 9:20 AM
michoecho requested review of this revision.Jul 4 2022, 9:20 AM
michoecho edited the summary of this revision. (Show Details)Jul 4 2022, 9:21 AM

Why can't we use standard std::lock_guard here?

Note: those bugs were found when applying BOLT's instrumentation to ScyllaDB. After the fix, I haven't found any other issues.

yota9 added a comment.Jul 4 2022, 9:38 AM

@ayermolo The runtime libs such as instrumentation must not depend on external libraries like libstdc++. But in our (currently internal) implementation we are using __atomic_test_and_set, I think it is better option here which eliminates inline asm usage and is arch independent

Why can't we use standard std::lock_guard here?

The comments in bolt/runtime/instr.cpp say:

// Since this code is intended to be inserted into any executable, we decided to
// make it standalone and do not depend on any external libraries (i.e. language
// support libraries, such as glibc or stdc++).

@ayermolo The runtime libs such as instrumentation must not depend on external libraries like libstdc++. But in our (currently internal) implementation we are using __atomic_test_and_set, I think it is better option here which eliminates inline asm usage and is arch independent

Ah I see. I am all for eliminating inline assembly, but I leave this to @maksfb

yota9 added a subscriber: Elvina.Jul 5 2022, 10:11 AM

@maksfb If the __atomic_test_and_set is good enough variant @Elvina may prepare this small change for review :)

maksfb accepted this revision.Jul 5 2022, 11:31 AM

@michoecho, thank you for the investigation and the patch. Both changes look good to me.

@yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

This revision is now accepted and ready to land.Jul 5 2022, 11:31 AM
Elvina added a comment.Jul 5 2022, 2:44 PM

@michoecho, thank you for the investigation and the patch. Both changes look good to me.

@yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

Sure, here it is https://reviews.llvm.org/D129162

michoecho updated this revision to Diff 442531.Jul 6 2022, 5:36 AM
michoecho retitled this revision from [BOLT] Fix concurrency bugs in the instrumentation runtime to [BOLT] Fix concurrent hash table modification in the instrumentation runtime.
michoecho edited the summary of this revision. (Show Details)

I have removed the Mutex fix from the diff, since the alternative fix (https://reviews.llvm.org/D129162) has been merged.
The __bolt_instr_data_dump() fix remains.

maksfb accepted this revision.Jul 6 2022, 11:23 AM
yota9 accepted this revision.Jul 6 2022, 11:27 AM

Thanks!

I don't have commit access. Could you commit the change, please?

yota9 added a comment.Jul 7 2022, 12:42 AM

@michoecho Sure, but we need your git name and email to specify it in patch, since diff does not contain this information we need to set it manually

Sure, but we need your git name and email to specify it in patch, since diff does not contain this information we need to set it manually

Oh.
Name: Michał Chojnowski
E-mail: michal.chojnowski@scylladb.com
Thanks!