This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Fix concurrent hash table modification in the instrumentation runtime
ClosedPublic

Authored by michoecho on Jul 4 2022, 9:20 AM.

Details

Summary

__bolt_instr_data_dump() does not lock the hash tables when iterating
over them, so the iteration can happen concurrently with a modification
done in another thread, while the table is in an inconsistent state. This
has also been observed in practice, where it caused a segmentation fault.

We fix this by locking the hash tables during iteration. This is done by
taking the lock in forEachElement().
The only other site of iteration, resetCounters(), was already locking the
table correctly before this patch. This patch removes its Lock because the
lock is now taken in the inner forEachElement().

Diff Detail

Event Timeline

michoecho created this revision.Jul 4 2022, 9:20 AM
michoecho requested review of this revision.Jul 4 2022, 9:20 AM
michoecho edited the summary of this revision. (Show Details)Jul 4 2022, 9:21 AM

Why can't we use standard std::lock_guard here?

Note: these bugs were found when applying BOLT's instrumentation to ScyllaDB. After the fix, I haven't observed any further issues.

yota9 added a comment.Jul 4 2022, 9:38 AM

@ayermolo The runtime libs such as the instrumentation runtime must not depend on external libraries like libstdc++. But in our (currently internal) implementation we use __atomic_test_and_set; I think it is the better option here, since it eliminates the inline asm and is architecture-independent.

> Why can't we use standard std::lock_guard here?

The comments in bolt/runtime/instr.cpp say:

// Since this code is intended to be inserted into any executable, we decided to
// make it standalone and do not depend on any external libraries (i.e. language
// support libraries, such as glibc or stdc++).

> @ayermolo The runtime libs such as the instrumentation runtime must not depend on external libraries like libstdc++. But in our (currently internal) implementation we use __atomic_test_and_set; I think it is the better option here, since it eliminates the inline asm and is architecture-independent.

Ah I see. I am all for eliminating inline assembly, but I leave this to @maksfb

yota9 added a subscriber: Elvina.Jul 5 2022, 10:11 AM

@maksfb If __atomic_test_and_set is a good enough variant, @Elvina may prepare this small change for review :)

maksfb accepted this revision.Jul 5 2022, 11:31 AM

@michoecho, thank you for the investigation and the patch. Both changes look good to me.

@yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

This revision is now accepted and ready to land.Jul 5 2022, 11:31 AM
Elvina added a comment.Jul 5 2022, 2:44 PM

> @michoecho, thank you for the investigation and the patch. Both changes look good to me.
>
> @yota9, @Elvina, great. Could you submit a patch with __atomic_test_and_set()?

Sure, here it is https://reviews.llvm.org/D129162

michoecho updated this revision to Diff 442531.Jul 6 2022, 5:36 AM
michoecho retitled this revision from [BOLT] Fix concurrency bugs in the instrumentation runtime to [BOLT] Fix concurrent hash table modification in the instrumentation runtime.
michoecho edited the summary of this revision. (Show Details)

I have removed the Mutex fix from the diff, since the alternative fix (https://reviews.llvm.org/D129162) has been merged.
The __bolt_instr_data_dump() fix remains.

maksfb accepted this revision.Jul 6 2022, 11:23 AM
yota9 accepted this revision.Jul 6 2022, 11:27 AM

Thanks!

I don't have commit access. Could you commit the change, please?

yota9 added a comment.Jul 7 2022, 12:42 AM

@michoecho Sure, but we need your git name and email to specify in the patch; since the diff does not contain this information, we need to set it manually.

> Sure, but we need your git name and email to specify in the patch; since the diff does not contain this information, we need to set it manually.

Oh.
Name: Michał Chojnowski
E-mail: michal.chojnowski@scylladb.com
Thanks!

yota9 added a comment.Dec 13 2022, 7:58 AM

For some reason, after this patch I observe random deadlocks when running multiple tests. At the beginning of __bolt_instr_data_dump() I observe that one of the GlobalIndCallCounters may already be locked, and when execution reaches writeIndirectCallProfile -> forEachElement, it deadlocks. Currently I'm unable to say what causes this behaviour; if the patch is reverted, everything works fine. Also, it looks like, since visitIndCallCounter is called by forEachElement with the mutex locked, CallFlowTable->get() would never be able to take the mutex (with TryLock), or would always get stuck in a deadlock if the __bolt_instr_conservative option is used.

> Also, it looks like, since visitIndCallCounter is called by forEachElement with the mutex locked, CallFlowTable->get() would never be able to take the mutex

But these are two different mutexes, aren't they? The lock taken by the forEachElement() surrounding visitIndCallCounter is in GlobalIndCallCounters[I]. The lock in CallFlowTable->get() sits in Ctx.CallFlowTable. Unless you mean some other locks?

> I observe that one of the GlobalIndCallCounters may already be locked, and when execution reaches writeIndirectCallProfile -> forEachElement, it deadlocks.

It seems that GlobalIndCallCounters are only locked in __bolt_instr_data_dump --> writeIndirectCallProfile --> forEachElement, in __bolt_instr_clear_counters --> resetCounters, and in __bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get. So they shouldn't lock each other out. Unless... __bolt_instr_data_dump is called from a signal handler.
I see that Lock takes care to mask signals while it's locked, but TryLock doesn't. So if a signal comes while a TryLock is held in __bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get, and the signal handler calls __bolt_instr_data_dump, then __bolt_instr_data_dump will get deadlocked, just as you describe.

So, are your deadlocks happening in a signal handler? If yes, then case closed. (But TryLock should be fixed regardless of the answer.)

yota9 added a comment.Jan 7 2023, 10:26 AM

@michoecho
Oh, you are totally right about the locks. I had only briefly checked that part and made a wrong assumption, thank you!
As for TryLock, I totally agree with you that it should basically have the same behaviour as Lock. But I don't see how __bolt_instr_data_dump could be called from a signal handler. Basically, we have two situations in which this function is called: when DT_FINI runs, or (in my situation) when a forked process is waiting for the parent to die. DT_FINI (probably) never gets called from a signal handler, and our fork doesn't have signal handlers.
Plus, when the deadlock occurs, the parent process has already exited normally, so there should be no deadlock at that point... So I'm still confused about why it sometimes happens.