Details

Reviewers

rafauler
Amir
maksfb

Commits

rG0cc19b564dd3: Reland "[BOLT][Instrumentation] Put Allocator itslef in shared memory by…
rG47934c119ee2: [BOLT][Instrumentation] Add dumping function to instrumentation hash tables
rGf6682ad03f29: [BOLT][Instrumentation] Disallow combining append-pid with sleep-time/wait-forks
rGad4e0770ca7e: [BOLT][Instrumentation] Put Allocator itslef in shared memory by default
rG02c3724d4384: [BOLT][Instrumentation] Don't share counters when using append-pid

Summary

This diff fixes a few related issues:

Shared counters when using instrumentation-file-append-pid.
The point of the append-pid option is to record separate profiles for separate forks, which is impossible when the counters are the same for every process. Sharing them puts the sum of all profiles into every file, and because GlobalWriteProfileMutex is also located in shared memory, some processes are prevented from dumping their data at all. So, in this patch we only map counters as shared when append-pid is not used, and provide a test to ensure that different processes don't pollute each other's profiles.
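The difference between shared and private counters can be reproduced outside BOLT. The following is a minimal sketch, not the runtime's code: it uses plain libc `mmap`/`fork` rather than the runtime's `__mmap`/`__fork` wrappers, and `parentViewAfterChildIncrement` is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Maps one anonymous page holding a counter, forks, lets the child bump the
// counter, and returns the value the parent observes afterwards.
// Hypothetical helper for illustration; not part of the BOLT runtime.
uint64_t parentViewAfterChildIncrement(bool Shared) {
  const int Flags = MAP_ANONYMOUS | (Shared ? MAP_SHARED : MAP_PRIVATE);
  void *Page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, Flags, -1, 0);
  assert(Page != MAP_FAILED && "failed to mmap counter page");
  volatile uint64_t *Counter = static_cast<volatile uint64_t *>(Page);
  *Counter = 0;

  pid_t Pid = fork();
  assert(Pid >= 0 && "fork failed");
  if (Pid == 0) {
    *Counter = *Counter + 1; // the child records one "execution"
    _exit(0);
  }
  int Status = 0;
  waitpid(Pid, &Status, 0);

  const uint64_t Seen = *Counter;
  munmap(Page, 4096);
  return Seen;
}
```

With `Shared == true` the parent sees the child's increment, which is what a common profile needs. With `Shared == false` the child's write goes to its own copy of the page, which is what a per-process profile needs.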

Hash table corruption
In the absence of the instrumentation-file-append-pid option, the global allocator uses shared pages for allocation. However, since the allocator itself is a global variable, it gets COW'd after fork if instrumentation-sleep-time is used, or each time a process forks by itself. This means it hands out the same pages to every process, which causes hash table corruption: different entries overwrite each other, sometimes creating endless cycles. Thus, if we want shared pages, we need to put the allocator itself in a shared page, which we do in this patch in __bolt_instr_setup.
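The corruption mechanism can be sketched with a toy allocator. This is an illustration under simplifying assumptions, not the real BumpPtrAllocator: `ToyBumpAllocator`, `mapPage` and `parentOffsetAfterChildAllocates` are hypothetical names, locking is omitted because the two allocations run one after the other, and the standard placement new from `<new>` is used for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Toy bump allocator: hands out consecutive chunks of a pre-mapped pool.
struct ToyBumpAllocator {
  char *Pool = nullptr;
  size_t Used = 0;
  void *allocate(size_t Size) {
    void *Ptr = Pool + Used;
    Used += Size;
    return Ptr;
  }
};

static void *mapPage(int Visibility) {
  void *P = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | Visibility, -1, 0);
  assert(P != MAP_FAILED && "mmap failed");
  return P;
}

// The pool is always shared. The allocator's bookkeeping lives either in a
// private page (copied on write after fork) or in a shared page. The child
// allocates first, then the parent; returns the offset the parent receives.
size_t parentOffsetAfterChildAllocates(bool SharedState) {
  char *Pool = static_cast<char *>(mapPage(MAP_SHARED));
  void *Storage = mapPage(SharedState ? MAP_SHARED : MAP_PRIVATE);
  ToyBumpAllocator *Alloc = new (Storage) ToyBumpAllocator;
  Alloc->Pool = Pool;

  pid_t Pid = fork();
  assert(Pid >= 0 && "fork failed");
  if (Pid == 0) {
    Alloc->allocate(64); // the child takes bytes [0, 64) of the shared pool
    _exit(0);
  }
  int Status = 0;
  waitpid(Pid, &Status, 0);

  const size_t Offset = static_cast<char *>(Alloc->allocate(64)) - Pool;
  munmap(Storage, 4096);
  munmap(Pool, 4096);
  return Offset;
}
```

When the bookkeeping is private, the parent never learns about the child's allocation and receives offset 0 again, so both processes write to the same bytes of the shared pool. When the bookkeeping is shared, the parent receives offset 64.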

Unexpected/unspecified behavior of instrumentation-file-append-pid combined with instrumentation-{sleep-time,wait-forks}
The point of the instrumentation-sleep-time option is to have a watcher process which shares memory with all other forks and dumps a common profile every n seconds. The append-pid option is the opposite: it should record a private profile of each process. Combining the two suggests that we should get a private profile of each fork every n seconds, but such behavior is not implemented currently and is not easy to implement in general, because we somehow need to intercept each individual fork, launch a watcher process just for that fork, and also map counters so that they're only shared with that single fork. Since we're not doing that, the most reasonable thing to do seems to be disallowing such a combination of options. I can make a separate diff for that if you think it doesn't fit here.
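The rule itself is small enough to state as code. This is only a sketch of the check, assuming a hypothetical `InstrOptions` struct and `validateInstrOptions` function; it does not show how BOLT actually declares or validates its command-line options.

```cpp
#include <string>

// Hypothetical option bundle; in BOLT these are command-line flags.
struct InstrOptions {
  bool AppendPid = false;  // instrumentation-file-append-pid
  unsigned SleepTime = 0;  // instrumentation-sleep-time, 0 means "not set"
  bool WaitForks = false;  // instrumentation-wait-forks
};

// Returns an empty string when the combination is accepted, otherwise a
// message explaining why it is rejected.
std::string validateInstrOptions(const InstrOptions &Opts) {
  if (Opts.AppendPid && (Opts.SleepTime != 0 || Opts.WaitForks))
    return "instrumentation-file-append-pid cannot be combined with "
           "instrumentation-sleep-time or instrumentation-wait-forks";
  return "";
}
```

Rejecting the combination up front keeps the two modes (one common profile with a watcher, or one private profile per process) from silently interfering with each other.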

Also, while debugging all that I created a simple dump() function to understand what's happening, which I include here in case other hash table issues arise.

Diff Detail

Event Timeline

treapster created this revision. Jun 26 2023, 7:37 AM

Herald added a reviewer: rafauler. Jun 26 2023, 7:37 AM

Herald added a reviewer: Amir.

Herald added a reviewer: maksfb.

Herald added a project: Restricted Project.

Herald added a subscriber: ayermolo.

treapster requested review of this revision. Jun 26 2023, 7:37 AM

Herald added a project: Restricted Project. Jun 26 2023, 7:37 AM

Herald added subscribers: llvm-commits, yota9.

treapster edited the summary of this revision. Jun 26 2023, 7:39 AM

treapster edited the summary of this revision.

treapster edited the summary of this revision. Jun 26 2023, 7:42 AM

treapster edited the summary of this revision.

Harbormaster completed remote builds in B241173: Diff 534547. Jun 26 2023, 7:46 AM

treapster edited the summary of this revision. Jun 26 2023, 7:49 AM

For some reason CHECK-ALL-DAG clause does not always work as expected on the second invocation, what may be happening here? I tried using {{}} regex to match the words exactly, but it didn't help

rafauler added inline comments. Jun 26 2023, 2:46 PM

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	For some reason this test is failing on shared build. Build BOLT with BUILD_SHARED_LIBS=On to check that.

In D153771#4449272, @treapster wrote:

For some reason CHECK-ALL-DAG clause does not always work as expected on the second invocation, what may be happening here? I tried using {{}} regex to match the words exactly, but it didn't help

Missed this comment. Yes, it looks flaky.

Thanks for working on fixing these issues, @treapster! I have some suggestions below.

bolt/runtime/instr.cpp
254	report -> dump. See Graph::dump() for an easier way to write (and read) this code. Also include a call to this function inside a DEBUG() macro to showcase how/where you want this printed. If you don't want to always print it in debug, leave it commented out, so at least we know how to use it when we need it.
bolt/test/runtime/instrumentation-indirect-2.c
90–91 ↗	(On Diff #534547)	Here I would rather have a command that moves the child/parent fdata to fixed file names such as:

    mv $t.$child_pid.fdata child.fdata
    mv $t.$par_pid.fdata parent.fdata

The reason is that if we don't do that, each time a developer runs "ninja check-bolt", it will create a new unique file (with a different PID attached to the file name) in the Output folder and won't replace the previous one (the expected behavior), unnecessarily using more disk space over time.
96–97 ↗	(On Diff #534547)	This can accidentally match other entries in the profile, such as the activity in funcX calling printf(). I imagine what we want to match is the indirect call itself, right? For that I would rather use:

    RUN: llvm-bolt %t.exe -data %t.child.fdata \
    RUN:   -print-finalized -print-only=main -o /dev/null | FileCheck %s --check-prefix=CHECK_CHILD

And then match this string:

    {{.*}}: callq *%rax # CallProfile: 8 (0 misses) : { func1: 1 (0 misses) }, { func3: 1 (0 misses) }, ...

Actually, I ran BOLT like that and surprisingly the profile is incorrect and it is printing this:

    .LFT15 (8 instructions, align : 1)
      Exec Count : 8
      CFI State : 3
      Predecessors: .Ltmp10
        000000d1: movslq -0x94(%rbp), %rax
        000000d8: movq -0x90(%rbp,%rax,8), %rax
        000000e0: movl -0x98(%rbp), %edi
        000000e6: callq *%rax # CallProfile: 8 (0 misses) : { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }, { <unknown>: 1 (0 misses) }
        000000e8: movl -0x94(%rbp), %eax
        000000ee: addl $0x2, %eax
        000000f1: movl %eax, -0x94(%rbp)
        000000f7: jmp .Ltmp10
      Successors: .Ltmp10 (mispreds: 0, count: 8)

I'm not sure why yet.

rafauler added inline comments. Jun 26 2023, 4:12 PM

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	The reason this doesn't work is that "func1" also matches "func10". Either remove func10-func16 or rename them with letters, e.g.: funca, funcb, funcc, funcd, funcf, funcg

rafauler mentioned this in D151920: [BOLT] Instrumentation: Fix tests. Jun 26 2023, 5:34 PM

Fix nits

treapster added inline comments. Jun 27 2023, 1:15 AM

bolt/runtime/instr.cpp
254	It is done this way because we want to construct a string once and print it atomically in a single call to write(). If write is called more than once, we get garbage when threads and processes are involved.
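The pattern described in this comment, assembling the whole line in a buffer and emitting it with one call, can be sketched as follows. This is not the dump() from the diff: `appendStr`, `appendNum`, `formatEntry` and `dumpEntry` are hypothetical helpers, the message layout is made up, and libc `write`/`getpid` stand in for the runtime's `__write`/`__getpid`. The bounds checks rely on `assert` and therefore only fire in builds where assertions are enabled.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

// Appends a NUL-terminated string to [Ptr, End) and returns the new cursor.
static char *appendStr(char *Ptr, const char *End, const char *Str) {
  while (*Str) {
    assert(Ptr < End && "message buffer overflow");
    *Ptr++ = *Str++;
  }
  return Ptr;
}

// Appends Num in decimal and returns the new cursor.
static char *appendNum(char *Ptr, const char *End, uint64_t Num) {
  char Digits[20]; // enough for the 20 decimal digits of 2^64 - 1
  int N = 0;
  do {
    Digits[N++] = static_cast<char>('0' + Num % 10);
    Num /= 10;
  } while (Num != 0);
  while (N > 0) {
    assert(Ptr < End && "message buffer overflow");
    *Ptr++ = Digits[--N];
  }
  return Ptr;
}

// Formats "<pid>: <Msg><Key> -> <Val>\n" into Buf and returns its length.
size_t formatEntry(char *Buf, size_t Cap, uint64_t Pid, const char *Msg,
                   uint64_t Key, uint64_t Val) {
  const char *End = Buf + Cap;
  char *Ptr = appendNum(Buf, End, Pid);
  Ptr = appendStr(Ptr, End, ": ");
  Ptr = appendStr(Ptr, End, Msg);
  Ptr = appendNum(Ptr, End, Key);
  Ptr = appendStr(Ptr, End, " -> ");
  Ptr = appendNum(Ptr, End, Val);
  Ptr = appendStr(Ptr, End, "\n");
  return static_cast<size_t>(Ptr - Buf);
}

// Emits the whole line with a single write() call, so that pieces of one
// message are not separated by output from other threads or processes.
void dumpEntry(int Fd, const char *Msg, uint64_t Key, uint64_t Val) {
  char Buf[128];
  const size_t Len = formatEntry(Buf, sizeof(Buf),
                                 static_cast<uint64_t>(getpid()), Msg, Key, Val);
  const ssize_t Written = write(Fd, Buf, Len);
  (void)Written; // best effort: this is debug output
}
```

Keeping the formatting separate from the write makes the buffer contents easy to check on their own, for example `formatEntry(Buf, sizeof(Buf), 42, "entry: ", 7, 3)` fills `Buf` with `42: entry: 7 -> 3` followed by a newline.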

treapster marked an inline comment as done. Jun 27 2023, 1:16 AM

treapster added inline comments. Jun 27 2023, 1:19 AM

bolt/test/runtime/instrumentation-indirect-2.c
96–97 ↗	(On Diff #534547)	If we match instructions, it unnecessarily becomes arch-specific. I think we need to figure out why indirect calls are not recorded in profile and construct a regex to match them there.

Harbormaster completed remote builds in B241395: Diff 534865. Jun 27 2023, 1:20 AM

treapster added inline comments. Jun 27 2023, 1:23 AM

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	I tried matching {{\bfunc1\b}} and similar constructs and it didn't help. I now changed numbers to letters and it still seems to fail sometimes..

treapster edited the summary of this revision. Jun 27 2023, 2:14 AM

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

In D153771#4452452, @treapster wrote:

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

But since .bolt.instr.tables is not allocatable, the only option is to perform relocation by hand. BTW, why isn't it allocatable?

rafauler added inline comments. Jun 27 2023, 1:56 PM

bolt/runtime/instr.cpp
254	Sounds good, can you write a comment explaining that? (regarding the single call to write() )
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	You're right, unfortunately it seems to be failing sometimes. That's annoying. I gave up on the DAG thing and tried this: https://pastebin.com/rgn1hKr9 And it seems to be working consistently. That's one option forward, if you're OK with it.
96–97 ↗	(On Diff #534547)	Sounds good. Matching against the CallProfile annotation (ignoring the instruction opcode) shouldn't be arch-specific, though. But I'm also OK with regex matching fdata, whichever you prefer.

In D153771#4452452, @treapster wrote:

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

Good catch, that looks like a nasty bug in instrumentation for PIE objects. If you like, we can commit this diff forcing the test to be no-pie and then work on the fix for PIE on another diff.

In D153771#4452461, @treapster wrote:

In D153771#4452452, @treapster wrote:

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

But since .bolt.instr.tables is not allocatable, the only option is to perform relocation by hand. BTW, why isn't it allocatable?

Because it is encoded as an ELF note section. That's not allocatable at load time, but our runtime will open the ELF file and read them. I don't remember why exactly I did that, I think at the time my motivation was to avoid as much as possible relying on the linker (RuntimeDyld) resolving all references from code to this table. So we just manually deserialize it instead of encoding as a global object in the binary. When this was written I knew we were abusing RuntimeDyld and that it wouldn't work in a variety of scenarios, that's why I was trying to keep the code as easy on the linker as possible (see bolt/docs/RuntimeLibrary.md - section Limitations).

Anyway, even if we do encode this section (.bolt.instr.tables) as an allocatable data section with dynamic R_X86_64_RELATIVE relocs fixing these addresses, we will likely have to figure out how to generate and insert the .rela section correctly in the binary. Maybe now that we're using JITLink, this task will be easier, I don't know.

Manually adding the PIE load address in instr.cpp:readDescription() might or might not be easier.

treapster added inline comments. Jun 27 2023, 2:51 PM

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	It's ok but it will fire if the order in profile changes. Is it specified currently? Also we probably need to use -sleep-time for the first run, because otherwise the first process may start writing before the second finishes, and the second won't overwrite with newer profile because write mutex is locked. I'll play with it a bit more, it may not be a bug in FileCheck:)

ayermolo added inline comments. Jun 27 2023, 2:58 PM

bolt/runtime/instr.cpp
255	Can you add some kind of check that we are not overflowing Buf?

In D153771#4453993, @rafauler wrote:

Manually adding the PIE load address in instr.cpp:readDescription() might or might not be easier.

The main issue here is we'll have to either map the whole BinContents as RW, or memcpy just the descriptions to another writable region. Then just loop over the entries and patch. Sounds a bit easier than messing with JITLink, though. Another solution may be to just subtract the base address every time inside lookupIndCallTarget.
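The "subtract the base address at lookup time" option can be sketched like this. It is an illustration only: `FuncDesc` and `lookupTarget` are hypothetical, the real lookupIndCallTarget and description layout are not shown in this review, and how the load base is obtained is left out.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical description entry: a function's link-time (static) address.
struct FuncDesc {
  uint64_t StaticAddr;
  const char *Name;
};

// Looks up a runtime call target in a table of static addresses by first
// removing the load base. Returns nullptr when nothing matches. For a
// non-PIE binary LoadBase is 0 and the runtime address is used unchanged.
const char *lookupTarget(const FuncDesc *Table, size_t Size,
                         uint64_t RuntimeAddr, uint64_t LoadBase) {
  const uint64_t StaticAddr = RuntimeAddr - LoadBase;
  for (size_t I = 0; I < Size; ++I)
    if (Table[I].StaticAddr == StaticAddr)
      return Table[I].Name;
  return nullptr;
}
```

Compared with patching every stored entry, this leaves the descriptions read-only at the cost of one subtraction per lookup. A runtime address compared directly against static addresses, as in the PIE case described above, finds no match, which is consistent with the unknown entries seen in the profile.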

rafauler added inline comments. Jun 27 2023, 3:12 PM

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗	(On Diff #534547)	You're right, the order is not guaranteed.

Add comment and buffer overflow assertion

treapster marked 2 inline comments as done. Jun 27 2023, 3:24 PM

Harbormaster completed remote builds in B241622: Diff 535156. Jun 27 2023, 3:45 PM

Fix test, use no-pie executable.

In D153771#4453963, @rafauler wrote:

If you like, we can commit this diff forcing the test to be no-pie and then work on the fix for PIE on another diff.

Done, the test is non-pie now

Harbormaster completed remote builds in B241776: Diff 535379. Jun 28 2023, 7:02 AM

rafauler added inline comments. Jun 28 2023, 4:04 PM

bolt/runtime/instr.cpp
216	void *
1627–1629	Drop the cast:

    GlobalMetadataStorage = __mmap(0, 4096, PROT_READ | PROT_WRITE, (Shared ? MAP_SHARED : MAP_PRIVATE) | MAP_ANONYMOUS, -1, 0);
bolt/test/runtime/instrumentation-indirect-2.c
111 ↗	(On Diff #535379)	Do you really need that? (Lines 111 and 66) I'm testing locally with the sleep timers removed and the test is not failing. If line 65 " Wait for profile and output to be fully written" is the problem, I don't know, perhaps something like https://linux.die.net/man/1/sync? (I don't know because I'm not sure what's happening in your machine, but it looks odd to me that the process will finish but your test script will access a half-finished output file). If the problem is the one described in line 110 "in case child outlives parent", then perhaps using bash's "wait" command? https://phoenixnap.com/kb/bash-wait-command

Amir added inline comments. Jun 28 2023, 5:48 PM

bolt/runtime/common.h
85 ↗	(On Diff #535379)	@rafauler: so why don't we include mman.h?

@treapster: since this particular change is NFC, can you please split it out to reduce functional changes surface and simplify testing and reviewing?

"[The patch should] be an isolated change. Independent changes should be submitted as separate patches as this makes reviewing easier."

https://llvm.org/docs/Contributing.html#how-to-submit-a-patch

rafauler added inline comments. Jun 28 2023, 6:05 PM

bolt/runtime/common.h
85 ↗	(On Diff #535379)	If we include any headers, it will pull the file from the host system, which might not match the target system. I actually prefer this particular patch as is (non-split). In general I think it's harder for me to work reviewing large stacks (unless there are lots of lines of code changed) and I will internally squash a stack of diffs into one for easier testing. But I agree there are many benefits to splitting a diff, so it's fine to me if @treapster wants to do that.

treapster added inline comments. Jun 29 2023, 1:09 AM

bolt/test/runtime/instrumentation-indirect-2.c
111 ↗	(On Diff #535379)	I do get spurious failures if I remove any of the sleeps, and in the second case it always looks like incomplete output. Wait only works for a background job of the shell; in our case the main process has finished and the output file is read by the next commands before the child finishes and writes its output. AFAIK there is no way for us to wait for the whole process tree, so we have to just sleep and hope it's finished. Regarding sync, it is also not what we need, because we don't care whether the file is flushed to disk or not; we just need to wait until a process stops writing to it. Sync does seem to work, probably because it spends enough time flushing for the child process to finish and write everything, but it can still fail at any time. There is another solution, however: for stdout redirection, I imagine we'll have the file open until both processes finish. So we can wait in a loop until `fuser` returns 1, which will guarantee both processes finished. For the instrumentation profile, we'll need to change watchProcess() so that it opens the profile only once and keeps the FD open until it's done, seeking to zero on every iteration. This way we can also query `fuser` on the file to know when it's safe to read it. But this is a topic for another diff :).

treapster mentioned this in D154056: [BOLT][Instrumentation][NFC] define and use mmap constants. Jun 29 2023, 2:18 AM

Fix nits, add mmap check

Harbormaster completed remote builds in B242022: Diff 535701. Jun 29 2023, 3:14 AM

treapster added a parent revision: D154056: [BOLT][Instrumentation][NFC] define and use mmap constants. Jun 29 2023, 3:15 AM

treapster marked 2 inline comments as done. Jun 29 2023, 3:19 AM

treapster added a child revision: D154121: [BOLT][Instrumentation] Fix indirect call profile in PIE. Jun 29 2023, 10:56 AM

LGTM without the test

Just commit the change without the test. Honestly, it's not worth including a test that might fail because the child process is not synced, and I fear sleep wouldn't be a proper sync if the system is heavily loaded. We have too much infrastructure that depends on tests being solid and stable, and the risk of random failures isn't worth the pain. We can work on fixing instrumentation testability on other diffs.

This revision is now accepted and ready to land. Jun 29 2023, 1:29 PM

Closed by commit rG02c3724d4384: [BOLT][Instrumentation] Don't share counters when using append-pid (authored by treapster). Jun 29 2023, 3:05 PM

This revision was automatically updated to reflect the committed changes.

treapster added a commit: rG02c3724d4384: [BOLT][Instrumentation] Don't share counters when using append-pid.

treapster added a commit: rGad4e0770ca7e: [BOLT][Instrumentation] Put Allocator itslef in shared memory by default.

treapster added a commit: rGf6682ad03f29: [BOLT][Instrumentation] Disallow combining append-pid with sleep-time/wait-forks.

treapster added a commit: rG47934c119ee2: [BOLT][Instrumentation] Add dumping function to instrumentation hash tables.

Amir added a reverting change: rG4314f4ceb5c8: Revert "[BOLT][Instrumentation] Put Allocator itslef in shared memory by…. Jun 29 2023, 7:31 PM

This change breaks upstream testing: https://lab.llvm.org/buildbot/#/builders/244/builds/13736. Reverted.

Amir added a reverting change: rGc15e9b6814e5: Revert "[BOLT][Instrumentation] Don't share counters when using append-pid". Jun 29 2023, 7:55 PM

Also reverted append-pid commit due to the breakage in https://lab.llvm.org/buildbot/#/builders/252/builds/2700.

Clang-BOLT still fails: https://lab.llvm.org/buildbot/#/builders/252/builds/2701. Not sure what could be the reason, but started failing with this set of changes.

I'll be reverting one by one until https://lab.llvm.org/buildbot/#/builders/252 passes

The builder is now green at https://lab.llvm.org/buildbot/#/builders/252/builds/2707.
Looks like the last two reverted commits (NFC with defines + asserts) were not related. Sorry about the churn, please reland. The other two appear to be related, or at least one. Lmk if you need help reproducing that build, but it should generally be straightforward based on cmake args from the builder.

Thanks for catching and reverting that, I'll try to reproduce a bit later and report back.

Relanded "Don't share counters when using append-pid" – the crash wasn't in instrumented binary.
So the issue must be in "Put Allocator itslef in shared memory by default".

At first I couldn't reproduce with cmake configured to use clang, but with gcc the runtime library does indeed break. Turns out, the definition for

void *operator new(size_t, void *) noexcept;

which was declared in the problematic commit is not provided by gcc, and in runtimelib its use is replaced with a call and a relocation to an undefined symbol (demangled to operator new(unsigned long, void*)):

0000000000008352  0000005700000004 R_X86_64_PLT32         0000000000000000 _ZnwmPv - 4

For some reason JITLink does not catch it, and after linking the runtime lib to the binary, the call is just left there with zero immediate which causes segfault.
If we provide a definition for operator new, the problem goes away. Now, why I did not provide a definition:

That answer says we only need a declaration and doesn't mention a definition.
The page on cppreference does not list (9) and (10) as replaceable, and in the Notes section only allows a definition in class scope, or in global scope with a non-void pointer type.
The page from that answer explicitly says: These functions are reserved; a C++ program may not define functions that displace the versions in the C++ standard library.

So it led me to believe that these forms of new are implicitly defined in the compiler and that it is UB to define them in global scope. But apparently GCC requires them (or at least the one in question) to be explicitly defined. There is also ambiguity in whether we can call it "displacing the versions in the standard library" when we're not using the standard library: whether UB arises from the definition alone or from the definition when using the standard library. We can probably assume the latter and get away with it, but it is not super clear. So, C++ is being C++ again.

Since we don't need a generic global placement new operator, and defining it in class scope is legal according to cppreference, I decided to move the operator to BumpPtrAllocator scope and define it there.
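A class-scope placement operator of this kind can be sketched in isolation. `MiniAllocator` and `constructIn` are hypothetical stand-ins for BumpPtrAllocator and the setup code; the point is only that the `new (Storage) T` expression resolves to the class's own operator, so the translation unit needs neither `<new>` nor a global definition.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for an allocator that must be constructed in
// caller-provided (possibly shared, mmapped) memory.
struct MiniAllocator {
  uint64_t MaxSize = 0xa00000;

  // Class-scope placement form: returns the caller-provided storage as is.
  // Lookup for "new (Ptr) MiniAllocator" starts in the class scope and finds
  // this operator, so no global placement operator new is referenced.
  static void *operator new(size_t, void *Ptr) noexcept { return Ptr; }
};

// Constructs a MiniAllocator inside Storage, which must be suitably sized
// and aligned, and returns a typed pointer to it.
MiniAllocator *constructIn(void *Storage) {
  return new (Storage) MiniAllocator;
}
```

The object ends up exactly at the address that was passed in, with its default member initializers applied, and no out-of-line call to a global operator new is involved.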

Harbormaster completed remote builds in B242799: Diff 536765. Jul 3 2023, 7:55 AM

treapster mentioned this in D154436: [BOLT][Instrumentation] Keep profile open in WatchProcess. Jul 4 2023, 4:47 AM

Hello @Amir, I think it's ok to reland it now, do you agree?

Looks good on my end. Let's reland and monitor the buildbot.

treapster added a commit: rG0cc19b564dd3: Reland "[BOLT][Instrumentation] Put Allocator itslef in shared memory by…. Jul 7 2023, 6:39 AM

treapster removed a child revision: D154121: [BOLT][Instrumentation] Fix indirect call profile in PIE. Jul 7 2023, 6:43 AM

Diff 536765

bolt/runtime/instr.cpp

	Show First 20 Lines • Show All 91 Lines • ▼ Show 20 Lines
	void setShared(bool S) { Shared = S; }			void setShared(bool S) { Shared = S; }

	void destroy() {			void destroy() {
	if (StackBase == nullptr)			if (StackBase == nullptr)
	return;			return;
	__munmap(StackBase, MaxSize);			__munmap(StackBase, MaxSize);
	}			}

				// Placement operator to construct allocator in possibly shared mmaped memory
				static void *operator new(size_t, void *Ptr) { return Ptr; };

	private:			private:
	static constexpr uint64_t Magic = 0x1122334455667788ull;			static constexpr uint64_t Magic = 0x1122334455667788ull;
	uint64_t MaxSize = 0xa00000;			uint64_t MaxSize = 0xa00000;
	uint8_t *StackBase{nullptr};			uint8_t *StackBase{nullptr};
	uint64_t StackSize{0};			uint64_t StackSize{0};
	bool Shared{false};			bool Shared{false};
	Mutex M;			Mutex M;
	};			};

	/// Used for allocating indirect call instrumentation counters. Initialized by			/// Used for allocating indirect call instrumentation counters. Initialized by
	/// __bolt_instr_setup, our initialization routine.			/// __bolt_instr_setup, our initialization routine.
	BumpPtrAllocator GlobalAlloc;			BumpPtrAllocator *GlobalAlloc;

				// Storage for GlobalAlloc which can be shared if not using
				// instrumentation-file-append-pid.
				void *GlobalMetadataStorage;

	} // anonymous namespace			} // anonymous namespace

	// User-defined placement new operators. We only use those (as opposed to			// User-defined placement new operators. We only use those (as opposed to
	// overriding the regular operator new) so we can keep our allocator in the			// overriding the regular operator new) so we can keep our allocator in the
	// stack instead of in a data section (global).			// stack instead of in a data section (global).
	void *operator new(size_t Sz, BumpPtrAllocator &A) { return A.allocate(Sz); }			void *operator new(size_t Sz, BumpPtrAllocator &A) { return A.allocate(Sz); }
	void *operator new(size_t Sz, BumpPtrAllocator &A, char C) {			void *operator new(size_t Sz, BumpPtrAllocator &A, char C) {
	auto Ptr = reinterpret_cast<char *>(A.allocate(Sz));			auto Ptr = reinterpret_cast<char *>(A.allocate(Sz));
	Show All 16 Lines

	// Disable instrumentation optimizations that sacrifice profile accuracy			// Disable instrumentation optimizations that sacrifice profile accuracy
	extern "C" bool __bolt_instr_conservative;			extern "C" bool __bolt_instr_conservative;

	/// Basic key-val atom stored in our hash			/// Basic key-val atom stored in our hash
	struct SimpleHashTableEntryBase {			struct SimpleHashTableEntryBase {
	uint64_t Key;			uint64_t Key;
	uint64_t Val;			uint64_t Val;
	void dump(const char *Msg = nullptr) {			void dump(const char *Msg = nullptr) {
	// TODO: make some sort of formatting function			// TODO: make some sort of formatting function
	// Currently we have to do it the ugly way because			// Currently we have to do it the ugly way because
	// we want every message to be printed atomically via a single call to			// we want every message to be printed atomically via a single call to
	// __write. If we use reportNumber() and others nultiple times, we'll get			// __write. If we use reportNumber() and others nultiple times, we'll get
	// garbage in mulithreaded environment			// garbage in mulithreaded environment
	char Buf[BufSize];			char Buf[BufSize];
	char *Ptr = Buf;			char *Ptr = Buf;
	Ptr = intToStr(Ptr, __getpid(), 10);			Ptr = intToStr(Ptr, __getpid(), 10);
	*Ptr++ = ':';			*Ptr++ = ':';
	Show All 34 Lines
	uint32_t IncSize = 7>			uint32_t IncSize = 7>
	class SimpleHashTable {			class SimpleHashTable {
	public:			public:
	using MapEntry = T;			using MapEntry = T;

	/// Increment by 1 the value of \p Key. If it is not in this table, it will be			/// Increment by 1 the value of \p Key. If it is not in this table, it will be
	/// added to the table and its value set to 1.			/// added to the table and its value set to 1.
	void incrementVal(uint64_t Key, BumpPtrAllocator &Alloc) {			void incrementVal(uint64_t Key, BumpPtrAllocator &Alloc) {
	++get(Key, Alloc).Val;			if (!__bolt_instr_conservative) {
				TryLock L(M);
				if (!L.isLocked())
				return;
				auto &E = getOrAllocEntry(Key, Alloc);
				++E.Val;
				return;
				}
				Lock L(M);
				auto &E = getOrAllocEntry(Key, Alloc);
				++E.Val;
	}			}

	/// Basic member accessing interface. Here we pass the allocator explicitly to			/// Basic member accessing interface. Here we pass the allocator explicitly to
	/// avoid storing a pointer to it as part of this table (remember there is one			/// avoid storing a pointer to it as part of this table (remember there is one
	/// hash for each indirect call site, so we wan't to minimize our footprint).			/// hash for each indirect call site, so we wan't to minimize our footprint).
	MapEntry &get(uint64_t Key, BumpPtrAllocator &Alloc) {			MapEntry &get(uint64_t Key, BumpPtrAllocator &Alloc) {
	if (!__bolt_instr_conservative) {			if (!__bolt_instr_conservative) {
	TryLock L(M);			TryLock L(M);
	Show All 27 Lines
	template <typename... Args>			template <typename... Args>
	void forEachElement(void (*Callback)(MapEntry &, Args...),			void forEachElement(void (*Callback)(MapEntry &, Args...),
	uint32_t NumEntries, MapEntry *Entries, Args... args) {			uint32_t NumEntries, MapEntry *Entries, Args... args) {
	for (uint32_t I = 0; I < NumEntries; ++I) {			for (uint32_t I = 0; I < NumEntries; ++I) {
	MapEntry &Entry = Entries[I];			MapEntry &Entry = Entries[I];
	if (Entry.Key == VacantMarker)			if (Entry.Key == VacantMarker)
	continue;			continue;
	if (Entry.Key & FollowUpTableMarker) {			if (Entry.Key & FollowUpTableMarker) {
	forEachElement(Callback, IncSize,			MapEntry *Next =
	reinterpret_cast<MapEntry *>(Entry.Key &			reinterpret_cast<MapEntry *>(Entry.Key & ~FollowUpTableMarker);
	~FollowUpTableMarker),			assert(Next != Entries, "Circular reference!");
	args...);			forEachElement(Callback, IncSize, Next, args...);
	continue;			continue;
	}			}
	Callback(Entry, args...);			Callback(Entry, args...);
	}			}
	}			}

	MapEntry &firstAllocation(uint64_t Key, BumpPtrAllocator &Alloc) {			MapEntry &firstAllocation(uint64_t Key, BumpPtrAllocator &Alloc) {
	TableRoot = new (Alloc, 0) MapEntry[InitialSize];			TableRoot = new (Alloc, 0) MapEntry[InitialSize];
	▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
	uint64_t(Entries),			uint64_t(Entries),
	"circular reference created!\n");			"circular reference created!\n");
	// DEBUG(NextLevelTbl[CurEntrySelector].dump("New level entry: "));			// DEBUG(NextLevelTbl[CurEntrySelector].dump("New level entry: "));
	// DEBUG(Entry.dump("Updated old entry: "));			// DEBUG(Entry.dump("Updated old entry: "));
	return getEntry(NextLevelTbl, Key, Remainder, Alloc, CurLevel + 1);			return getEntry(NextLevelTbl, Key, Remainder, Alloc, CurLevel + 1);
	}			}

	MapEntry &getOrAllocEntry(uint64_t Key, BumpPtrAllocator &Alloc) {			MapEntry &getOrAllocEntry(uint64_t Key, BumpPtrAllocator &Alloc) {
	if (TableRoot)			if (TableRoot) {
	return getEntry(TableRoot, Key, Key, Alloc, 0);			MapEntry &E = getEntry(TableRoot, Key, Key, Alloc, 0);
				assert(!(E.Key & FollowUpTableMarker), "Invalid entry!");
				return E;
				}
	return firstAllocation(Key, Alloc);			return firstAllocation(Key, Alloc);
	}			}
	};			};

	template <typename T> void resetIndCallCounter(T &Entry) {			template <typename T> void resetIndCallCounter(T &Entry) {
	Entry.Val = 0;			Entry.Val = 0;
	}			}

	▲ Show 20 Lines • Show All 179 Lines • ▼ Show 20 Lines
	DEBUG(reportNumber("replace mmap stop: ", CountersEnd, 16));			DEBUG(reportNumber("replace mmap stop: ", CountersEnd, 16));
	assert(CountersEnd > CountersStart, "no counters");			assert(CountersEnd > CountersStart, "no counters");

	const bool Shared = !__bolt_instr_use_pid;			const bool Shared = !__bolt_instr_use_pid;
	const uint64_t MapPrivateOrShared = Shared ? MAP_SHARED : MAP_PRIVATE;			const uint64_t MapPrivateOrShared = Shared ? MAP_SHARED : MAP_PRIVATE;

	void *Ret =			void *Ret =
	__mmap(CountersStart, CountersEnd - CountersStart, PROT_READ | PROT_WRITE,			__mmap(CountersStart, CountersEnd - CountersStart, PROT_READ | PROT_WRITE,
	MAP_ANONYMOUS | MapPrivateOrShared | MAP_FIXED, -1, 0);			MAP_ANONYMOUS | MapPrivateOrShared | MAP_FIXED, -1, 0);
	assert(Ret != MAP_FAILED, "__bolt_instr_setup: Failed to mmap counters!");			assert(Ret != MAP_FAILED, "__bolt_instr_setup: Failed to mmap counters!");

	// Conservatively reserve 100MiB shared pages			GlobalMetadataStorage = __mmap(0, 4096, PROT_READ | PROT_WRITE,
	GlobalAlloc.setMaxSize(0x6400000);			MapPrivateOrShared | MAP_ANONYMOUS, -1, 0);
	GlobalAlloc.setShared(Shared);			assert(GlobalMetadataStorage != MAP_FAILED,
	GlobalWriteProfileMutex = new (GlobalAlloc, 0) Mutex();			"__bolt_instr_setup: failed to mmap page for metadata!");

				GlobalAlloc = new (GlobalMetadataStorage) BumpPtrAllocator;
				// Conservatively reserve 100MiB
				GlobalAlloc->setMaxSize(0x6400000);
				GlobalAlloc->setShared(Shared);
				GlobalWriteProfileMutex = new (*GlobalAlloc, 0) Mutex();
	if (__bolt_instr_num_ind_calls > 0)			if (__bolt_instr_num_ind_calls > 0)
	GlobalIndCallCounters =			GlobalIndCallCounters =
	new (GlobalAlloc, 0) IndirectCallHashTable[__bolt_instr_num_ind_calls];			new (*GlobalAlloc, 0) IndirectCallHashTable[__bolt_instr_num_ind_calls];

	if (__bolt_instr_sleep_time != 0) {			if (__bolt_instr_sleep_time != 0) {
	// Separate instrumented process to the own process group			// Separate instrumented process to the own process group
	if (__bolt_instr_wait_forks)			if (__bolt_instr_wait_forks)
	__setpgid(0, 0);			__setpgid(0, 0);

	if (long PID = __fork())			if (long PID = __fork())
	return;			return;
	watchProcess();			watchProcess();
	}			}
	}			}

	extern "C" __attribute((force_align_arg_pointer)) void			extern "C" __attribute((force_align_arg_pointer)) void
	instrumentIndirectCall(uint64_t Target, uint64_t IndCallID) {			instrumentIndirectCall(uint64_t Target, uint64_t IndCallID) {
	GlobalIndCallCounters[IndCallID].incrementVal(Target, GlobalAlloc);			GlobalIndCallCounters[IndCallID].incrementVal(Target, *GlobalAlloc);
	}			}

	/// We receive as in-stack arguments the identifier of the indirect call site			/// We receive as in-stack arguments the identifier of the indirect call site
	/// as well as the target address for the call			/// as well as the target address for the call
	extern "C" __attribute((naked)) void __bolt_instr_indirect_call()			extern "C" __attribute((naked)) void __bolt_instr_indirect_call()
	{			{
	__asm__ __volatile__(SAVE_ALL			__asm__ __volatile__(SAVE_ALL
	"mov 0xa0(%%rsp), %%rdi\n"			"mov 0xa0(%%rsp), %%rdi\n"
	▲ Show 20 Lines • Show All 80 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[BOLT][Instrumentation] Fix hash table memory corruption and append-pid option
ClosedPublic