This is an archive of the discontinued LLVM Phabricator instance.

[BOLT][Instrumentation] Fix hash table memory corruption and append-pid option
ClosedPublic

Authored by treapster on Jun 26 2023, 7:37 AM.

Details

Summary

This diff fixes a few related issues:

Shared counters when using instrumentation-file-append-pid.
The point of append-pid option is to record separate profiles for separate forks, which is impossible when counters are the same for every process. It leads to a sum of all profiles in every file, plus GlobalWriteProfileMutex located in a shared memory prevents some processes from dumping their data at all. So, in this patch we only map counters as shared when append-pid is not used, and provide a test to ensure that different processes don't pollute each other's profiles.

Hash table corruption
In absence of instrumentation-file-append-pid option, global allocator uses shared pages for allocation. However, since it is a global variable, it gets COW'd after fork if instrumentation-sleep-time is used, or each time a process forks by itself. This means it handles the same pages to every process which causes hash table corruption - different entries overwrite each other, sometimes creating endless cycles. Thus, if we want shared pages, we need to put the allocator itself in a shared page, which we do in this patch in __bolt_instr_setup.

Unexpected/unspecified behavior of instrumentation-file-append-pid combined with instrumentation-{sleep-time,wait-forks}
The point of instrumentation-sleep-time option is to have a watcher process which shares memory with all other forks and dumps a common profile each n seconds. The append-pid is the opposite - it should record a private profile of each process. Combining the two suggests that we should get a private profile of each fork every n seconds, but such behavior is not implemented currently and is not easy to implement in general, because we somehow need to intercept each individual fork, launch a watcher process just for that fork, and also map counters so that they're only shared with that single fork. Since we're not doing that, the most reasonable thing to do seems to be disallowing such combination of options. I can make a separate diff for that if you think it doesn't fit here.

Also, while debugging all that i created a simple dump() function to understand what's happening, which i include here in case other hash table issues arise.

Diff Detail

Event Timeline

treapster created this revision.Jun 26 2023, 7:37 AM
Herald added a reviewer: Amir. · View Herald Transcript
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: ayermolo. · View Herald Transcript
treapster requested review of this revision.Jun 26 2023, 7:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 26 2023, 7:37 AM
treapster edited the summary of this revision. (Show Details)Jun 26 2023, 7:39 AM
treapster edited the summary of this revision. (Show Details)
treapster edited the summary of this revision. (Show Details)
treapster edited the summary of this revision. (Show Details)Jun 26 2023, 7:42 AM
treapster edited the summary of this revision. (Show Details)
treapster edited the summary of this revision. (Show Details)Jun 26 2023, 7:49 AM

For some reason CHECK-ALL-DAG clause does not always work as expected on the second invocation, what may be happening here? I tried using {{}} regex to match the words exactly, but it didn't help

rafauler added inline comments.Jun 26 2023, 2:46 PM
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

For some reason this test is failing on shared build.

Build BOLT with BUILD_SHARED_LIBS=On to check that.

For some reason CHECK-ALL-DAG clause does not always work as expected on the second invocation, what may be happening here? I tried using {{}} regex to match the words exactly, but it didn't help

missed this comment. Yes, it looks flaky.

Thanks for working on fixing these issues, @treapster! I have some suggestions below.

bolt/runtime/instr.cpp
254

report -> dump

see Graph::dump() for an easier way to write (and read) this code

also include a call to this function inside a DEBUG() macro to showcase how/where you want this printed. If you don't want to always print it in debug, leave it commented out, so at least we know how to use it when we need it.

bolt/test/runtime/instrumentation-indirect-2.c
90–91 ↗(On Diff #534547)

Here I would rather have a command that moves the child/parent fdata to fixed file names such as:

mv $t.$child_pid.fdata child.fdata
mv $t.$par_pid.fdata parent.fdata

The reason is because if we don't do that, each time a developer runs "ninja check-bolt", it will create a new unique file (with a different PID attached to the file name) in the Output folder and won't replace the previous one (the expected behavior), unnecessarily using more disk space over time.

96–97 ↗(On Diff #534547)

This can accidentally match other entries in the profile, such as the activity in funcX calling printf(). I imagine what we want to match is the indirect call itself, right? For that I would rather use:

RUN: llvm-bolt %t.exe -data %t.child.fdata \
RUN:   -print-finalized -print-only=main -o /dev/null | FileCheck %s --check-prefix=CHECK_CHILD

And then match this string:

{{.*}}:   callq   *%rax # CallProfile: 8 (0 misses) :
    { func1: 1 (0 misses) },
    { func3: 1 (0 misses) },

...

Actually, I ran BOLT like that and surprisingly the profile is incorrect and it is printing this:
.LFT15 (8 instructions, align : 1)

Exec Count : 8
CFI State : 3
Predecessors: .Ltmp10
  000000d1:   movslq  -0x94(%rbp), %rax
  000000d8:   movq    -0x90(%rbp,%rax,8), %rax
  000000e0:   movl    -0x98(%rbp), %edi
  000000e6:   callq   *%rax # CallProfile: 8 (0 misses) :
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) },
      { <unknown>: 1 (0 misses) }
  000000e8:   movl    -0x94(%rbp), %eax
  000000ee:   addl    $0x2, %eax
  000000f1:   movl    %eax, -0x94(%rbp)
  000000f7:   jmp     .Ltmp10
Successors: .Ltmp10 (mispreds: 0, count: 8)

I'm not sure why yet.

rafauler added inline comments.Jun 26 2023, 4:12 PM
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

The reason this doesn't work is because "func1" is matching to "func10"

Either remove func10-func16 or rename them with letters. e.g.: funca, funcb, funcc, funcd, funcf, funcg

treapster added inline comments.Jun 27 2023, 1:15 AM
bolt/runtime/instr.cpp
254

It is done this way because we want to construct a string once and print it atomically in a single call to write(). If write is called more than once, we get garbage when threads and processes are involved.

treapster marked an inline comment as done.Jun 27 2023, 1:16 AM
treapster added inline comments.Jun 27 2023, 1:19 AM
bolt/test/runtime/instrumentation-indirect-2.c
96–97 ↗(On Diff #534547)

If we match instructions, it unnecessarily becomes arch-specific. I think we need to figure out why indirect calls are not recorded in profile and construct a regex to match them there.

treapster added inline comments.Jun 27 2023, 1:23 AM
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

I tried matching {{\bfunc1\b}} and similar constructs and it didn't help. I now changed numbers to letters and it still seems to fail sometimes..

treapster edited the summary of this revision. (Show Details)Jun 27 2023, 2:14 AM

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

treapster added a comment.EditedJun 27 2023, 7:19 AM

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

But since .bolt.instr.tables is not allocatable, the only option is to perform relocation by hand. BTW, why isn't it allocatable?

rafauler added inline comments.Jun 27 2023, 1:56 PM
bolt/runtime/instr.cpp
254

Sounds good, can you write a comment explaining that? (regarding the single call to write() )

bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

You're right, unfortunately it seems to be failing some times. That's annoying. I gave up on the DAG thing and tried this:

https://pastebin.com/rgn1hKr9

And it seems to be working consistently. That's one option forward, if you're OK with it.

96–97 ↗(On Diff #534547)

Sounds good. Matching against the CallProfile annotation (ignoring the instruction opcode) shouldn't be arch-specific, though. But I'm also OK with regex matching fdata, whichever you prefer.

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

Good catch, that looks like a nasty bug in instrumentation for PIE objects. If you like, we can commit this diff forcing the test to be no-pie and then work on the fix for PIE on another diff.

Turns out, the [unknown] entries in profile are because addresses in indirect call descriptions are not relocated, which makes them meaningless in PIE because of ASLR. When the test is compiled with no-pie, indirect calls are recorded accurately. So, we should either compute base address and add it to stored entries, or produce dynamic relocations.

But since .bolt.instr.tables is not allocatable, the only option is to perform relocation by hand. BTW, why isn't it allocatable?

Because it is encoded as an ELF note section. That's not allocatable at load time, but our runtime will open the ELF file and read them. I don't remember why exactly I did that, I think at the time my motivation was to avoid as much as possible relying on the linker (RuntimeDyld) resolving all references from code to this table. So we just manually deserialize it instead of encoding as a global object in the binary. When this was written I knew we were abusing RuntimeDyld and that it wouldn't work in a variety of scenarios, that's why I was trying to keep the code as easy on the linker as possible (see bolt/docs/RuntimeLibrary.md - section Limitations).

Anyway, even we do encode this section (.bolt.instr.tables) as allocatable data section with dynamic R_X86_64_RELATIVE relocs fixing these addresses, we will likely have to figure out how to generate and insert the .rela section correctly in the binary. Maybe now that we're using JITLink, this task will be easier, I don't know.

Manually adding the PIE load address in instr.cpp:readDescription() might or might not be easier.

treapster added inline comments.Jun 27 2023, 2:51 PM
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

It's ok but it will fire if the order in profile changes. Is it specified currently? Also we probably need to use -sleep-time for the first run, because otherwise the first process may start writing before the second finishes, and the second won't overwrite with newer profile because write mutex is locked. I'll play with it a bit more, it may not be a bug in FileCheck:)

ayermolo added inline comments.Jun 27 2023, 2:58 PM
bolt/runtime/instr.cpp
255

Can you add some kind of check that we are not overflowing Buf?

treapster added a comment.EditedJun 27 2023, 3:01 PM

Manually adding the PIE load address in instr.cpp:readDescription() might or might not be easier.

The main issue here is we'll have to either map the whole BinContents as RW, or memcpy just descriptions to another writeble region. Then just loop over entries and patch. Sounds a bit easier than messing with JITLink, though. Another solution may be to just subtract the base address every time inside lookupIndCallTarget.

rafauler added inline comments.Jun 27 2023, 3:12 PM
bolt/test/runtime/instrumentation-indirect-2.c
65–80 ↗(On Diff #534547)

You're right, the order is not guaranteed.

treapster updated this revision to Diff 535156.Jun 27 2023, 3:22 PM

Add comment and buffer overflow assertion

treapster marked 2 inline comments as done.Jun 27 2023, 3:24 PM
treapster updated this revision to Diff 535379.Jun 28 2023, 6:55 AM

Fix test, use no-pie executable.

! In D153771#4453963, @rafauler wrote:
If you like, we can commit this diff forcing the test to be no-pie and then work on the fix for PIE on another diff.

Done, the test is non-pie now

rafauler added inline comments.Jun 28 2023, 4:04 PM
bolt/runtime/instr.cpp
216

void *

1627–1629

Drop the cast

GlobalMetadataStorage = 
    __mmap(0, 4096, PROT_READ | PROT_WRITE,
           (Shared ? MAP_SHARED : MAP_PRIVATE) | MAP_ANONYMOUS, -1, 0);
bolt/test/runtime/instrumentation-indirect-2.c
111 ↗(On Diff #535379)

Do you really need that? (Lines 111 and 66)

I'm testing locally with the sleep timers removed and the test is not failing. If line 65 " Wait for profile and output to be fully written" is the problem, I don't know, perhaps something like https://linux.die.net/man/1/sync? (I don't know because I'm not sure what's happening in your machine, but it looks odd to me that the process will finish but your test script will access a half-finished output file).

If the problem is the one described in line 110 "in case child outlives parent", then perhaps using bash's "wait" command? https://phoenixnap.com/kb/bash-wait-command

Amir added inline comments.Jun 28 2023, 5:48 PM
bolt/runtime/common.h
85 ↗(On Diff #535379)

@rafauler: so why don't we include mman.h?

@treapster: since this particular change is NFC, can you please split it out to reduce functional changes surface and simplify testing and reviewing?

[The patch should] be an isolated change. Independent changes should be submitted as separate patches as this makes reviewing easier.

https://llvm.org/docs/Contributing.html#how-to-submit-a-patch

rafauler added inline comments.Jun 28 2023, 6:05 PM
bolt/runtime/common.h
85 ↗(On Diff #535379)

If we include any headers, it will pull the file from the host system, which might not match the target system.

I actually prefer this particular patch as is (non-split). In general I think it's harder for me to work reviewing large stacks (unless there are lots of lines of code changed) and I will internally squash a stack of diffs into one for easier testing.

But I agree there are many benefits to splitting a diff, so it's fine to me if @treapster wants to do that.

treapster added inline comments.Jun 29 2023, 1:09 AM
bolt/test/runtime/instrumentation-indirect-2.c
111 ↗(On Diff #535379)

I do get spurious failures if i remove any of the sleeps, and in the second case it always looks like incomplete output. Wait only works for a background job of the shell, in our case we have the main process finished and the output file read by next commands before the child finishes and writes it's output. AFAIK there is no way for us to wait for the whole process tree so we have to just sleep and hope it's finished. Regarding sync, it is also not what we need because we don't care whether the file is flushed to disc or not, we just need to wait till a process stops writing to it. Although sync seems to be working, probably because it spends enough time flushing for child process to finish and write everything. But still can fail any time.
There is another solution, however: for stdout redirection, i imagine we'll have the file open till both processes finish. So we can wait in a loop until fuser returns 1, which will guarantee both processes finished. For instrumentation profile, we'll need to change watchProcess() so that it opens profile only once and keeps FD open until it's done, seeking to zero on every iteration. This way we can also query fuser on the file to know when it's safe to read it. But this is a topic for another diff:).

treapster updated this revision to Diff 535701.Jun 29 2023, 3:14 AM

Fix nits, add mmap check

treapster marked 2 inline comments as done.Jun 29 2023, 3:19 AM
rafauler accepted this revision.Jun 29 2023, 1:29 PM

LGTM without the test

Just commit the change without the test. Honestly it's not worth it including a test that might fail because the children process is not synced, and I fear sleep wouldn't be a proper sync if the system is heavily loaded. We have too much infrastructure that depends on tests being solid and stable, and the risk of random failures isn't worth the pain. We can work on fixing instrumentation testability on other diffs.

This revision is now accepted and ready to land.Jun 29 2023, 1:29 PM
Amir added a comment.Jun 29 2023, 7:31 PM

This change breaks upstream testing: https://lab.llvm.org/buildbot/#/builders/244/builds/13736. Reverted.

Amir added a comment.Jun 29 2023, 7:55 PM

Also reverted append-pid commit due to the breakage in https://lab.llvm.org/buildbot/#/builders/252/builds/2700.

Amir added a comment.Jun 29 2023, 8:04 PM

Clang-BOLT still fails: https://lab.llvm.org/buildbot/#/builders/252/builds/2701. Not sure what could be the reason, but started failing with this set of changes.

Amir added a comment.Jun 29 2023, 10:08 PM

I'll be reverting one by one until https://lab.llvm.org/buildbot/#/builders/252 passes

Amir added a comment.Jun 29 2023, 10:46 PM

The builder is now green at https://lab.llvm.org/buildbot/#/builders/252/builds/2707.
Looks like the last two reverted commits (NFC with defines + asserts) were not related. Sorry about the churn, please reland. The other two appear to be related, or at least one. Lmk if you need help reproducing that build, but it should generally be straightforward based on cmake args from the builder.

Thanks for catching&reverting that, i'll try to reproduce a bit later and report back.

Amir added a comment.Jun 30 2023, 9:19 AM

Relanded "Don't share counters when using append-pid" – the crash wasn't in instrumented binary.
So the issue must be in "Put Allocator itslef in shared memory by default".

treapster added a comment.EditedJul 3 2023, 6:06 AM

At first i couldn't reproduce with cmake configured to use clang, but with gcc the runtime library does indeed break. Turns out, the definition for

void *operator new(size_t, void *) noexcept;

which was declared in the problematic commit is not provided by gcc, and in runtimelib it's use is replaced with a call and a relocation to undefined symbol(demangled to operator new(unsigned long, void*):

0000000000008352  0000005700000004 R_X86_64_PLT32         0000000000000000 _ZnwmPv - 4

For some reason JITLink does not catch it, and after linking the runtime lib to the binary, the call is just left there with zero immediate which causes segfault.
If we provide a definition for operator new, the problem goes away. Now why i did not provide a definition:

  • That answer says we only need declaration and doesn't mention definition
  • The page on cppreference does not list (9) and (10) as replacable and in Notes section only allows definition in class scope, or in global scope with non-void pointer type.
  • The page from that answer explicitly says: These functions are reserved; a C++ program may not define functions that displace the versions in the C++ standard library.

So it led me to believe that these forms of new are implicitly defined in the compiler and it is UB to define them in global scope. But apparently GCC requires them(or at least the one in question) to be explicitly defined. There is also ambiguity in whether we can call it "displacing the versions in the standard library" when we're not using standard library - whether UB arises from the definition alone or from definition when using standard library. We can probably assume the latter and get away with it, but it is not super clear. So, C++ is being C++ again.

treapster updated this revision to Diff 536765.Jul 3 2023, 7:30 AM

Since we don't need generic global placement new operator, and defining it in class scope is legal according to cppreference, i decided to move the operator to BumpPtrAllocator scope and define it there.

Hello @Amir, i think it's ok to reland it now, do you agree?

Amir added a comment.Jul 6 2023, 1:22 PM

Looks good on my end. Let's reland and monitor the buildbot.