This is an archive of the discontinued LLVM Phabricator instance.

[compiler-rt] Add a critical section when flushing gcov counters
ClosedPublic

Authored by calixte on Dec 2 2019, 10:24 AM.

Details

Summary

Counters can be flushed in a multi-threaded context for example when the process is forked in different threads (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Instrumentation/GCOVProfiling.cpp#L632-L663).
In order to avoid pretty bad things, a critical section is needed around the flush.
We had a lot of crashes in this code in Firefox CI when we switched to clang for linux ccov builds and those crashes disappeared with this patch.

Event Timeline

calixte created this revision.Dec 2 2019, 10:24 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 2 2019, 10:24 AM
Herald added subscribers: llvm-commits, Restricted Project, jfb. · View Herald Transcript
calixte retitled this revision from Add a critical section when flushing gcov counters to [compiler-rt] Add a critical section when flushing gcov counters.Dec 2 2019, 10:37 AM
dmajor added a subscriber: dmajor.Dec 2 2019, 10:42 AM

The extra setup/teardown work for Windows relative to Linux is unfortunate. How would you feel about using an SRWLock instead? They accept a static initializer. (https://docs.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks)

The extra setup/teardown work for Windows relative to Linux is unfortunate. How would you feel about using an SRWLock instead?

Are there minimum Windows version requirements for compiler-rt? SRWLock is only available on Win7 forward, I think.

froydnj resigned from this revision.Dec 2 2019, 11:11 AM

Also, I am the farthest thing from the owner of this code (I don't even have commit access!), so I don't think I'm the right person to review this.

dmajor added a subscriber: froydnj.Dec 2 2019, 12:27 PM

The extra setup/teardown work for Windows relative to Linux is unfortunate. How would you feel about using an SRWLock instead?

Are there minimum Windows version requirements for compiler-rt? SRWLock is only available on Win7 forward, I think.

I believe SRWLocks arrived in Vista. Anyway, I don't know about compiler-rt specifically, but LLVM in general has required Win7 since 2015: http://llvm.org/viewvc/llvm-project?view=revision&revision=249332

calixte updated this revision to Diff 231843.Dec 3 2019, 1:21 AM

Use SRWLock instead of a CRITICAL_SECTION.

calixte updated this revision to Diff 231844.Dec 3 2019, 1:33 AM

Remove unused function

calixte updated this revision to Diff 231846.Dec 3 2019, 1:49 AM

Use HEAD~3 with arc diff

dmajor accepted this revision.Dec 3 2019, 6:00 AM
dmajor added a reviewer: davidxl.

This lgtm from the Windows side but I'd feel better if you got an additional review from someone who knows gcov better than me.

This revision is now accepted and ready to land.Dec 3 2019, 6:00 AM
marco-c accepted this revision.Dec 3 2019, 2:23 PM

LGTM. GCC is doing the same.

After landing this, we should be able to drop the mutex we had implemented in Firefox to workaround the absence of synchronization when explicitly requesting flush (https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/tools/code-coverage/CodeCoverageHandler.cpp#26-30).

I'm not so sure about the Windows 7 requirement, as this will prevent Windows Vista as a target, not just as a host. The limitation should only apply when coverage is enabled though, so I think we can live with it.

This revision was automatically updated to reflect the committed changes.
jfb added a comment.Dec 9 2019, 8:56 PM

When you say "when the process is forked in different threads" you mean threads, not fork, right? Because this won't fix any issues you're seeing with fork.

It's also unclear that this fixes the issue correctly. flush_fn_head is accessible from other functions, also unprotected. Can the other functions be invoked while this one is?

@ahatanak it's probably a problem with Darwin.cpp where I need to add some exported symbols.
@jfb sorry but it fixes issues with fork:

  • we had a lot of crashes in Firefox CI when dumping the GCDAs and so I tried this patch on the CI and no more weird crashes.
  • and a minimal test case to reproduce it:
#include <iostream>
#include <thread>
#include <vector>
#include <unistd.h>

void foo()
{
    fork();
}

int main()  
{
    std::vector<std::thread> pool;

    for (int i = 0; i < 10; i++) {
         pool.emplace_back(std::thread(foo));
    }

    for (auto & t : pool) {
        t.join();
    }
    
    return 0;
}

If I compile and run it (clang++-9 -std=c++11 foo.cpp -oo -lpthread --coverage && ./o) I get systematically a SEGFAULT and with the patch no more crashes.
The problem is not with flush_fn_list itself but it's when the functions in the list (gcda dumper) are called concurrently.

calixte reopened this revision.Dec 10 2019, 1:53 AM
This revision is now accepted and ready to land.Dec 10 2019, 1:53 AM
calixte updated this revision to Diff 233049.Dec 10 2019, 4:25 AM

Make symbol gcov_flush_mutex private

calixte added a reviewer: vsk.Dec 10 2019, 4:25 AM

@vsk: the test compiler-rt/trunk/test/profile/instrprof-darwin-exports.c was failing probably because of the missing static on gcov_flush_mutex, could you check that it's ok for you ?

vsk added a comment.Dec 10 2019, 9:29 AM

@vsk: the test compiler-rt/trunk/test/profile/instrprof-darwin-exports.c was failing probably because of the missing static on gcov_flush_mutex, could you check that it's ok for you ?

That should be fine, so long as the symbol is not external it cannot be exported.

This revision was automatically updated to reflect the committed changes.
jfb added a comment.Dec 12 2019, 9:48 AM

@jfb sorry but it fixes issues with fork:

  • we had a lot of crashes in Firefox CI when dumping the GCDAs and so I tried this patch on the CI and no more weird crashes.
  • and a minimal test case to reproduce it:
#include <iostream>
#include <thread>
#include <vector>
#include <unistd.h>

void foo()
{
    fork();
}

int main()  
{
    std::vector<std::thread> pool;

    for (int i = 0; i < 10; i++) {
         pool.emplace_back(std::thread(foo));
    }

    for (auto & t : pool) {
        t.join();
    }
    
    return 0;
}

If I compile and run it (clang++-9 -std=c++11 foo.cpp -oo -lpthread --coverage && ./o) I get systematically a SEGFAULT and with the patch no more crashes.
The problem is not with flush_fn_list itself but it's when the functions in the list (gcda dumper) are called concurrently.

I don't understand this fix then. Can you please explain more? Specifically, if the problem occurs when forking a profiled program then a mutex won't help you. I understand that you were having crashes and aren't anymore, but I don't understand what the actual problem was, and why the proposed fix is correct. If the missing synchronization is in gcda dumpers, why aren't you synchronizing there? What's the actual segfault, why is it crashing? What's the shared state?

At a minimum I'd expect the commit message to explain this... and asking these questions I'm not sure the actual problem has been addressed. Can you please clarify?

@jfb, sorry I thought my commit message was clear enough.
So when coverage is enabled a call to __gcov_flush is inserted just before forkcall:
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Instrumentation/GCOVProfiling.cpp#L659
So if the parent process is forked in different threads then __gcov_flush is called from different threads.
And so different global variables are accessed asynchronously:
https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L104-L108
Initially we saw crashes in our CI: https://bugzilla.mozilla.org/show_bug.cgi?id=1599436
The above test case leads to multiple errors: double-free, "cannot merge previous GCDA file..."
And in debbuging firefox, I saw output_file == NULL (https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L603) where it mustn't.

jfb added a comment.Dec 12 2019, 11:13 AM

@jfb, sorry I thought my commit message was clear enough.
So when coverage is enabled a call to __gcov_flush is inserted just before forkcall:
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Instrumentation/GCOVProfiling.cpp#L659
So if the parent process is forked in different threads then __gcov_flush is called from different threads.
And so different global variables are accessed asynchronously:
https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L104-L108
Initially we saw crashes in our CI: https://bugzilla.mozilla.org/show_bug.cgi?id=1599436
The above test case leads to multiple errors: double-free, "cannot merge previous GCDA file..."
And in debbuging firefox, I saw output_file == NULL (https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L603) where it mustn't.

OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?

jfb added a comment.Jan 3 2020, 1:44 PM

OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?

Following up, I'd like to understand this.

OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?

As far as I know, the functions using global variables are called from __gcov_flush and from functions ran atexit, so the only way for these global variables to be used asynchronously is when __gcov_flush is called from different threads.
@jfb if you see other possible paths, please tell me.