Counters can be flushed in a multi-threaded context for example when the process is forked in different threads (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Instrumentation/GCOVProfiling.cpp#L632-L663).
In order to avoid pretty bad things, a critical section is needed around the flush.
We had a lot of crashes in this code in Firefox CI when we switched to clang for linux ccov builds and those crashes disappeared with this patch.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 42199 Build 42614: arc lint + arc unit
Event Timeline
The extra setup/teardown work for Windows relative to Linux is unfortunate. How would you feel about using an SRWLock instead? They accept a static initializer. (https://docs.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks)
Are there minimum Windows version requirements for compiler-rt? SRWLock is only available on Win7 forward, I think.
Also, I am the farthest thing from the owner of this code (I don't even have commit access!), so I don't think I'm the right person to review this.
I believe SRWLocks arrived in Vista. Anyway, I don't know about compiler-rt specifically, but LLVM in general has required Win7 since 2015: http://llvm.org/viewvc/llvm-project?view=revision&revision=249332
This lgtm from the Windows side but I'd feel better if you got an additional review from someone who knows gcov better than me.
LGTM. GCC is doing the same.
After landing this, we should be able to drop the mutex we had implemented in Firefox to workaround the absence of synchronization when explicitly requesting flush (https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/tools/code-coverage/CodeCoverageHandler.cpp#26-30).
I'm not so sure about the Windows 7 requirement, as this will prevent Windows Vista as a target, not just as a host. The limitation should only apply when coverage is enabled though, so I think we can live with it.
This has broken http://lab.llvm.org:8080/green/job/clang-stage1-RA/4401.
Can you fix this ASAP?
I've reverted this patch:
https://github.com/llvm/llvm-project/commit/78a7af456dbb8c43ab4f4616c14a78716a7c5d84
When you say "when the process is forked in different threads" you mean threads, not fork, right? Because this won't fix any issues you're seeing with fork.
It's also unclear that this fixes the issue correctly. flush_fn_head is accessible from other functions, also unprotected. Can the other functions be invoked while this one is?
@ahatanak it's probably a problem with Darwin.cpp where I need to add some exported symbols.
@jfb sorry but it fixes issues with fork:
- we had a lot of crashes in Firefox CI when dumping the GCDAs and so I tried this patch on the CI and no more weird crashes.
- and a minimal test case to reproduce it:
#include <iostream> #include <thread> #include <vector> #include <unistd.h> void foo() { fork(); } int main() { std::vector<std::thread> pool; for (int i = 0; i < 10; i++) { pool.emplace_back(std::thread(foo)); } for (auto & t : pool) { t.join(); } return 0; }
If I compile and run it (clang++-9 -std=c++11 foo.cpp -oo -lpthread --coverage && ./o) I get systematically a SEGFAULT and with the patch no more crashes.
The problem is not with flush_fn_list itself but it's when the functions in the list (gcda dumper) are called concurrently.
@vsk: the test compiler-rt/trunk/test/profile/instrprof-darwin-exports.c was failing probably because of the missing static on gcov_flush_mutex, could you check that it's ok for you ?
I don't understand this fix then. Can you please explain more? Specifically, if the problem occurs when forking a profiled program then a mutex won't help you. I understand that you were having crashes and aren't anymore, but I don't understand what the actual problem was, and why the proposed fix is correct. If the missing synchronization is in gcda dumpers, why aren't you synchronizing there? What's the actual segfault, why is it crashing? What's the shared state?
At a minimum I'd expect the commit message to explain this... and asking these questions I'm not sure the actual problem has been addressed. Can you please clarify?
@jfb, sorry I thought my commit message was clear enough.
So when coverage is enabled a call to __gcov_flush is inserted just before forkcall:
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Instrumentation/GCOVProfiling.cpp#L659
So if the parent process is forked in different threads then __gcov_flush is called from different threads.
And so different global variables are accessed asynchronously:
https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L104-L108
Initially we saw crashes in our CI: https://bugzilla.mozilla.org/show_bug.cgi?id=1599436
The above test case leads to multiple errors: double-free, "cannot merge previous GCDA file..."
And in debbuging firefox, I saw output_file == NULL (https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/profile/GCDAProfiling.c#L603) where it mustn't.
OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?
OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?
Following up, I'd like to understand this.
OK that helps a bit. I still don't understand though: why do you only need to protect flush and nothing else? You're protecting specific variables from being accesses concurrently only though flush, and not through the other paths that access those variables. Are these variables never accessed concurrently from these other functions?
As far as I know, the functions using global variables are called from __gcov_flush and from functions ran atexit, so the only way for these global variables to be used asynchronously is when __gcov_flush is called from different threads.
@jfb if you see other possible paths, please tell me.