This is an archive of the discontinued LLVM Phabricator instance.

Fix `-jobs=<N>` where <N> > 1 and the number of workers is > 1 on macOS.
ClosedPublic

Authored by delcypher on Jul 24 2016, 9:43 PM.

Details

Summary

Fix -jobs=<N> where <N> > 1 and the number of workers is > 1 on macOS.

The original ExecuteCommand() called system() from the C library.
The C library implementation of this on macOS contains a mutex which
serializes calls to system(). This prevented the -jobs= flag
from running copies of the fuzzing binary in parallel which is
the opposite of what is intended.

To fix this on macOS an alternative implementation of ExecuteCommand()
is provided that can be used concurrently. This is provided in
FuzzerUtilDarwin.cpp which is guarded to only compile code on Apple
platforms. The existing implementation has been moved to a new file
FuzzerUtilLinux.cpp which is guarded to only compile code on Linux.

This commit includes a simple test to check that LibFuzzer is being
executed in parallel when requested.

Diff Detail

Repository
rL LLVM

Event Timeline

delcypher updated this revision to Diff 65294.Jul 24 2016, 9:43 PM
delcypher retitled this revision from to [WIP] Fix `-jobs=<N>` where <N> > 1 on macOS..
delcypher updated this object.
delcypher added reviewers: kcc, aizatsky.
kcc edited edge metadata.Jul 25 2016, 7:07 PM

I am not happy with this implementation as it seems like a gross

Gross indeed.
At the very least, move all the OSX-specific code to an OSX-specific file.

Also, if you had to write that much code, surely it deserves a test :)

This allows pressing CTRL+C in the terminal to do the right thing on macOS.

What exactly do you mean by the "right thing"?
Do we need to do that "right thing" at the cost of such extra complexity?

In D22742#495674, @kcc wrote:

I am not happy with this implementation as it seems like a gross

Gross indeed.
At the very least, move all the OSX-specific code to an OSX-specific file.

I can do that but first I'd like to decide on fix this properly.

Also, if you had to write that much code, surely it deserves a test :)

I wanted to write a test but I don't know how to do so in a reliable way from llvm-lit. The only way I know of doing this is by running ps -af in parallel and checking that the multiple invocations of the fuzzer binary are all using non zero cpu time. Writing something that does that using bash shell code if probably very fragile.

This allows pressing CTRL+C in the terminal to do the right thing on macOS.

What exactly do you mean by the "right thing"?

Without the extra code when running in RunInMultipleProcesses() pressing CTRL+C in a shell
will immediately kill the parent process on macOS and can leave around zombie processes due
waitpid() not being called by the parent process to clean up the children. This is an artifact of my implementation of NonLockingSystem().

Do we need to do that "right thing" at the cost of such extra complexity?

This extra complexity is due to my implementation of NonLockingSystem() not blocking SIGINT and SIGQUIT.
This was a design decision on my part to avoid blocking and unblocking those signals in a racey way.
This is why system() on macOS has a mutex because calling system() from multiple threads is not safe. This
patch blocks and unblocks the signal in the main thread before creating the threads and restores the signal handling
after all threads have finished which is safe but very ugly.

One way of avoiding this "extra complexity" is simply to have NonLockingSystem() also block and unblock SIGINT and SIGQUIT itself
and just accept that doing this is racey to prevent implementation details leaking into other parts of LibFuzzer's code (i.e. what is happening in this patch).

Another way of cleaning this up is to do basically what the patch is doing now (setting and restoring the signal handlers in the main thread) but refactor doing that into helper methods.

The linux code also has the same race but I guess glibc's version of system() doesn't lock so you haven't run into the issue I did.

What is more important to you, race freedom or simpler/cleaner code?

@kcc

It's been pointed out to me that my post might not have been particularly helpful so here is my second attempt at explaining the problem and possible solutions that I'm aware of.

The Problem

On macOS the -jobs=<N> flag does not work as intended. Currently only one job will run at a time even if <N> is greater than 1 and the number of workers is great than 1. Why is it important to fix this? It's useful to take advantage of the multiple cores most machines have for fuzzing.

The root cause of this is calling system() from multiple threads. On macOS the implementation of system() contains a mutex that guards against multiple copies of system() running concurrently. This causes all worker threads apart from one to block when calling ExecuteCommand() because it calls system().

You can see the implementation at https://github.com/unofficial-opensource-apple/Libc/blob/master/stdlib/FreeBSD/system.c which hopefully will make it clear what the problem is.

On Linux where Glibc is usually used, system()'s implementation is different. I was incorrect before when I said it was racey on Linux. I presumed that no lock was being used but provided Glibc is built with _LIBC_REENTRANT on (I'm not sure how to check if glibc was compiled with this. Any ideas?) then it looks like it is safe to call from multiple threads. You can see the implementation at

https://github.com/bminor/glibc/blob/2a69f853c03034c2e383e0f9c35b5402ce8b5473/sysdeps/posix/system.c

Possible solutions

Here are a few ideas I have for solutions

  • Provide a version of ExecuteCommand() that can be safely called and can be used concurrently for macOS that has the same semantics as system(). Basically this would involve implementing something that behaves similarly to glibc's system() for macOS. I think this would probably be the cleanest solution.
  • Use the current patch I have now which reimplements system() but doesn't try to modify signal handlers but instead requires callers to change and restore the signal handlers. Although this works it leads to a very messy implementation so I don't think we should go with this approach.
  • Don't use worker threads and ExecuteCommand() at all. Instead in the main thread just do fork and exec for the number workers we want and then do waitpid. When a child finishes if there are more jobs left we'd do another fork and exec. This would work but it means we leak implementation details (i.e. using fork, exec and waitpid) into other parts of LibFuzzer. This also happens to work because worker threads don't do anything complicated right now. If they were doing something more complicated then this single threaded strategy for spawning copies of LibFuzzer might not work.

What do you think? Do you have any ideas about how the problem should be fixed?

How to test for the bug

I've been thinking about ways to test for this bug. One way would be to start libfuzzer, sleep for a second or so and then check that the log files for the expected number of worker threads exist. This is fragile due to the sleep but I don't really know of a better way to test for this from lit. Does this seem reasonable to you?

delcypher updated this revision to Diff 65871.Jul 27 2016, 10:04 PM
delcypher retitled this revision from [WIP] Fix `-jobs=<N>` where <N> > 1 on macOS. to Fix `-jobs=<N>` where <N> > 1 and the number of workers is > 1 on macOS..
delcypher updated this object.
delcypher edited edge metadata.
  • Separate implementation of ExecuteCommand() for Darwin and Linux.
  • Remove hack for blocking signals out off FuzzerDriver.cpp.
  • Correctly (I think) block and ignore signals in ExecuteCommand() on Darwin.
  • Add test.

@kcc Hopefully you'll find this new implementation a little more tasteful. I've also included a test. I'm not completely happy with the test but it currently passes on my Linux and macOS machines (with my patch applied). The test fails on macOS without my patch to ExecuteCommand().

I like the change but I am not qualified to review the Mac-specific code.
So, please find one more reviewer for FuzzerUtilDarwin.cpp (Kuba?)

lib/Fuzzer/FuzzerUtilDarwin.cpp
32 ↗(On Diff #65871)

use // comments

lib/Fuzzer/FuzzerUtilLinux.cpp
16 ↗(On Diff #65871)

please remove the comment.
There is no problem currently and we should not be commenting on hypothetical problems that much

lib/Fuzzer/test/NoAbort.cpp
10 ↗(On Diff #65871)

why not reuse EmptyTest.cpp?

lib/Fuzzer/test/fuzzer-jobs.test
1 ↗(On Diff #65871)

fix the FIXME

13 ↗(On Diff #65871)

will two jobs be enough?

delcypher added inline comments.Aug 2 2016, 3:52 PM
lib/Fuzzer/test/NoAbort.cpp
10 ↗(On Diff #65871)

Oh I hadn't noticed EmptyTest.cpp. I'll use that instead.

lib/Fuzzer/test/fuzzer-jobs.test
1 ↗(On Diff #65871)

Oops. I didn't remove the FIXME I left for myself. The implementation actually makes a temporary directory. The FIXME should actually state that doing this wouldn't be necessary if we could control the name of the fuzzing log files.

13 ↗(On Diff #65871)

In principle two jobs should be fine. I need to test that the test case on macOS catches the problem with just two jobs.

delcypher added inline comments.Aug 2 2016, 3:54 PM
lib/Fuzzer/test/fuzzer-jobs.test
1 ↗(On Diff #65871)

@kcc: Can I fix this in a separate commit that allows controlling the log file names? Adding a command line option to control this is independent from this patch.

kcc added inline comments.Aug 2 2016, 4:02 PM
lib/Fuzzer/test/fuzzer-jobs.test
1 ↗(On Diff #65871)

just remove the FIXME since the test already creates a separate dir

delcypher updated this revision to Diff 66586.Aug 2 2016, 4:16 PM
  • Use only two jobs in the test
  • Remove /* */ style comments
  • Remove FIXME in test.
  • Use EmptyTest rather than implementing another test.
delcypher updated this revision to Diff 66588.Aug 2 2016, 4:18 PM
  • Drop comments about being reentrant in Linux implementation of ExecuteCommand().
delcypher marked 8 inline comments as done.Aug 2 2016, 4:19 PM

@kcc: Okay I've applied your suggested changes. @kubabrecka when you have time could you review the implementation of ExecuteCommand() for Darwin?

delcypher updated this revision to Diff 67377.Aug 9 2016, 11:08 AM

Be explicit about environ being a C symbol. Previously the name in the emitted object file appeared to be unmangled but to avoid any doubt being explicit is probably a good idea.

kubamracek edited edge metadata.Aug 10 2016, 9:00 AM

A bunch of nits. There really isn’t anything Darwin-specific in the FuzzerUtilDarwin.cpp file, it’s just POSIX code.

lib/Fuzzer/FuzzerUtilDarwin.cpp
24–27 ↗(On Diff #67377)

This can fit just two lines. Typo “varaibles”.

28 ↗(On Diff #67377)

Can you please come up with a better name than “UseCount"?

40–47 ↗(On Diff #67377)

Can we move these variables to the places where they’re first used?

77 ↗(On Diff #67377)

typo

89–94 ↗(On Diff #67377)

Not a huge fan of the "(void)" prefixes. I understand why you put them there, but no other code in libFuzzer seems to be using them. Does this really produce any warnings?

In D22742#511268, @kubabrecka wrote:

A bunch of nits. There really isn’t anything Darwin-specific in the FuzzerUtilDarwin.cpp file, it’s just POSIX code.

It's true that this code isn't really Darwin specific (apart from the forward declared environ global). In principle it ought to work on Linux too. I decided to not change how ExecuteCommand worked on Linux because it already works correctly there. @kcc how would you feel about using the Darwin implementation of ExecuteCommand for Linux too?

kcc added a comment.Aug 10 2016, 11:45 AM

It's true that this code isn't really Darwin specific (apart from the forward declared environ global). In principle it ought to work on Linux too. I decided to not change how ExecuteCommand worked on Linux because it already works correctly there. @kcc how would you feel about using the Darwin implementation of ExecuteCommand for Linux too?

It won't fix any problem and will complicate the code that I have to maintain (and understand) on Linux. So, I won't be happy.

delcypher added inline comments.Aug 10 2016, 3:10 PM
lib/Fuzzer/FuzzerUtilDarwin.cpp
89–94 ↗(On Diff #67377)

@kubabrecka: They are not here to suppress warnings (warnings don't seem to be emitted by clang without them). They are here explicitly state that I intend to ignore the return value.

You are right that the rest of LibFuzzer isn't written in this style. I don't mind removing them though.
@kcc: Do you have a preference here?

In D22742#511570, @kcc wrote:

It's true that this code isn't really Darwin specific (apart from the forward declared environ global). In principle it ought to work on Linux too. I decided to not change how ExecuteCommand worked on Linux because it already works correctly there. @kcc how would you feel about using the Darwin implementation of ExecuteCommand for Linux too?

It won't fix any problem and will complicate the code that I have to maintain (and understand) on Linux. So, I won't be happy.

Okay I will keep the implementations separate then.

delcypher updated this revision to Diff 67618.Aug 10 2016, 3:15 PM
delcypher edited edge metadata.
  • s/UseCount/ActiveThreadCount/
  • Move some variable declarations closer to where they are being used.
  • Fix some typos.
  • Reformat comment line.
  • Fix resource leak (not calling posix_spawnattr_destroy() in certain situations).
delcypher marked 4 inline comments as done.Aug 10 2016, 3:17 PM
delcypher added inline comments.
lib/Fuzzer/FuzzerUtilDarwin.cpp
29 ↗(On Diff #67618)

Renamed UseCount to ActiveThreadCount.

41–48 ↗(On Diff #67618)

I've moved most of them. I've not moved SpawnAttributes though because that would complicate the logic needs to restore the signal handlers when posix_spawnattr_init() fails.

delcypher marked 2 inline comments as done.EditedAug 10 2016, 3:18 PM

@kubabrecka : You can probably make another pass over the patch. The only comment I've not addressed is the use of (void).

In D22742#511268, @kubabrecka wrote:

A bunch of nits. There really isn’t anything Darwin-specific in the FuzzerUtilDarwin.cpp file, it’s just POSIX code.

THANK YOU!

kcc added inline comments.Aug 10 2016, 3:27 PM
lib/Fuzzer/FuzzerUtilDarwin.cpp
90–95 ↗(On Diff #67618)

no preference here.

kubamracek accepted this revision.Aug 11 2016, 8:03 AM
kubamracek edited edge metadata.

LGTM

This revision is now accepted and ready to land.Aug 11 2016, 8:03 AM

@kubabrecka Thanks for the review. @kcc are you happy for this to land now?

kcc accepted this revision.Aug 11 2016, 10:41 AM
kcc edited edge metadata.

LGTM

This revision was automatically updated to reflect the committed changes.