This is an archive of the discontinued LLVM Phabricator instance.

[test] Remove problematic thread from MainLoopTest to fix flakiness
ClosedPublic

Authored by rupprecht on Sep 1 2022, 8:07 PM.

Details

Summary

This test, specifically TwoSignalCallbacks, can be a little bit flaky, failing in around 5/2000 runs.

POSIX says:

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

The problem is that in test setup, we create a new thread with std::async and that is occasionally not cleaned up. This leaves that thread available to eat the signal we're polling for.

The need for this to be async does not apply anymore, so we can just make it synchronous.

This makes the test passes in 10000 runs.

Diff Detail

Event Timeline

rupprecht created this revision.Sep 1 2022, 8:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2022, 8:07 PM
rupprecht requested review of this revision.Sep 1 2022, 8:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2022, 8:07 PM
labath added a subscriber: labath.Sep 2 2022, 12:45 AM

AFAICT kill is entirely asynchronous

This is not exactly true. POSIX describes this situation quite well (emphasis mine):

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Our problem is that this sentence does not apply here, not for one, but two reasons:

  • "sig is not blocked for the calling thread" -- the calling thread in fact has the signal blocked. That is expected, as it will be unblocked (and delivered) in the ppoll call inside MainLoop::Run. That is a pretty good way to catch signals race-free, but it also relies on the another part of the sentence above.
  • "no other thread has sig unblocked" -- it only works if there are no other threads willing to accept that signal, and I believe that is what is failing us here. This test does in fact create an extra thread in its SetUp function on line 35. By the time we leave the SetUp function, that thread has finished with its useful work (producing the future object), but I suspect what is happening is that, occasionally, the OS-level thread fails to exit on time and eats our signal.

In this case, I believe that the simplest way to fix this is to get rid of that thread. I believe it was necessary at some point in the past (when we were doing the Listen+Accept calls as a single action), but now it is not necessary as the self-connection can be completed without having two threads actively connecting to each other -- it's enough that one socket declares its intent to accept (listen to) a connection. That will make the test simpler. and I believe it will also fix the flakyness you observed.

lldb/unittests/Host/MainLoopTest.cpp
34–35

Thread created here.

41

Delete the async call above, and put something like ASSERT_TRUE(listen_socket_up->Accept(accept_socket).Success()) here.

rupprecht updated this revision to Diff 457598.Sep 2 2022, 7:53 AM
  • Remove async call to avoid deadlock instead
rupprecht marked 2 inline comments as done.Sep 2 2022, 8:03 AM

AFAICT kill is entirely asynchronous

This is not exactly true. POSIX describes this situation quite well (emphasis mine):

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Our problem is that this sentence does not apply here, not for one, but two reasons:

  • "sig is not blocked for the calling thread" -- the calling thread in fact has the signal blocked. That is expected, as it will be unblocked (and delivered) in the ppoll call inside MainLoop::Run. That is a pretty good way to catch signals race-free, but it also relies on the another part of the sentence above.
  • "no other thread has sig unblocked" -- it only works if there are no other threads willing to accept that signal, and I believe that is what is failing us here. This test does in fact create an extra thread in its SetUp function on line 35. By the time we leave the SetUp function, that thread has finished with its useful work (producing the future object), but I suspect what is happening is that, occasionally, the OS-level thread fails to exit on time and eats our signal.

In this case, I believe that the simplest way to fix this is to get rid of that thread. I believe it was necessary at some point in the past (when we were doing the Listen+Accept calls as a single action), but now it is not necessary as the self-connection can be completed without having two threads actively connecting to each other -- it's enough that one socket declares its intent to accept (listen to) a connection. That will make the test simpler. and I believe it will also fix the flakyness you observed.

Yes, this works too. Updated the diff with that suggestion.

It definitely is simpler in terms of the delta from this diff, although I do worry it kicks the can down the road -- AFAIK it's generally a hard problem within a block of code to verify a thread hasn't been started somewhere else, especially in this case where it was done via a std::future/std::async with no hint that the thread wasn't cleaned up yet. So if we ever have another test in /Host/ that gets linked into the same test binary, and that test runs first and starts a thread, this test could start being flaky again. Is there anything we can do to make sure that kind of scenario doesn't happen?

labath accepted this revision.Sep 2 2022, 8:33 AM

It might be a good idea to also change the kill(getpid(), sig); statements into raise(sig) (a.k.a. pthread_kill(pthread_self(), sig)), so that they're sent to a specific thread, instead of the whole process.

It would also be possible to implement the MainLoop class in such a way that it responds to signals received by other threads as well, although one can ask himself which behavior is more natural. For the use case we're currently using this (catching SIGCHLDs) it wouldn't make a difference though.

This revision is now accepted and ready to land.Sep 2 2022, 8:33 AM
rupprecht retitled this revision from [test] Ensure MainLoop has time to start listening for signals. to [test] Remove problematic thread from MainLoopTest to fix flakiness.Sep 2 2022, 9:42 AM
rupprecht edited the summary of this revision. (Show Details)
rupprecht updated this revision to Diff 457631.Sep 2 2022, 10:17 AM
  • Use pthread_kill to only kill the current thread

It might be a good idea to also change the kill(getpid(), sig); statements into raise(sig) (a.k.a. pthread_kill(pthread_self(), sig)), so that they're sent to a specific thread, instead of the whole process.

Nice, that also works. It fixes the problem on its own, but I'll leave both changes in (removing the async too).

labath added a comment.Sep 6 2022, 5:09 AM

(I've reverted the pthread_kill part, as the mac build did not like it.)

(I've reverted the pthread_kill part, as the mac build did not like it.)

Thanks! I didn't get any buildbot notification; do LLDB build bots not send email?

(I've reverted the pthread_kill part, as the mac build did not like it.)

Thanks! I didn't get any buildbot notification; do LLDB build bots not send email?

The regular buildbot bots do, but I'm not sure about the GreenDragon (@JDevlieghere ?). Although no emails would help in this case, as the bot was already red at the time this landed.