This is an archive of the discontinued LLVM Phabricator instance.

Differential D133181

[test] Remove problematic thread from MainLoopTest to fix flakiness
Closed · Public

Authored by rupprecht on Sep 1 2022, 8:07 PM.

Details

Reviewers

labath

Commits

rG945bdb167ff5: [test] Remove problematic thread from MainLoopTest to fix flakiness

Summary

This test, specifically TwoSignalCallbacks, can be a little bit flaky, failing in around 5/2000 runs.

POSIX says:

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

The problem is that in test setup, we create a new thread with std::async and that is occasionally not cleaned up. This leaves that thread available to eat the signal we're polling for.

The need for this to be async does not apply anymore, so we can just make it synchronous.

With this change, the test passes in 10000 runs.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rupprecht created this revision. · Sep 1 2022, 8:07 PM

Herald added a project: Restricted Project. · Sep 1 2022, 8:07 PM

rupprecht requested review of this revision. · Sep 1 2022, 8:07 PM

Herald added a subscriber: lldb-commits.

Harbormaster completed remote builds in B184750: Diff 457487. · Sep 1 2022, 8:11 PM

Herald added a subscriber: JDevlieghere. · Sep 1 2022, 8:11 PM

AFAICT kill is entirely asynchronous

This is not exactly true. POSIX describes this situation quite well (emphasis mine):

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Our problem is that this sentence does not apply here, not for one, but two reasons:

"sig is not blocked for the calling thread" -- the calling thread in fact has the signal blocked. That is expected, as it will be unblocked (and delivered) in the ppoll call inside MainLoop::Run. That is a pretty good way to catch signals race-free, but it also relies on another part of the sentence above.
"no other thread has sig unblocked" -- it only works if there are no other threads willing to accept that signal, and I believe that is what is failing us here. This test does in fact create an extra thread in its SetUp function on line 35. By the time we leave the SetUp function, that thread has finished with its useful work (producing the future object), but I suspect what is happening is that, occasionally, the OS-level thread fails to exit on time and eats our signal.

In this case, I believe that the simplest way to fix this is to get rid of that thread. I believe it was necessary at some point in the past (when we were doing the Listen+Accept calls as a single action), but now it is not necessary as the self-connection can be completed without having two threads actively connecting to each other -- it's enough that one socket declares its intent to accept (listen to) a connection. That will make the test simpler, and I believe it will also fix the flakiness you observed.
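The two conditions quoted above can be illustrated with a small standalone program. This is only a sketch for a POSIX system, not code from MainLoopTest.cpp: the names `stray_thread_eats_signal`, `count_signal`, `g_handled` and `nap` are hypothetical, and the "stray" thread is created on purpose to stand in for a leftover thread that has not exited yet. The caller keeps SIGUSR1 blocked, the stray thread has it unblocked, and a process-directed `kill` is then handled by the stray thread without ever becoming pending for the caller.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

namespace {
// Lock-free atomics may be modified from a signal handler.
std::atomic<int> g_handled{0};
void count_signal(int) { g_handled.fetch_add(1); }

void nap() { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }
} // namespace

// Returns true if a process-directed SIGUSR1 was handled by another thread
// while the calling thread, which keeps the signal blocked, never saw it.
bool stray_thread_eats_signal() {
  g_handled = 0;

  struct sigaction sa = {}, old_sa = {};
  sa.sa_handler = count_signal;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
    return false;

  // Block SIGUSR1 in the calling thread, as a ppoll-style loop would.
  sigset_t usr1, old_mask;
  sigemptyset(&usr1);
  sigaddset(&usr1, SIGUSR1);
  pthread_sigmask(SIG_BLOCK, &usr1, &old_mask);

  // Stand-in for the leftover thread: it inherits the blocked mask, so it
  // unblocks SIGUSR1 explicitly and then idles.
  std::atomic<bool> ready{false}, stop{false};
  std::thread stray([&] {
    pthread_sigmask(SIG_UNBLOCK, &usr1, nullptr);
    ready = true;
    while (!stop)
      nap();
  });
  while (!ready)
    nap();

  // Process-directed signal: any thread with SIGUSR1 unblocked may take it.
  kill(getpid(), SIGUSR1);

  // Bounded wait (about five seconds) for the handler to run somewhere.
  for (int i = 0; i < 5000 && g_handled == 0; ++i)
    nap();

  // If the stray thread handled the signal, nothing is left for the caller.
  sigset_t pending;
  sigemptyset(&pending);
  sigpending(&pending);
  bool visible_to_caller = sigismember(&pending, SIGUSR1) == 1;

  stop = true;
  stray.join();
  pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
  sigaction(SIGUSR1, &old_sa, nullptr);
  return g_handled == 1 && !visible_to_caller;
}
```

A loop that waits for SIGUSR1 in the calling thread after this point would block forever, which matches the hang described for the flaky runs.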

lldb/unittests/Host/MainLoopTest.cpp
34–35	Thread created here.
41	Delete the async call above, and put something like `ASSERT_TRUE(listen_socket_up->Accept(accept_socket).Success())` here.
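The claim that a self-connection needs no second thread can be checked with plain POSIX sockets. The sketch below does not use LLDB's TCPSocket class; `self_connect_single_thread` is a hypothetical helper, and it assumes an IPv4 loopback interface is available. Once the listening socket exists, a blocking `connect` completes against the listen backlog, and `accept` afterwards simply picks up the queued connection.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns true if listen -> connect -> accept succeeds on one thread and a
// byte written on the connecting side arrives on the accepted side.
bool self_connect_single_thread() {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0)
    return false;

  // Bind to 127.0.0.1 with port 0 so the kernel picks a free port.
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;
  socklen_t len = sizeof(addr);
  bool ok =
      bind(listener, reinterpret_cast<sockaddr *>(&addr), len) == 0 &&
      listen(listener, 5) == 0 &&
      // Read back the port that was actually assigned.
      getsockname(listener, reinterpret_cast<sockaddr *>(&addr), &len) == 0;

  int client = -1;
  int accepted = -1;
  if (ok) {
    client = socket(AF_INET, SOCK_STREAM, 0);
    // No thread is sitting in accept() yet; the backlog holds the connection.
    ok = client >= 0 &&
         connect(client, reinterpret_cast<sockaddr *>(&addr), len) == 0;
  }
  if (ok) {
    accepted = accept(listener, nullptr, nullptr);
    ok = accepted >= 0;
  }
  if (ok) {
    char sent = 'x';
    char got = 0;
    ok = write(client, &sent, 1) == 1 && read(accepted, &got, 1) == 1 &&
         got == sent;
  }

  if (accepted >= 0)
    close(accepted);
  if (client >= 0)
    close(client);
  close(listener);
  return ok;
}
```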

Remove async call to avoid deadlock instead

In D133181#3766319, @labath wrote:

AFAICT kill is entirely asynchronous

This is not exactly true. POSIX describes this situation quite well (emphasis mine):

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Our problem is that this sentence does not apply here, not for one, but two reasons:

"sig is not blocked for the calling thread" -- the calling thread in fact has the signal blocked. That is expected, as it will be unblocked (and delivered) in the ppoll call inside MainLoop::Run. That is a pretty good way to catch signals race-free, but it also relies on another part of the sentence above.

"no other thread has sig unblocked" -- it only works if there are no other threads willing to accept that signal, and I believe that is what is failing us here. This test does in fact create an extra thread in its SetUp function on line 35. By the time we leave the SetUp function, that thread has finished with its useful work (producing the future object), but I suspect what is happening is that, occasionally, the OS-level thread fails to exit on time and eats our signal.

In this case, I believe that the simplest way to fix this is to get rid of that thread. I believe it was necessary at some point in the past (when we were doing the Listen+Accept calls as a single action), but now it is not necessary as the self-connection can be completed without having two threads actively connecting to each other -- it's enough that one socket declares its intent to accept (listen to) a connection. That will make the test simpler, and I believe it will also fix the flakiness you observed.

Yes, this works too. Updated the diff with that suggestion.

It definitely is simpler in terms of the delta from this diff, although I do worry it kicks the can down the road -- AFAIK it's generally a hard problem within a block of code to verify a thread hasn't been started somewhere else, especially in this case where it was done via a std::future/std::async with no hint that the thread wasn't cleaned up yet. So if we ever have another test in /Host/ that gets linked into the same test binary, and that test runs first and starts a thread, this test could start being flaky again. Is there anything we can do to make sure that kind of scenario doesn't happen?

Harbormaster completed remote builds in B184826: Diff 457598. · Sep 2 2022, 8:07 AM

It might be a good idea to also change the kill(getpid(), sig); statements into raise(sig) (a.k.a. pthread_kill(pthread_self(), sig)), so that they're sent to a specific thread, instead of the whole process.
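The difference a thread-directed signal makes can be sketched the same way. This is an illustration for a POSIX system rather than the test's code, and the names `thread_directed_signal_stays_with_sender`, `count_handler_run`, `g_handler_runs` and `nap` are hypothetical. A bystander thread has SIGUSR1 unblocked, yet a signal sent with `pthread_kill(pthread_self(), sig)` stays pending for the sender, which keeps it blocked, until the sender itself collects it.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <pthread.h>
#include <signal.h>

namespace {
// Counts handler invocations; it should stay at zero in this sketch.
std::atomic<int> g_handler_runs{0};
void count_handler_run(int) { g_handler_runs.fetch_add(1); }

void nap() { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }
} // namespace

// Returns true if a thread-directed SIGUSR1 stayed pending for the sender
// even though another thread had the signal unblocked.
bool thread_directed_signal_stays_with_sender() {
  g_handler_runs = 0;

  struct sigaction sa = {}, old_sa = {};
  sa.sa_handler = count_handler_run;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
    return false;

  // The sender keeps SIGUSR1 blocked.
  sigset_t usr1, old_mask;
  sigemptyset(&usr1);
  sigaddset(&usr1, SIGUSR1);
  pthread_sigmask(SIG_BLOCK, &usr1, &old_mask);

  // A bystander that would happily take a process-directed SIGUSR1.
  std::atomic<bool> ready{false}, stop{false};
  std::thread bystander([&] {
    pthread_sigmask(SIG_UNBLOCK, &usr1, nullptr);
    ready = true;
    while (!stop)
      nap();
  });
  while (!ready)
    nap();

  // Thread-directed signal: only the calling thread can receive it.
  pthread_kill(pthread_self(), SIGUSR1);

  sigset_t pending;
  sigemptyset(&pending);
  sigpending(&pending);
  bool pending_for_sender = sigismember(&pending, SIGUSR1) == 1;

  // Collect the pending signal synchronously; the handler does not run.
  int received = 0;
  if (pending_for_sender)
    sigwait(&usr1, &received);

  stop = true;
  bystander.join();
  pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
  sigaction(SIGUSR1, &old_sa, nullptr);
  return pending_for_sender && received == SIGUSR1 && g_handler_runs == 0;
}
```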

It would also be possible to implement the MainLoop class in such a way that it responds to signals received by other threads as well, although one can ask which behavior is more natural. For the use case we're currently using this for (catching SIGCHLDs), it wouldn't make a difference though.

This revision is now accepted and ready to land. · Sep 2 2022, 8:33 AM

rupprecht retitled this revision from [test] Ensure MainLoop has time to start listening for signals. to [test] Remove problematic thread from MainLoopTest to fix flakiness. · Sep 2 2022, 9:42 AM

rupprecht edited the summary of this revision.

Use pthread_kill to only kill the current thread

In D133181#3767072, @labath wrote:

It might be a good idea to also change the kill(getpid(), sig); statements into raise(sig) (a.k.a. pthread_kill(pthread_self(), sig)), so that they're sent to a specific thread, instead of the whole process.

Nice, that also works. It fixes the problem on its own, but I'll leave both changes in (removing the async too).

Harbormaster completed remote builds in B184850: Diff 457631. · Sep 2 2022, 10:32 AM

Closed by commit rG945bdb167ff5: [test] Remove problematic thread from MainLoopTest to fix flakiness (authored by rupprecht). · Sep 2 2022, 10:32 AM

This revision was automatically updated to reflect the committed changes.

rupprecht added a commit: rG945bdb167ff5: [test] Remove problematic thread from MainLoopTest to fix flakiness.

labath mentioned this in D131160: [WIP][lldb] Add "event" capability to the MainLoop class. · Sep 6 2022, 4:28 AM

labath mentioned this in rG65596cede8a4: [lldb] Go back to process-directed signals in MainLoopTest.cpp. · Sep 6 2022, 5:07 AM

(I've reverted the pthread_kill part, as the mac build did not like it.)

In D133181#3771747, @labath wrote:

(I've reverted the pthread_kill part, as the mac build did not like it.)

Thanks! I didn't get any buildbot notification; do LLDB build bots not send email?

In D133181#3772830, @rupprecht wrote:

In D133181#3771747, @labath wrote:

(I've reverted the pthread_kill part, as the mac build did not like it.)

Thanks! I didn't get any buildbot notification; do LLDB build bots not send email?

The regular buildbot bots do, but I'm not sure about GreenDragon (@JDevlieghere ?). Although no email would have helped in this case, as the bot was already red at the time this landed.

Revision Contents

Path: lldb/unittests/Host/MainLoopTest.cpp
Size: 23 lines

Diff 457633

lldb/unittests/Host/MainLoopTest.cpp

@@ 25 unchanged lines not shown @@ void SetUp() override {
     bool child_processes_inherit = false;
     Status error;
     std::unique_ptr<TCPSocket> listen_socket_up(
         new TCPSocket(true, child_processes_inherit));
     ASSERT_TRUE(error.Success());
     error = listen_socket_up->Listen("localhost:0", 5);
     ASSERT_TRUE(error.Success());

     Socket *accept_socket;
-    std::future<Status> accept_error = std::async(std::launch::async, [&] {
-      return listen_socket_up->Accept(accept_socket);
-    });
[inline comment, labath: Thread created here.]

     std::unique_ptr<TCPSocket> connect_socket_up(
         new TCPSocket(true, child_processes_inherit));
     error = connect_socket_up->Connect(
         llvm::formatv("localhost:{0}", listen_socket_up->GetLocalPortNumber())
             .str());
     ASSERT_TRUE(error.Success());
-    ASSERT_TRUE(accept_error.get().Success());
+    ASSERT_TRUE(listen_socket_up->Accept(accept_socket).Success());
[inline comment, labath: Delete the async call above, and put something like `ASSERT_TRUE(listen_socket_up->Accept(accept_socket).Success())` here.]

     callback_count = 0;
     socketpair[0] = std::move(connect_socket_up);
     socketpair[1].reset(accept_socket);
   }

   void TearDown() override {
     socketpair[0].reset();
@@ 115 unchanged lines not shown @@
 }

 TEST_F(MainLoopTest, Signal) {
   MainLoop loop;
   Status error;

   auto handle = loop.RegisterSignal(SIGUSR1, make_callback(), error);
   ASSERT_TRUE(error.Success());
-  kill(getpid(), SIGUSR1);
+  pthread_kill(pthread_self(), SIGUSR1);
   ASSERT_TRUE(loop.Run().Success());
   ASSERT_EQ(1u, callback_count);
 }

 // Test that a signal which is not monitored by the MainLoop does not
 // cause a premature exit.
 TEST_F(MainLoopTest, UnmonitoredSignal) {
   MainLoop loop;
   Status error;
   struct sigaction sa;
   sa.sa_sigaction = [](int, siginfo_t *, void *) { };
   sa.sa_flags = SA_SIGINFO; // important: no SA_RESTART
   sigemptyset(&sa.sa_mask);
   ASSERT_EQ(0, sigaction(SIGUSR2, &sa, nullptr));

   auto handle = loop.RegisterSignal(SIGUSR1, make_callback(), error);
   ASSERT_TRUE(error.Success());
-  std::thread killer([]() {
-    sleep(1);
-    kill(getpid(), SIGUSR2);
-    sleep(1);
-    kill(getpid(), SIGUSR1);
-  });
+  pthread_kill(pthread_self(), SIGUSR2);
+  pthread_kill(pthread_self(), SIGUSR1);
   ASSERT_TRUE(loop.Run().Success());
-  killer.join();
   ASSERT_EQ(1u, callback_count);
 }

 // Test that two callbacks can be registered for the same signal
 // and unregistered independently.
 TEST_F(MainLoopTest, TwoSignalCallbacks) {
   MainLoop loop;
   Status error;
   unsigned callback2_count = 0;
   unsigned callback3_count = 0;

   auto handle = loop.RegisterSignal(SIGUSR1, make_callback(), error);
   ASSERT_TRUE(error.Success());

   {
     // Run a single iteration with two callbacks enabled.
     auto handle2 = loop.RegisterSignal(
         SIGUSR1, [&](MainLoopBase &loop) { ++callback2_count; }, error);
     ASSERT_TRUE(error.Success());

-    kill(getpid(), SIGUSR1);
+    pthread_kill(pthread_self(), SIGUSR1);
     ASSERT_TRUE(loop.Run().Success());
     ASSERT_EQ(1u, callback_count);
     ASSERT_EQ(1u, callback2_count);
     ASSERT_EQ(0u, callback3_count);
   }

   {
     // Make sure that remove + add new works.
     auto handle3 = loop.RegisterSignal(
         SIGUSR1, [&](MainLoopBase &loop) { ++callback3_count; }, error);
     ASSERT_TRUE(error.Success());

-    kill(getpid(), SIGUSR1);
+    pthread_kill(pthread_self(), SIGUSR1);
     ASSERT_TRUE(loop.Run().Success());
     ASSERT_EQ(2u, callback_count);
     ASSERT_EQ(1u, callback2_count);
     ASSERT_EQ(1u, callback3_count);
   }

   // Both extra callbacks should be unregistered now.
-  kill(getpid(), SIGUSR1);
+  pthread_kill(pthread_self(), SIGUSR1);
   ASSERT_TRUE(loop.Run().Success());
   ASSERT_EQ(3u, callback_count);
   ASSERT_EQ(1u, callback2_count);
   ASSERT_EQ(1u, callback3_count);
 }
 #endif