This is an archive of the discontinued LLVM Phabricator instance.

[lldb] [test] Improve stability of llgs vCont-threads tests
ClosedPublic

Authored by mgorny on Jul 1 2022, 12:09 PM.

Details

Summary

Perform a major refactoring of vCont-threads tests in order to attempt
to improve their stability and performance.

Split test_vCont_run_subset_of_threads() into smaller test cases,
and split the whole suite into two files: one for signal-related tests,
the running-subset-of tests.

Eliminate output_match checks entirely, as they are fragile to
fragmentation of output. Instead, for the initial thread list capture
raise an explicit SIGSTOP from inside the test program, and for
the remaining output let the test program run until exit, and check all
the captured output afterwards.

For resume tests, capture the LLDB's thread view before and after
starting new threads in order to determine the IDs corresponding
to subthreads rather than relying on program output for that.

Add a mutex for output to guarantee serialization. A barrier is used
to guarantee that all threads start before SIGSTOP, and an atomic bool
is used to delay prints from happening until after SIGSTOP.

Call std::this_thread::yield() to reduce the risk of one of the threads
not being run.

This fixes the test hangs on FreeBSD. Hopefully, it will also fix all
the flakiness on buildbots.

Sponsored by: The FreeBSD Foundation

Diff Detail

Event Timeline

mgorny created this revision.Jul 1 2022, 12:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 1 2022, 12:09 PM
Herald added a subscriber: arichardson. · View Herald Transcript
mgorny requested review of this revision.Jul 1 2022, 12:09 PM
mgorny updated this revision to Diff 441768.

Remove obsolete <string> include.

mgorny updated this revision to Diff 441940.Jul 3 2022, 4:02 AM

Ok, unsurprisingly it turned out that using exit() when the main thread was potentially suspended was a bad idea (at least it was still fairly unstable on FreeBSD). I've managed to get around that by using _exit() but if you think just raise(SIGSTOP) would be preferred, I can change the tests to use that.

labath added a comment.Jul 6 2022, 2:18 AM

Seems like an improvement. I'd like to hear what you make of the inline comment though.

I can also imagine a setup where most of the verification happens inside the inferior. Like, each time a thread gets to run it increments a variable specific to that thread. Once that number crosses some threshold, it check the variables of the other threads, and verifies that one of them has the number zero, and the number of the other thread is reasonably high (> 10% of the threshold or something).

lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
45

I am not sure if write is an (atomic) system call on windows. Maybe just put a mutex around the printf call? In this setup, I think it would be sufficient to write this once, and then spend the rest of the time making sure the other suspended thread gets a chance to run (if it is not suspended for whatever reason).

mgorny marked an inline comment as done.Jul 6 2022, 6:34 AM

I can also imagine a setup where most of the verification happens inside the inferior. Like, each time a thread gets to run it increments a variable specific to that thread. Once that number crosses some threshold, it check the variables of the other threads, and verifies that one of them has the number zero, and the number of the other thread is reasonably high (> 10% of the threshold or something).

I suppose this would be possible but tbh I think it'd be more complex than the current solution.

lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
45

For some reason, I've assumed we're avoiding mutexes. However, this certainly makes sense, as well as removing duplicate writes. I've also noticed we had a race condition between starting prints and SIGSTOP, so I've fixed that too. Now the main thread is always resumed.

mgorny updated this revision to Diff 442550.Jul 6 2022, 6:36 AM
mgorny marked an inline comment as done.
mgorny edited the summary of this revision. (Show Details)

Implement the discussed changes.

labath accepted this revision.Jul 6 2022, 6:38 AM
labath added inline comments.
lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
45

We have tests like that, but I think they're all related to the situations where we want to ensure some things happen as concurrently as possible. This is not one of those cases, and I don't see any problem with using mutexes here (and it definitely won't be the first such test).

This revision is now accepted and ready to land.Jul 6 2022, 6:38 AM
mgorny updated this revision to Diff 442823.Jul 7 2022, 2:21 AM
mgorny edited the summary of this revision. (Show Details)

Ok, so the previous version was fundamentally botched wrt synchronization.

Here's a new that:

  1. Uses a barrier to ensure that all threads actually start before we SIGSTOP. Otherwise, the debugger would sometimes not get all threads on FreeBSD.
  2. Uses a std::atomic<bool> to ensure that threads don't start printing until we're past SIGSTOP. Originally, I wanted to abuse print_mutex for that but this caused deadlocks on FreeBSD (I guess a suspended thread got in line for the mutex and blocked others from getting it).
  3. Marks can_exit_now as a thread_local variable.
labath accepted this revision.Jul 7 2022, 3:10 AM
labath added inline comments.
lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
18

I guess this should technically be a volatile sig_atomic_t

mgorny added inline comments.Jul 7 2022, 6:28 AM
lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
18

But with thread_local, correct?

labath added inline comments.Jul 7 2022, 6:40 AM
lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
18

Yes. If that's not enough qualifiers, you can even throw in static, for good measure. :)

mgorny marked 2 inline comments as done.Jul 7 2022, 7:27 AM
mgorny added inline comments.
lldb/test/API/tools/lldb-server/vCont-threads/main.cpp
18

Better overqualified than underqualified!

This revision was automatically updated to reflect the committed changes.
mgorny marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2022, 7:35 AM
mgorny added a comment.Jul 7 2022, 8:04 AM

@labath, any clue why it'd be broken on Debian?

https://lab.llvm.org/buildbot/#/builders/68/builds/35396

AssertionError: '$T0athread:1ddf55;name:a.out;00:0000000000000000;01:0000000000000000;02:e1bf0b8 [truncated]... != '$W00'
- $T0athread:1ddf55;name:a.out;00:0000000000000000;01:0000000000000000;02:e1bf0b8d067f0000;03:0000000000000000;04:0200000000000000;05:e0272245fc7f0000;06:90292245fc7f0000;07:e0272245fc7f0000;08:0000000000000000;09:e0272245fc7f0000;0a:0800000000000000;0b:4602000000000000;0c:c091997f07560000;0d:0000000000000000;0e:0000000000000000;0f:0000000000000000;10:e1bf0b8d067f0000;11:4602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:0000000000000000;17:0000000000000000;reason:signal;+ $W00

I mean, this makes no sense to me. Why is it stopping due to a signal LLDB was supposed to send to it? And why can't I reproduce it on Gentoo or FreeBSD?

Well... I'm as puzzled as you are. Whatever the problem is, it seems to be limited to debian (kernel). I've filed a bug about that: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1014754.

One good thing I learned is that this problem seems to be limited to SIGSTOPs, so it should be possible to work around it by using some other signal. That's probably a good idea anyway, since SIGSTOP has the unpleasant property of leaving around a stopped zombie if something goes wrong (that's what killed the bot over the weekend).

mgorny reopened this revision.Jul 11 2022, 8:17 AM

Heh, and I actually thought using SIGSTOP was a good idea because it's unlikely to have any unexpected side effects ;-).

This revision is now accepted and ready to land.Jul 11 2022, 8:17 AM

Also, big, big thanks for helping out with this!