
fix a race in the LLDB test suite results collection
ClosedPublic

Authored by tfiala on Apr 17 2016, 8:55 PM.

Details

Reviewers
labath
Summary

The race boiled down to this:

If a test worker queue is able to run the test inferior and
clean up before the dosep.py listener socket is spun up, and
the worker queue is the last one (as would be the case when
there's only one test rerunning in the rerun queue), then
the test suite will exit the main loop before having a chance
to process any test events coming from the test inferior or
the worker queue job control.

I found this race to be far more likely on fast hardware.
Our Linux CI is one such example.  While it will show
up primarily during meta test events generated by
a worker thread when a test inferior times out or
exits with an exceptional exit (e.g. seg fault), it only
requires that the OS takes longer to hook up the
listener socket than it takes for the final test inferior
and worker thread to shut down.
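The timing window can be illustrated with a loose analogy (this is not the actual dosep.py code): a main loop whose exit condition is "all workers are done", checked without first draining pending events, will drop an event produced by a worker that finishes quickly.

```python
import queue
import threading

events = queue.Queue()
done = threading.Event()

def worker():
    # The worker produces a test event and then finishes immediately,
    # like a fast test inferior plus worker-thread shutdown.
    events.put("test-result")
    done.set()

t = threading.Thread(target=worker)
t.start()
t.join()

processed = []
# Buggy main loop: exit as soon as the workers are done, without
# draining pending events -- analogous to dosep.py leaving its main
# loop before the listener socket ever joined the asyncore map.
while not done.is_set():
    processed.append(events.get())

print(processed)  # [] -- the "test-result" event was never processed
```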

Diff Detail

Event Timeline

tfiala updated this revision to Diff 54027. Apr 17 2016, 8:55 PM
tfiala retitled this revision from to fix a race in the LLDB test suite results collection.
tfiala updated this object.
tfiala added a reviewer: labath.
tfiala added a subscriber: lldb-commits.

The fix simply has the listener socket always send a byte, and the initiators of the socket always wait for that byte before continuing. This ensures that, in both cases where a sender creates a socket to send test events, the dosep.py main loop cannot exit before the listener socket has been added to the asyncore map. This fixes the race.

The two initiators of sockets are:

  1. the dotest.py inferior process that is running a test case. This is the channel that normal test events (e.g. test start, test result, etc.) come over.
  2. the dosep.py worker queues (either true threads via the threading module, or faux threads that are really processes, via the multiprocessing library). They will spin up a socket if they need to report that a test inferior has generated an exceptional exit (i.e. a seg fault or some other signal-generating exit) or if a timeout forced the process to be killed.

It is the second case above, during a rerun pass, that was most likely to race.
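A minimal sketch of the handshake described above, using plain sockets and threads rather than the real dosep.py/asyncore machinery (the ack byte value here is an assumption; the review does not specify it):

```python
import socket
import threading

ACK = b"*"  # hypothetical ack byte; the actual value is not in the review

def listener(server_sock, received):
    conn, _ = server_sock.accept()
    conn.sendall(ACK)  # tell the initiator the listener side is ready
    received.append(conn.recv(1024))
    conn.close()

server = socket.socket()
server.bind(("localhost", 0))
server.listen(1)
port = server.getsockname()[1]

received = []
t = threading.Thread(target=listener, args=(server, received))
t.start()

# Initiator side: connect, then *wait for the ack byte* before
# sending any test events, so the listener is guaranteed to be up.
client = socket.socket()
client.connect(("localhost", port))
assert client.recv(1) == ACK  # block until the listener is ready
client.sendall(b"test-event")
client.close()
t.join()
server.close()
print(received)  # [b'test-event']
```

Because the initiator blocks on that single recv, it cannot finish and tear down before the listener has actually processed the connection.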

labath accepted this revision. Apr 18 2016, 2:04 AM
labath edited edge metadata.

This sounds a lot like the problem that has been plaguing our darwin buildbot (interestingly, I have never seen it happen on linux). Thanks for tracking that down.

I am not sure I understand the architecture of this system completely, but it sounds like a reasonable thing to do.

This revision is now accepted and ready to land. Apr 18 2016, 2:04 AM
tfiala closed this revision. Apr 18 2016, 9:15 AM

Closing with commit:
r266624

BTW, we are still seeing these fake timeouts on our buildbot http://lab.llvm.org:8011/builders/lldb-x86_64-darwin-13.4/builds/9770, so I am not sure if this actually fixed the problem. It could be that this is actually a different problem, but it does fit your description very well. In any case, I just thought I'd let you know about that...

Interesting, thanks Pavel!

I'll look at those as I start looking to test the test infrastructure.

Okay I just looked at those, Pavel.

They fall into a category that I know is not yet handled. This is the category:

  • No test method is "open" (i.e. the dotest.py inferior is not in between a start_test/end_test) when the timeout occurs. Thus, we cannot "charge" the timeout to a test method. Thus, it gets charged to the file.
  • When we rerun, we currently do not clear file-level results. (We should right before the rerun).

This hadn't been prioritized because I haven't in practice run into timeouts that are occurring outside of a test method. That implies either the startup or the shutdown of the python dotest.py inferior is hanging if you are consistently hitting that. That in itself is an interesting event to investigate.

I'll file a bug on this since there is a reasonable solution that will solve this one. I'll get a bug tracking that once I finish writing this.

> Okay I just looked at those, Pavel.
>
> They fall into a category that I know is not yet handled. This is the category:
>
>   • No test method is "open" (i.e. the dotest.py inferior is not in between a start_test/end_test) when the timeout occurs. Thus, we cannot "charge" the timeout to a test method. Thus, it gets charged to the file.

This is fine. I don't have an issue with the presentation of the results.

>   • When we rerun, we currently do not clear file-level results. (We should right before the rerun.)
>
>   This hadn't been prioritized because I haven't in practice run into timeouts that are occurring outside of a test method. That implies either the startup or the shutdown of the python dotest.py inferior is hanging if you are consistently hitting that. That in itself is an interesting event to investigate.

I'm not sure that is the case. The tests that are reported as timed out are not doing anything special (testdefaultcachelinesize, testlistener) or are completely skipped on Android (testpublicapiheader), so I'd be very surprised if we were seeing the timeouts there and nowhere else... It sounds more like a test runner bug to me...

In any case, I'll try to look into that once I finish dealing with other things bugging us. I just thought you should know about it, as I think it could be related to this...

If you're in a position where you can grab a sample (using the OS X sample tool) on the bot when they're "hanging", we can try to find out where they are at.

The reason I *think* they're not in a method is because they don't have test method information associated with them. But that could be some kind of bug.

In any event, I'm working on this in a TDD fashion now.

FYI: This patch seems to have broken all LLDB testing on Windows. I'll investigate why.

Lots of the socket calls now result in stack traces like this:

Traceback (most recent call last):
  File "C:\python_35\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\python_35\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 495, in process_dir_worker_multiprocessing_pool
    return process_dir(*args)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 442, in process_dir
    command, timeout, base_name, inferior_pid_events, full_test_path))
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 417, in call_with_timeout
    test_filename)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 372, in send_inferior_post_run_events
    send_events_to_collector(post_events, command)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 308, in send_events_to_collector
    formatter_spec = result_formatter.create_results_formatter(config)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py", line 112, in create_results_formatter
    results_file_object, cleanup_func = create_socket(config.port)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py", line 78, in create_socket
    sock.connect(("localhost", port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

The changes to result_formatter.py are after the problematic line, so the root cause is not immediately obvious. I have a guess that we're trying to open more sockets than before and are therefore exceeding a handle limit imposed by Python's use of ancient C RTL code.

My guess here would be that fixing the race (i.e. ensuring the listening side is up and running before allowing the sender to continue) is probably keeping more sockets open than before; it fixes the race, but does so by using a larger average number of resources.

There are at least a couple ways to solve that, if that indeed is the issue:

  1. Don't run as many threads on Windows (set --threads 24 or 32).
  2. We abstract the communication mechanism, and use something different on Windows and/or something better for all, that allows bi-directional communication and isn't as limited. The "something different" could be a file-based approach or some other kind of IPC mechanism. The key is that it needs to be compatible with the asyncore approach used as part of the main loop in dosep.py.

What you probably don't want to do is revert this on Windows, since it is likely ensuring some test events come in that were getting dropped before.

If you end up getting excessively stuck on this, I'd suggest doing #1 to eliminate the limit from being hit, while a better solution for #2 is worked out for Windows. (And if you get stuck on #2, let me know, I may be able to dig around a bit and help, or at least get a test suite for that part up so you can plug in an implementation that works).
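If the refused connections turn out to be transient (the listener briefly unavailable rather than permanently gone), one sender-side stopgap, not part of this patch and purely illustrative, would be a bounded connect retry:

```python
import socket
import time

def connect_with_retry(port, attempts=5, delay=0.1):
    """Hypothetical sender-side helper: retry a refused connection a
    few times before giving up, instead of dying on the first
    ConnectionRefusedError."""
    last_error = None
    for _ in range(attempts):
        sock = socket.socket()
        try:
            sock.connect(("localhost", port))
            return sock
        except ConnectionRefusedError as error:
            sock.close()
            last_error = error
            time.sleep(delay)
    raise last_error
```

A sender would call connect_with_retry(port) in place of a bare connect; this trades a little latency for robustness, though it would not help if the real problem is a hard handle limit.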

We already limit threads on Windows to work around other problems. Technically, we force Windows to always use the multiprocessing-pool instead of the threading-pool, which I think has the effect of limiting the number of threads.

I suspect that the first several invocations are stuck waiting "forever" for the new ACK byte (perhaps because of buffering?) and that causes the rest of the invocations to fail. I'm still investigating.

Having been broken for a week, it seems we've already collected 30-40 newly broken tests on Windows. :-(

Ick. Yeah that sounds like something to resolve quickly.

They shouldn't be getting stuck. That would imply that the side running via dosep.py (the listener socket) is either (1) no longer running, and therefore the listener isn't spinning up (although that would be catastrophic and nothing else would work), or (2) the listening socket that fires up isn't getting time, or is for some other reason not getting its send out.

Can you dial down the threads (which in the Windows case then means processes) further? How low are they set to now?

I got it. There was an exception being thrown from asyncore so early that it didn't give a proper stack trace. The problem boils down to Python 3 drawing a distinction between strings and bytearrays. Patch to come momentarily.

Thanks for your help.
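The str-versus-bytes distinction behind that fix is easy to reproduce: in Python 3, socket send calls require a bytes-like object. A minimal illustration (not the actual result_formatter.py code):

```python
import socket

a, b = socket.socketpair()

try:
    a.send("event")  # Python 3: sockets refuse str objects outright
except TypeError as error:
    print(error)     # e.g. "a bytes-like object is required, not 'str'"

a.send("event".encode("utf-8"))  # the fix: encode to bytes first
print(b.recv(16))                # b'event'

a.close()
b.close()
```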

tfiala added a subscriber: tfiala. Apr 25 2016, 8:50 PM

Awesome! Good catch, Adrian!

(And my pleasure!)

-Todd

labath added a comment. May 5 2016, 2:54 AM

> If you're in a position where you can grab a sample (using the OS X sample tool) on the bot when they're "hanging", we can try to find out where they are at.
>
> The reason I *think* they're not in a method is because they don't have test method information associated with them. But that could be some kind of bug.
>
> In any event, I'm working on this in a TDD fashion now.

It turned out that this was indeed a bug in lldb (now fixed with D19916). What threw us off was that this error started appearing right around the new test runner change, but it must have been a coincidence.

In any case, the sample tool was helpful in tracking this down, so thanks for that. :)

> In any case, the sample tool was helpful in tracking this down, so thanks for that. :)

Hey great! Glad you were able to track that down!