Page MenuHomePhabricator

[lldb-mi] Fix hanging of target-select-so-path.test
ClosedPublic

Authored by apolyakov on Sep 15 2018, 12:53 PM.

Details

Summary

The target-select-so-path test might hang on
some platforms. The reason of that behavior
was in incorrect usage of Filecheck and lldb-mi
processes. Instead of redirecting lldb-mi's output
to Filecheck, we should run lldb-mi session,
finish the session, collect its output and then pass
it to Filecheck.
Also, this patch adds a timer to the test to prevent
it from hanging in the future.

Event Timeline

apolyakov created this revision.Sep 15 2018, 12:53 PM

It looks like it would be easy to forget to add -gdb-exit to every test. Wouldn't it be better to have the lldb-mi automatically produce a -gdb-exit command when it reads EOF from stdin?

It seems reasonable, I'll look at this.

apolyakov updated this revision to Diff 166290.Sep 20 2018, 7:57 AM
apolyakov retitled this revision from [lldb-mi] Improve lldb-mi LIT tests to [lldb-mi] Fix hanging of target-select-so-path.test.
apolyakov edited the summary of this revision. (Show Details)

I found out that the reason of hanging of the test suite was in incorrect usage of Filecheck and lldb-mi processes in target-select-so-path.test. Current patch has been tested on CentOS 7.

These changes fixed the issue for me.

This revision is now accepted and ready to land.Sep 20 2018, 9:04 AM

FYI, for me the test is still hanging on Arch Linux.

Can you provide some logs? For example, you might be able to run this test (llvm-lit -a -vv .../target-select-so-path.test) and kill lldb-mi and filecheck processes if they hang. Also, information about which processes hang would be useful too (lldb-mi or filecheck or maybe both).

tatyana-krasnukha requested changes to this revision.Sep 20 2018, 10:30 AM

Alexander, please add a timeout to be sure that the test is not hanging even if something goes wrong. A possible solution for Python 2:

This revision now requires changes to proceed.Sep 20 2018, 10:30 AM
apolyakov edited the summary of this revision. (Show Details)

Thanks to @tatyana-krasnukha for the idea about a timer. Added a timer to target-select-so-path test.

I'm accepting this as seems to improve the situation, even though it's not a complete fix.

(Not sure about the 2-minute timeout value though, but we'll see how the bots react).

teemperor accepted this revision.Sep 20 2018, 12:51 PM

Can you provide some logs? For example, you might be able to run this test (llvm-lit -a -vv .../target-select-so-path.test) and kill lldb-mi and filecheck processes if they hang. Also, information about which processes hang would be useful too (lldb-mi or filecheck or maybe both).

@teemperor can you do this? It'll help me a lot.

Sure, here is the output: After some debugging it seems that our select wrapper code is failing. We probably could fix this by terminating the process like in the other error cases in the same method (grep for the "failed to write to the unnamed pipe" error to see the code).

******************** TEST 'lldb :: tools/lldb-mi/target/target-select-so-path.test' FAILED ********************
Script:
--
: 'RUN: at line 3';   /home/teemperor/.llvm/rel-build/./bin/clang -o /home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.tmp /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/inputs/main.c -g
: 'RUN: at line 4';   python /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/inputs/target-select-so-path.py "/home/teemperor/.llvm/rel-build/bin/lldb-server gdbserver" "/home/teemperor/.llvm/rel-build/bin/lldb-mi --synchronous /home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.tmp" /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/target-select-so-path.test
--
Exit Code: 143

Command Output (stderr):
--
+ : 'RUN: at line 3'
+ /home/teemperor/.llvm/rel-build/./bin/clang -o /home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.tmp /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/inputs/main.c -g
+ : 'RUN: at line 4'
+ python /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/inputs/target-select-so-path.py '/home/teemperor/.llvm/rel-build/bin/lldb-server gdbserver' '/home/teemperor/.llvm/rel-build/bin/lldb-mi --synchronous /home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.tmp' /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/target-select-so-path.test
failed to write to the unnamed pipe: timed out
/home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.script: line 2:  7284 Terminated              python /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/inputs/target-select-so-path.py "/home/teemperor/.llvm/rel-build/bin/lldb-server gdbserver" "/home/teemperor/.llvm/rel-build/bin/lldb-mi --synchronous /home/teemperor/.llvm/rel-build/tools/lldb/lit/tools/lldb-mi/target/Output/target-select-so-path.test.tmp" /home/teemperor/.llvm/llvm/tools/lldb/lit/tools/lldb-mi/target/target-select-so-path.test

--

Do you mean ConnectToRemote method from lldb/tools/lldb-server/lldb-gdbserver.cpp?

Yes, the writeSocketIdToPipe(unnamed_pipe_fd, socket_id); fails in this method/file.

AFAIR, adding an exit(...) to ConnectToRemote won't solve this problem. The test will still be failing on Arch.

This comment was removed by teemperor.

Posting my mail here for the record:

I was more hoping it would at least turn the deadlock into a fail,
this way I could at least run the test suit.

Anyway, the actual issue is related Python 3: Arch Linux (and probably
some other distributions that "already" upgraded to Python 3 as
default) will launch Python 3 when the test script calls python ....
And for some reason in Python 3.7, we will not inherit our FD from our
pipe to the subprocess, which causes this strange behavior when write
to the unrelated FD number in ConnectToRemote. When I explicitly call
Python 2.7 from the test, everything runs as expected.

The python docs don't see to mention this change (and it seems like a
bug to me), so I'm open for ideas how to fix this properly. In any
case this problem doesn't block this review.

AFAIR, adding an exit(...) to ConnectToRemote won't solve this problem. The test will still be failing on Arch.

I was more hoping it would at least turn the deadlock into a fail,
this way I could at least run the test suit.

Anyway, the actual issue is related Python 3: Arch Linux (and probably
some other distributions that "already" upgraded to Python 3 as
default) will launch Python 3 when the test script calls python ....
And for some reason in Python 3.7, we will not inherit our FD from our
pipe to the subprocess, which causes this strange behavior when write
to the unrelated FD number in ConnectToRemote. When I explicitly call
Python 2.7 from the test, everything runs as expected.

The python docs don't see to mention this change (and it seems like a
bug to me), so I'm open for ideas how to fix this properly. In any
case this problem doesn't block this review.

(I hope this comment shows up in Phabricator, as it seems
reviews.llvm.org is currently having some internal errors).
Am So., 23. Sep. 2018 um 01:03 Uhr schrieb Alexander Polyakov via
Phabricator <reviews@reviews.llvm.org>:

apolyakov added a comment.

AFAIR, adding an exit(...) to ConnectToRemote won't solve this problem. The test will still be failing on Arch.

https://reviews.llvm.org/D52139

apolyakov added a comment.EditedSep 24 2018, 3:58 AM

If so, we can try to run the script with python2.x. @teemperor can you try to modify target-select-so-path.test this way:
change # RUN: python %p/inputs/target-select-so-path.py "%debugserver" "%lldbmi %t" %s to
# RUN: python2 %p/inputs/target-select-so-path.py "%debugserver" "%lldbmi %t" %s

If this works for you on Arch, I'll update the revision.

Yeah, explicitly typing python2 is what I did to fix it. Not sure if that breaks other OSs though (e.g. if a system has no python2 binary, but only python).

Also, that fix should be its own revision. It's not connected to the idea behind this revision from what I can see.

Reduced timer from 120 to 30 seconds.

A lot of tests are failing with Python 3, at least on CentOS. So, I agree the problem doesn't block this review.

This revision is now accepted and ready to land.Sep 24 2018, 11:34 AM
This revision was automatically updated to reflect the committed changes.
labath added a subscriber: labath.Sep 25 2018, 6:54 AM

Anyway, the actual issue is related Python 3: Arch Linux (and probably
some other distributions that "already" upgraded to Python 3 as
default) will launch Python 3 when the test script calls python ....
And for some reason in Python 3.7, we will not inherit our FD from our
pipe to the subprocess, which causes this strange behavior when write
to the unrelated FD number in ConnectToRemote. When I explicitly call
Python 2.7 from the test, everything runs as expected.

The python docs don't see to mention this change (and it seems like a
bug to me), so I'm open for ideas how to fix this properly. In any
case this problem doesn't block this review.

It's more like python 2 had a bug where it always inherited the file descriptor, and then python3 fixed that and introduced a special popen argument to control the inheriting behavior.

Back when we were designing this test, I demonstrated the necessary incantations for this to work on both python2 and 3 https://reviews.llvm.org/D49739?id=157310#inline-438290. It seems that did not make it into the final version..