This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Fix CommandInterpreter::DidProcessStopAbnormally() with multiple threads
ClosedPublic

Authored by mstorsjo on Sep 16 2022, 6:23 AM.

Details

Summary

If a process has multiple threads, the thread with the stop
info might not be the first one in the thread list.

On Windows, under certain circumstances, processes seem to have
one or more extra threads that haven't been launched by the
executable itself, waiting in NtWaitForWorkViaWorkerFactory. If the
main (stopped) thread isn't the first one in the list (the order
seems nondeterministic), DidProcessStopAbnormally() would return
false prematurely, instead of inspecting later threads.

The main observable effect of DidProcessStopAbnormally() erroneously
returning false, is when running lldb with multiple "-o" parameters
to specify multiple commands to execute on the command line.

After an abnormal stop, lldb would stop executing "-o" parameters
and execute "-k" parameters instead - but due to this issue, it
would instead keep executing "-o" parameters as if there was no
abnormal stop. (If multiple parameters are specified via a script
file via the "-s" option, all of the commands in that file are
executed regardless of whether there's an abnormal stop inbetween.)

Diff Detail

Event Timeline

mstorsjo created this revision.Sep 16 2022, 6:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2022, 6:23 AM
mstorsjo requested review of this revision.Sep 16 2022, 6:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2022, 6:23 AM

We never made any guarantee about the order threads would be listed in the array returned by GetThreadList. I'm not sure it would be worth trying to do that, because often you don't know whether a thread's public StopInfo will be valid even if the private stop info is. For instance, if a thread hits a thread specific breakpoint for a thread that isn't in the breakpoint's thread specifier, then we reset the stop info back to an empty one because formally that thread didn't hit our breakpoint... So this change is clearly right.

I don't suppose there's a way to write a test for this?

We never made any guarantee about the order threads would be listed in the array returned by GetThreadList. I'm not sure it would be worth trying to do that, because often you don't know whether a thread's public StopInfo will be valid even if the private stop info is. For instance, if a thread hits a thread specific breakpoint for a thread that isn't in the breakpoint's thread specifier, then we reset the stop info back to an empty one because formally that thread didn't hit our breakpoint... So this change is clearly right.

I don't suppose there's a way to write a test for this?

Actually, yes, it is testable fairly easily - by firing up a test example with 2 threads and hitting a crash or breakpoint in either of them, and running it with a -k option. Before this fix, the test would still pass maybe half of the time, but now it should pass reliably.

mstorsjo updated this revision to Diff 460880.Sep 16 2022, 1:31 PM

Added a testcase. I'm not sure if this is the most accurate/correct place for the test, but the name is at least verbose enough to describe what it exercises.

DavidSpickett added inline comments.Sep 23 2022, 5:20 AM
lldb/source/Interpreter/CommandInterpreter.cpp
2473–2474

Please add a comment explaining why we walk all the threads.

(though in hindsight it does seem the obvious thing to do)

lldb/test/Shell/Driver/CommandOnCrashMultiThreaded.test
2

Isn't this fixing an issue that you saw on Windows as well?

lldb/test/Shell/Driver/Inputs/CommandOnCrashMultiThreaded.cpp
6

This is an x86 breakpoint, right? Shame there is no __builtin_break (well, msvc has one apparently but no help for this).

I might add the AArch64 and Arm equivalent if I have some spare time.

mstorsjo added inline comments.Sep 23 2022, 5:35 AM
lldb/source/Interpreter/CommandInterpreter.cpp
2473–2474

Sure, can do

lldb/test/Shell/Driver/CommandOnCrashMultiThreaded.test
2

Yes, that was the place where I observed the issue (as Windows apparently, under some circumstances, can have a couple of extra threads running that haven't been spawned by the code of the process itself), but it's easily reproducible on any platform as long as you make sure there's more than one thread running.

I copied the basis for the test from Shell/Register/x86-multithreaded-read.test - most of the tests there under Shell/Register seem to have such an XFAIL, not sure exactly why. (In this particular test, at least the -pthread linker flag might be a trivial platform specific difference that might break it, which would need to be abstracted somewhere.)

lldb/test/Shell/Driver/Inputs/CommandOnCrashMultiThreaded.cpp
6

Yes, afaik this is a regular x86 programmatic breakpoint.

About the portability of breakpoints on ARM btw; the exact udf #xx code for a programmatic breakpoint is OS dependent, and does differ between Linux and Windows at least - see e.g. https://github.com/mingw-w64/mingw-w64/commit/38c9f0318c6509110706afdc52adc12d8cc34ead.

mstorsjo updated this revision to Diff 462454.Sep 23 2022, 5:37 AM

Add a code comment about the code that is changed in the patch.

DavidSpickett accepted this revision.Sep 23 2022, 5:45 AM

LGTM

lldb/test/Shell/Driver/CommandOnCrashMultiThreaded.test
2

Yeah I didn't notice the pthread flag, that could be it. Makes sense to me.

lldb/test/Shell/Driver/Inputs/CommandOnCrashMultiThreaded.cpp
6

Good point, definitely would have tripped over that. I'll double check arm64 as well.

This revision is now accepted and ready to land.Sep 23 2022, 5:45 AM
mstorsjo added inline comments.Sep 26 2022, 1:01 AM
lldb/test/Shell/Driver/CommandOnCrashMultiThreaded.test
2

Actually, it turns out that this test does pass on Windows (I hadn't tested before, I had just inherited the XFAIL from the test I based it on) - as -pthread isn't a linker flag but a compiler driver flag, the msvc target in clang just does nothing about it. So I'll remove the XFAIL here before pushing it.

I might add the AArch64 and Arm equivalent if I have some spare time.

I looked into this and there's a detail that makes this work on x86 only. So I'll leave it as is, it's getting tested somewhere, and that's what matters.

If you're curious this is the spanner in the works:

size_t NativeProcessProtocol::GetSoftwareBreakpointPCOffset() {
  switch (GetArchitecture().GetMachine()) {
  case llvm::Triple::x86:
  case llvm::Triple::x86_64:
  case llvm::Triple::systemz:
    // These architectures report increment the PC after breakpoint is hit.
    return cantFail(GetSoftwareBreakpointTrapOpcode(0)).size();

Meaning AArch64 doesn't. lldb on x86 can continue past a break in the program because of this:

(lldb) run
Process 31032 launched: '/tmp/test.o' (x86_64)
Process 31032 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x000055555555460f test.o`main at test.c:3
   1    int main() {
   2            __asm__ __volatile__("int3");
-> 3                  return 0;
   4    }
(lldb) c
Process 31032 resuming
Process 31032 exited with status = 0 (0x00000000)

On AArch64 we just stick on the break:

(lldb) run
Process 2162869 launched: '/tmp/test.o' (aarch64)
Process 2162869 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x0000aaaaaaaaa71c test.o`main at test.c:2:3
   1    int main() {
-> 2      asm volatile("brk #0xf000\n\t");
   3      return 0;
   4    }
(lldb) c
Process 2162869 resuming
Process 2162869 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x0000aaaaaaaaa71c test.o`main at test.c:2:3
   1    int main() {
-> 2      asm volatile("brk #0xf000\n\t");
   3      return 0;
   4    }

gdb has the same behaviour on AArch64 so I'll leave it as is until someone complains.

I might add the AArch64 and Arm equivalent if I have some spare time.

I looked into this and there's a detail that makes this work on x86 only. So I'll leave it as is, it's getting tested somewhere, and that's what matters.

If you're curious this is the spanner in the works:

size_t NativeProcessProtocol::GetSoftwareBreakpointPCOffset() {
  switch (GetArchitecture().GetMachine()) {
  case llvm::Triple::x86:
  case llvm::Triple::x86_64:
  case llvm::Triple::systemz:
    // These architectures report increment the PC after breakpoint is hit.
    return cantFail(GetSoftwareBreakpointTrapOpcode(0)).size();

Meaning AArch64 doesn't. lldb on x86 can continue past a break in the program because of this:

(lldb) run
Process 31032 launched: '/tmp/test.o' (x86_64)
Process 31032 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x000055555555460f test.o`main at test.c:3
   1    int main() {
   2            __asm__ __volatile__("int3");
-> 3                  return 0;
   4    }
(lldb) c
Process 31032 resuming
Process 31032 exited with status = 0 (0x00000000)

On AArch64 we just stick on the break:

(lldb) run
Process 2162869 launched: '/tmp/test.o' (aarch64)
Process 2162869 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x0000aaaaaaaaa71c test.o`main at test.c:2:3
   1    int main() {
-> 2      asm volatile("brk #0xf000\n\t");
   3      return 0;
   4    }
(lldb) c
Process 2162869 resuming
Process 2162869 stopped
* thread #1, name = 'test.o', stop reason = signal SIGTRAP
    frame #0: 0x0000aaaaaaaaa71c test.o`main at test.c:2:3
   1    int main() {
-> 2      asm volatile("brk #0xf000\n\t");
   3      return 0;
   4    }

gdb has the same behaviour on AArch64 so I'll leave it as is until someone complains.

Ah, yes, I remember hitting that issue occasionally. On the other hand, if it would be somewhat doable to fix it, it'd would be a great value add IMO - I've had to do acrobatics to set the program counter to the next instruction to resume running after such a breakpoint a couple times...