This is an archive of the discontinued LLVM Phabricator instance.

Fix a race between lldb's packet timeout and killing the profile thread
ClosedPublic

Authored by jingham on Feb 21 2020, 5:01 PM.

Details

Summary

The debugserver profile thread used to suspend itself between samples with
a usleep. When you detach or kill, MachProcess::Clear would delay replying
to the incoming packet until pthread_join of the profile thread returned.
If you were unlucky, or the sleep interval was long, pthread_join could take
longer than the packet timeout to return. Then lldb would report an error
saying the detach did not succeed, even though the detach was in fact
successful.
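
For context, here is a simplified sketch of the old shape of the two pieces
involved. The names follow MachProcess.mm, but the bodies are abbreviated and
illustrative, not the exact code:

static void *ProfileThread(void *arg) {
  MachProcess *proc = (MachProcess *)arg;
  while (proc->IsProfilingEnabled()) {
    // ... gather one sample and send it up ...
    // Sleeps for the whole interval (often around a second) with no way
    // to interrupt it early.
    usleep(proc->ProfileInterval());
  }
  return NULL;
}

void MachProcess::Clear(bool detaching) {
  // ...
  if (m_profile_thread) {
    // Blocks until the profile thread wakes from usleep and notices it
    // should exit; the reply to the detach/kill packet waits on this.
    pthread_join(m_profile_thread, NULL);
    m_profile_thread = NULL;
  }
  // ...
}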

I replaced the usleep with a PThreadEvent. The profile thread now does a
timed WaitForSetEvents, and when debugserver wants to stop the profile
thread it sets the event bit, so the sleep exits immediately.
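
And a minimal sketch of the new shape, using the names that appear in the
quoted code later in this review (again abbreviated; the actual patch differs
in detail):

static void *ProfileThread(void *arg) {
  MachProcess *proc = (MachProcess *)arg;
  while (proc->IsProfilingEnabled()) {
    // ... gather one sample and send it up ...
    timespec ts;
    std::chrono::microseconds dur(proc->ProfileInterval());
    const auto secs = std::chrono::duration_cast<std::chrono::seconds>(dur);
    const auto usecs = dur % std::chrono::seconds(1);
    DNBTimer::OffsetTimeOfDay(&ts, secs.count(), usecs.count());
    // Returns early (nonzero) as soon as the cancel bit is set.
    if (proc->m_profile_events.WaitForSetEvents(eMachProcessProfileCancel, &ts))
      break;
  }
  return NULL;
}

void MachProcess::StopProfileThread() {
  if (m_profile_thread == NULL)
    return;
  // Wake the profile thread out of its timed wait right away.
  m_profile_events.SetEvents(eMachProcessProfileCancel);
  pthread_join(m_profile_thread, NULL); // now returns promptly
  m_profile_thread = NULL;
  m_profile_events.ResetEvents(eMachProcessProfileCancel);
}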

Note that you have to get fairly unlucky, because when lldb times out a packet
it then sends a qEcho and tries to get back in sync. That adds some extra
delay, which might give the detach a chance to succeed. For years I've had
occasional mysterious reports of detach failing like this - only under Xcode,
which is currently the only client of the profiling info that I know of - but
I didn't get around to chasing it down until now.

Diff Detail

Event Timeline

jingham created this revision. Feb 21 2020, 5:01 PM
Herald added a project: Restricted Project. · Feb 21 2020, 5:01 PM

Lots of change for something that might be fixed much more easily:

Alt way: why not just set m_profile_enabled to false in StopProfileThread() and let the loop exit on its next iteration? The code changes would be much smaller. The inline comments below marked "alt way:" refer to this solution; a minimal sketch follows.
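
For reference, a sketch of that alternative, assuming m_profile_enabled is
the flag behind IsProfilingEnabled() (illustrative only, not the exact code):

void MachProcess::StopProfileThread() {
  if (m_profile_thread == NULL)
    return;
  // Clear the flag; the loop notices it on its next pass.
  m_profile_enabled = false;
  // Still blocks until the current sleep finishes, which can be up to one
  // full ProfileInterval().
  pthread_join(m_profile_thread, NULL);
  m_profile_thread = NULL;
}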

lldb/test/API/macosx/profile_vrs_detach/TestDetachVrsProfile.py
28

fix comment?

lldb/tools/debugserver/source/MacOSX/MachProcess.h
341–346

Alt way (see main comment): remove these lines

385

Alt way: remove

lldb/tools/debugserver/source/MacOSX/MachProcess.mm
28

remove

34

remove

490

alt way: remove

1326

alt way:

m_profile_enabled = false;
1329

alt way: remove

2511–2538

Alt way: revert all changes here; when we call StopProfileThread, it will set m_profile_enabled = false and this loop will naturally exit.

Your way: Move the conversion of profile interval out of the loop?

timespec ts;
using namespace std::chrono;
// Convert the profile interval to seconds/microseconds once, outside the loop.
std::chrono::microseconds dur(proc->ProfileInterval());
const auto dur_secs = duration_cast<seconds>(dur);
const auto dur_usecs = dur % std::chrono::seconds(1);
while (proc->IsProfilingEnabled()) {
  nub_state_t state = proc->GetState();
  if (state == eStateRunning) {
    std::string data =
        proc->Task().GetProfileData(proc->GetProfileScanType());
    if (!data.empty()) {
      proc->SignalAsyncProfileData(data.c_str());
    }
  } else if ((state == eStateUnloaded) || (state == eStateExited) ||
             (state == eStateDetached)) {
    // Done. Get out of this thread.
    break;
  }
  // Compute the absolute deadline for this iteration's wait.
  DNBTimer::OffsetTimeOfDay(&ts, dur_secs.count(), dur_usecs.count());
  // Exit if requested.
  if (proc->m_profile_events.WaitForSetEvents(eMachProcessProfileCancel, &ts))
    break;
}
clayborg accepted this revision. Feb 24 2020, 11:10 AM

Jim Ingham said in email:

I don’t understand your suggestion. The point of this change was to get the profile loop to exit without waiting out the full sleep time. That time is generally pretty long (a second or thereabouts), so if the main debugserver thread waits on the profile thread’s timeout before replying to the detach packet, lldb might already have timed out waiting for the reply by the time the reply goes out.

Gotcha, then yes ignore my comments for alt way...

You could also fix this by not doing the pthread_join, but instead having the detach finish independently and letting the profile thread exit after the fact - but that seems like asking for trouble. Or you could reply to the detach right away and then wait on the profile thread to exit as debugserver is going down, but the current code is not at all set up to do that.

Anyway, this solution has the benefit of being exactly what you want to have happen - the profile thread stops its wait immediately when told to exit. It relies on well-tested mechanisms and doesn’t seem terribly intrusive.

That is true.

BTW, I thought about doing the time calculation outside the profile thread loop, but I couldn’t see anything that says you have to stop and restart the profile thread when you change the timeout, so it didn’t seem safe to calculate the timeout only once. However, if we want to require that you stop and start the profile thread when you change its wait interval, it would be fine to do that.

Fair point. All objections removed then, as this is the most performant and bulletproof solution.

This revision is now accepted and ready to land. Feb 24 2020, 11:10 AM
This revision was automatically updated to reflect the committed changes.