This is an archive of the discontinued LLVM Phabricator instance.

[NativeProcessLinux] Get rid of the thread state coordinator thread
ClosedPublic

Authored by labath on Apr 23 2015, 9:17 AM.

Details

Summary

This change removes the thread state coordinator thread by making all the operations it was
performing synchronous. In order to prevent deadlock, NativeProcessLinux must now always call
m_monitor->DoOperation with the m_threads_mutex released. This is needed because HandleWait
callbacks lock the mutex (which means the monitor thread will block waiting on whoever holds the
lock). If the other thread now requests a monitor operation, it will wait for the monitor thread
to process it, creating a deadlock.

In order to preserve this invariant, I have made several NPL methods into operations, so that
they acquire the lock inside the monitor thread.
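
Roughly, the pattern looks like this (a simplified sketch; the operation and member names here are made up and do not match the diff exactly):

Error
NativeProcessLinux::SomeMethod(...)
{
    // m_threads_mutex must NOT be held here: the monitor thread may be
    // blocked on it inside a HandleWait callback, and it is the thread
    // that has to execute our operation.
    Error error;
    SomeMethodOperation op(*this, error);
    m_monitor->DoOperation(&op);
    return error;
}

void
SomeMethodOperation::Execute(Monitor *monitor)
{
    // Now running on the monitor thread, so taking the lock here cannot
    // deadlock against waitpid processing.
    Mutex::Locker locker(m_process.m_threads_mutex);
    // ... work on m_process.m_threads ...
}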

Diff Detail

Repository
rL LLVM

Event Timeline

labath updated this revision to Diff 24308.Apr 23 2015, 9:17 AM
labath retitled this revision to [NativeProcessLinux] Get rid of the thread state coordinator thread.
labath updated this object.
labath edited the test plan for this revision. (Show Details)
labath added reviewers: vharron, chaoren.
labath added a subscriber: Unknown Object (MLST).
labath updated this revision to Diff 24369.Apr 24 2015, 2:50 AM

Fix up TSC unit tests.

chaoren added inline comments.Apr 28 2015, 11:26 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

There has to be a better way to do this. Is it not possible to acquire the lock without being an operation?

chaoren added inline comments.Apr 28 2015, 11:46 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

It might be a better idea for these to be actual operation classes, and not lambdas.

I guess I should provide some more meta-info about this patch series:

All of the ThreadStateCoordinator patches need to be applied serially. The order is D9227->D9254->D9296->D9321 (basically, just follow increasing revision numbers).
Of these, the most important is the first one, where I remove the thread that was executing the TSC and make processing of all TSC events synchronous. After this, a lot of the code which deals with the piping of data between threads (and between TSC and NativeProcessLinux) becomes redundant, so in the rest of the patches, I am slowly cleaning things up. These changes should not change the functionality, but at the end of it, I expect to save about 1000 lines of code and get rid of the ThreadStateCoordinator completely.

Oh and the series is not complete yet - I have a couple of changes queued locally and I will be pushing them slowly, and I also intend to make further simplifications. However, I wanted to get early feedback, so that I can simplify rebasing.

labath added inline comments.Apr 29 2015, 2:19 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

I don't see a way to avoid this (without employing some radically different design). Basically, we have two resources here: execution time on the privileged monitor thread and the m_threads_mutex.

a) processing waitpid events happens on the monitor thread, so it acquires the first resource first. Then for successful processing of some (all?) of the events you need to update the thread metadata, so you need to acquire the second resource. This creates a dependency execution_time -> mutex.

b) In resume, we were first acquiring the thread mutex, so we could safely iterate over the threads. Then we wanted to resume the threads, which means we needed some execution time on the privileged thread (second resource). Here, we have the dependency order reversed, which is bound to create a deadlock at some point.
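
Or, as a sketch (not the real functions, just the shape of the problem):

// (a) monitor thread: has "execution time" first, then wants the mutex
Monitor::HandleWait(...)                     // runs on the monitor thread
{
    Mutex::Locker locker(m_threads_mutex);   // execution_time -> mutex
    // ... update thread metadata ...
}

// (b) old Resume: has the mutex first, then wants "execution time"
NPL::Resume(...)
{
    Mutex::Locker locker(m_threads_mutex);   // mutex -> execution_time
    for (auto &thread : m_threads)
        m_monitor->DoOperation(...);         // waits for the monitor thread,
                                             // which is stuck in (a)
}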

I suspect this was the main reason for the existence of the TSC thread. When TSC was in a separate thread, the dependency cycle was broken, because you were simply queuing the requests for execution time in the TSC, which was then asynchronously executing them when the resource became available. However, this asynchronous execution made the codebase much more complicated, because you needed a lot of code to deal with passing data between threads. Also, because of the asynchronous nature of the execution, it was possible to have individual resume requests from this big resume operation interleaved with waitpid events of threads stopping. This led to situations where you were halfway through resuming the threads, but you were already processing stop events for those threads and queuing operations to stop them.

I believe this design, where you first resume the threads in a single atomic operation and then deal with the fallout (waitpid events), makes it much clearer what is going on. I think having a couple more "Operations" is a small price to pay for such clarity.

What do you think?

labath updated this revision to Diff 24625.Apr 29 2015, 6:14 AM

Fix null pointer error during destruction

An error would occur if we tried to destroy the Monitor while it was in the middle of processing
a waitpid request. This happened because the processing could result in a Monitor operation being
executed through the now-null m_monitor_up. This is a bad design (we have the same problem with
NativeProcessLinux), and I plan to fix it in the future, but for now I fix it by delaying the
destruction of the Monitor. Instead, I introduce the Terminate method, which stops the monitor
thread when we need it.
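
In rough terms, the shutdown path now looks something like this (a sketch; the wrapper function name here is hypothetical):

void
NativeProcessLinux::ShutDownMonitor()
{
    // Stop the monitor thread first, so that no waitpid processing can be
    // running while (or after) the Monitor is destroyed.
    if (m_monitor_up)
    {
        m_monitor_up->Terminate();   // joins the monitor thread
        m_monitor_up.reset();
    }
}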

labath added inline comments.Apr 29 2015, 6:16 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

About the operation class comment:
If you think having a separate operation class will make things clearer, I can go ahead and do it. However, I find it makes it easier to read the code if the lambda (which is the actual meat of the function) is in the function declaration. If I have a separate operation class I have to scroll from e.g. ReadRegisterValue to ReadRegOperation to ReadRegOperation::Execute to find the actual implementation.

I guess it's a matter of taste, but I eventually wanted to get rid of the Operation class altogether and morph the code into something like this:

Error
NPL::Monitor::DoOperation(std::function<Error()> lambda);

Error
NPL::ReadRegisterValue(...)
{
    return monitor_up->DoOperation([&] {
        // actual code goes here
    });
}

However, I did not want to mess with this in this patch, so I postponed it, and created LambdaOperation as a temporary placeholder. What do you think about that? I can also do the DoOperation modification before this patch if you think it works better...

chaoren added inline comments.Apr 29 2015, 7:05 PM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
3873 ↗(On Diff #24625)

Please ignore this comment. Phabricator won't let me delete.

2977 ↗(On Diff #24369)

I believe this design, where you first resume the threads in a single atomic operation and then deal with the fallout (waitpid events) makes it much clearer what is going on

Agreed, but if we could unlock the thread mutex before using the privileged thread, we have the option of doing that for the moment, until we can come up with a better design.

I find it makes it easier to read the code if the lambda (which is the actual meat of the function) is in the function declaration.

I just feel uneasy about putting large chunks of code in lambdas (at least use explicit capturing). Another thing is that the class name encapsulates important information about its purpose. I would much rather run into a ResumeOperation object than a LambdaOperation object while debugging.

FWIW, this also works:

Error
NPL::Resume(...)
{
    Error error;
    class ResumeOperation : public Operation
    {
        ...
    } op{error};
    m_monitor_up->DoOperation(&op);
    return error;
}

It might be a good idea if we can get some more opinions on this.

labath added inline comments.Apr 30 2015, 1:52 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

I believe this design, where you first resume the threads in a single atomic operation and then deal with the fallout (waitpid events) makes it much clearer what is going on

Agreed, but if we could unlock the thread mutex before using the privileged thread, we have the option of doing that for the moment, until we can come up with a better design.

I don't see how that is possible, without compromising the atomicity of the resume. Even if you unlock the mutex right before the call to the privileged thread, you will still have the possibility of some waitpid event sneaking through and changing the list of threads (stopping, creating new threads, ...) from under you.

I have been contemplating one other solution. What do you think about a design like this:

Error NPL::Resume(...)
{
  m_monitor_op->BeginOperationBlock();
  ...
  m_monitor_op->EndOperationBlock();
  return error;
}

Here, the effect of BeginOperationBlock() would be that it locks the privileged thread into a loop where it just executes operation requests. Any waitpid processing gets deferred until we execute EndOperationBlock(). We could easily have these nest if we needed to, and we can make a ScopedOperationLocker (or something) to make sure we don't forget to execute the End operation.
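
The ScopedOperationLocker could be as trivial as (just a sketch):

class ScopedOperationLocker
{
public:
    explicit ScopedOperationLocker(Monitor &monitor) : m_monitor(monitor)
    {
        m_monitor.BeginOperationBlock();
    }

    ~ScopedOperationLocker()
    {
        m_monitor.EndOperationBlock();
    }

private:
    Monitor &m_monitor;
};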

What do you make of that? I think I'm starting to prefer this solution myself....

I find it makes it easier to read the code if the lambda (which is the actual meat of the function) is in the function declaration.

I just feel uneasy about putting large chunks of code in lambdas (at least use explicit capturing). Another thing is that the class name encapsulates important information about its purpose. I would much rather run into a ResumeOperation object than a LambdaOperation object while debugging.

I can add explicit capturing if you like. I agree with the debugging point, but I don't have a good answer for that. But hey, at least the name of the lambda would be something like NPL::Resume::unreadable_mess_of_characters, so you have _some_ chance of figuring it out :)

FWIW, this also works:
...

It works, but it's still too much visual noise for my taste...

It might be a good idea if we can get some more opinions on this.

Feel free to show this to anyone over there. Everyone in London is on holiday.

Guys, are you reading this? Any thoughts?

labath updated this revision to Diff 24692.Apr 30 2015, 3:32 AM

Alternative proposal to prevent monitor thread <-> m_threads_mutex deadlock.

labath added inline comments.Apr 30 2015, 3:40 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
3887 ↗(On Diff #24692)

I have an idea how to get rid of these as well.

We need to make the process of acquiring m_threads_mutex virtual in the base class. Then we can override this behaviour to acquire the monitor lock as well. This should be relatively easy and it should make things future-proof. However, I will leave that for a separate patch.
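
Something along these lines (just a sketch, names invented):

// base class (NativeProcessProtocol)
virtual void LockThreads()   { m_threads_mutex.Lock(); }
virtual void UnlockThreads() { m_threads_mutex.Unlock(); }

// NativeProcessLinux override
void LockThreads() override
{
    m_operation_mutex.Lock();    // also serialize against monitor operations
    m_threads_mutex.Lock();
}

void UnlockThreads() override
{
    m_threads_mutex.Unlock();
    m_operation_mutex.Unlock();
}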

chaoren added inline comments.Apr 30 2015, 4:21 PM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

I don't see how that is possible, without compromising the atomicity of the resume.

If I understood correctly, ThreadStateCoordinator::RequestThreadResume originally only enqueued the event to execute it asynchronously, so it must not have depended on the thread mutex. I do agree that it would be ideal to make all of these operations atomic, but I think this CL should focus on removing the TSC, and leave the atomic operations update for a later CL. You are right that waitpid events could sneak through in between, but that possibility should have already existed before your change, so there's no regression.

I have been contemplating one other solution. What do you think about a design like this:

You're still going to do m_monitor_up->DoOperation(&op) (or a lambda instead of op) eventually and there's no way around it. So this seems like an added layer of complexity with not much benefit.

We should use either only Operation classes (no lambdas), or only lambdas (no Operation classes). Which to use eventually is up for debate, but currently everything is done with Operation classes, so we should stick with that for now (since the focus right now is removing TSC).

A more "correct" solution seems to be to refactor the code so that the locking of the thread mutex always happens within the privileged thread, and avoid nesting the DoOperation calls. But this is not a trivial task, so It's best if we separate it from this CL.

It works, but it's still too much visual noise for my taste...

Agreed, let's not use that.

labath added inline comments.May 1 2015, 3:32 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

If I understood correctly, ThreadStateCoordinator::RequestThreadResume originally only enqueued the event to execute it asynchronously, so it must not have depended on the thread mutex. I do agree that it would be ideal to make all of these operations atomic, but I think this CL should focus on removing the TSC, and leave the atomic operations update for a later CL. You are right that waitpid events could sneak through in between, but that possibility should have already existed before your change, so there's no regression.

I don't think we can separate the two things. The Resume operation was atomic (with respect to the m_threads list) before this change, so if we made it non-atomic now, it would be a step backwards. The reason why it did not deadlock before was because the events were just queued and not executed immediately. This way, waitpid could happen while you are processing the event, but not while you are queueing them. Then, the waitpid would queue some events of its own, which would get processed after the Resume events, so _in theory_ everything was nice and serialized. It was a very clever design, but it was easy to get things wrong (e.g. processing of waitpid had to queue events which were "correct" no matter what events were queued before or after it. And it could not depend on the state of the threads to help it determine the correct sequence of events because the threads' state could change before its events got a chance to execute).

Now that we are making the processing of events synchronous, things are easier because our internal state of threads exactly reflects the state of threads in the system. However, we still need to make sure that operations on this state are atomic.

You're still going to do m_monitor_up->DoOperation(&op) (or a lambda instead of op) eventually and there's no way around it. So this seems like an added layer of complexity with not much benefit.

Yes, I will be doing the operations, but this way they will be atomic, which is exactly what we want. It complicates the implementation a little bit, but not by much. And the usage of it is actually very easy - it behaves just like a mutex, check out the new version of this patch. So in essence, all the user needs to know is "when I want to execute a sequence of operations, I need to lock this 'mutex'".

We should use either only Operation classes (no lambdas), or only lambdas (no Operation classes). Which to use eventually is up for debate, but currently everything is done with Operation classes, so we should stick with that for now (since the focus right now is removing TSC).

Agreed. The new version does not use lambdas any more.

A more "correct" solution seems to be to refactor the code so that the locking of the thread mutex always happens within the privileged thread, and avoid nesting the DoOperation calls. But this is not a trivial task, so It's best if we separate it from this CL.

I have considered making sure DoOperation calls do not nest, but this is not easy, since some operations (e.g. ReadMemory) are used from within other operations and are also external interfaces. So you would need two versions of them... So, right now, I chose to allow nesting of operations, but this is also not a new concept. Recursive mutexes exist exactly for this purpose, so you can think of this OperationLocker as a recursive mutex. In fact, the m_threads_mutex is a recursive mutex precisely because of this - you don't need a special version of functions for when you already hold the mutex.
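
To spell out the analogy (a plain standard-library example, not our code):

#include <mutex>

std::recursive_mutex m;

void Inner()
{
    std::lock_guard<std::recursive_mutex> lock(m); // OK: the same thread may
                                                   // re-lock a recursive mutex
}

void Outer()
{
    std::lock_guard<std::recursive_mutex> lock(m);
    Inner(); // nested lock, analogous to ReadMemory being both an external
             // interface and something called from within another operation
}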

Hm... where do we go from here? I'm happy with either of these versions (making big operations (without lambdas) or using operation mutexes) and can go with whichever you prefer. Or we can try to discuss this online to come up with some alternative... What do you think?

chaoren added inline comments.May 1 2015, 8:32 AM
source/Plugins/Process/Linux/NativeProcessLinux.cpp
2977 ↗(On Diff #24369)

I don't think we can separate the two things. The Resume operation was atomic (with respect to the m_threads list) before this change, so if we made it non-atomic now, it would be a step backwards. The reason why it did not deadlock before was because the events were just queued and not executed immediately. This way, waitpid could happen while you are processing the event, but not while you are queueing them. Then, the waitpid would queue some events of its own, which would get processed after the Resume events, so _in theory_ everything was nice and serialized. It was a very clever design, but it was easy to get things wrong (e.g. processing of waitpid had to queue events which were "correct" no matter what events were queued before or after it. And it could not depend on the state of the threads to help it determine the correct sequence of events because the threads' state could change before its events got a chance to execute).

Ah, then I didn't understand correctly. Sorry, please forget everything I said about a separate CL for atomicity.

I have considered making sure DoOperation calls do not nest, but this is not easy, since some operations (e.g. ReadMemory) are used from within other operations and are also external interfaces. So you would need two versions of them...

I don't have an issue with nesting DoOperation calls in and of itself, but in the specific case of NPL::Resume, there's already a ResumeOperation that supposedly implements NPL::Resume, so I was thinking of moving the bulk of NPL::Resume (the part that involves locking the thread mutex) into the already existing ResumeOperation.

Could we make m_operation_mutex accessible from NPL? And make Handle{Signal, Wait} trylock m_operation_mutex before proceeding (and skip if the trylock fails). Since the operations will be synchronous, it doesn't seem like we'll be needing it in DoOperation either.
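
Something like this (just a sketch of what I mean, not tested):

void
Monitor::HandleWait(...)
{
    // If an operation block currently owns m_operation_mutex, skip the
    // waitpid processing for now.
    if (!m_operation_mutex.try_lock()) // assuming a std::mutex-style try_lock()
        return;
    // ... process the waitpid event ...
    m_operation_mutex.unlock();
}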

Or we can try to discuss this online to come up with some alternative... What do you think?

Yes, please. The communication latency is really making this take longer than necessary.

chaoren accepted this revision.May 1 2015, 11:22 AM
chaoren edited edge metadata.

LGTM. Sorry it took this long.

source/Plugins/Process/Linux/NativeProcessLinux.cpp
3047 ↗(On Diff #24692)

This is so much nicer.

This revision is now accepted and ready to land.May 1 2015, 11:22 AM
This revision was automatically updated to reflect the committed changes.