This patch introduces a new thread backtrace command, "unique".
The command is based on "thread backtrace all", but will instead
find all threads which share matching call stacks and de-duplicate
the output, listing each call stack once along with all the threads
that share it. This is especially useful for apps that use thread/task
pools, where many idle threads sit waiting for work and cause excessive
duplicate output. I needed this behavior recently when debugging a core
with 700+ threads.
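For illustration, a session might look like this; the addresses, symbol names, and exact wording of the output here are made up, not lifted from the patch:

```
(lldb) thread backtrace unique
42 thread(s) share the following call stack:
  frame #0: 0x00007f2a61b4c6c0 libc.so.6`syscall + 28
  frame #1: 0x000000000040a8e2 myapp`WorkerPool::WaitForWork() + 50
  frame #2: 0x000000000040a931 myapp`WorkerThreadMain(void*) + 17
```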
The idea is great. I think it is a little confusing that you would do:
(lldb) thread backtrace all
to get all threads but
(lldb) thread unique-stacks
to backtrace the unique stacks. Wouldn't it be more logical to do:
(lldb) thread backtrace unique
It's really just another specification of the threads you are backtracing.
A few inline comments; we use '_' word separators for locals.
Inline comments on source/Commands/CommandObjectThread.cpp:
- Line 448: uniqueStacks -> unique_stacks
- Lines 488-489: We do local variables as word_other_word, not camel-case.
- Line 497: found_match
This iteration addresses Jim's feedback.
I've updated all of the code to match the surrounding style where
I missed it originally (branches/variable names).
I also moved the command under "thread backtrace unique".
This required refactoring the thread iterator command object.
I've tried to reuse as much code as possible in the thread iterator,
so that we actually use HandleOneThread to push the rendering of
the call stack down to the BackTrace command.
The bucketing has been moved directly into the command iterator
when the user passes the "unique" option.
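A rough, self-contained sketch of that flow; Thread, UniqueStack, and HandleOneThread here are simplified stand-ins for the real LLDB types and methods, with uint64_t standing in for lldb::addr_t:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Thread { uint32_t index_id; std::vector<uint64_t> frame_pcs; };

struct UniqueStack {
  std::vector<uint64_t> frame_pcs;   // the shared call stack
  std::vector<uint32_t> thread_ids;  // every thread that matched it
};

// Stand-in for the real HandleOneThread, which renders one backtrace.
static void HandleOneThread(const Thread &t) {
  std::printf("thread #%u: %zu frames\n", t.index_id, t.frame_pcs.size());
}

int main() {
  std::vector<Thread> threads = {
      {1, {0x10, 0x20}}, {2, {0x10, 0x20}}, {3, {0x30}}};

  // "unique" path: bucket during iteration (linear search, as in this
  // version of the patch), then render one representative per bucket.
  std::vector<UniqueStack> buckets;
  for (const Thread &t : threads) {
    bool found_match = false;
    for (UniqueStack &b : buckets) {
      if (b.frame_pcs == t.frame_pcs) {
        b.thread_ids.push_back(t.index_id);
        found_match = true;
        break;
      }
    }
    if (!found_match)
      buckets.push_back({t.frame_pcs, {t.index_id}});
  }
  for (const UniqueStack &b : buckets) {
    std::printf("%zu thread(s) share this call stack:\n", b.thread_ids.size());
    HandleOneThread({b.thread_ids.front(), b.frame_pcs}); // representative
  }
}
```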
Pavel's right, it would be good to add a test case. You could modify the test case in packages/Python/lldbsuite/test/functionalities/thread/num_threads/ to this end. Note that this test case used to sit mixed in with the directories in the thread directory, but I just moved it into its own directory, so update before doing this. You could just make a few more thread 3's and break at the lock.unlock line; then you should have a bunch of threads with the same stack, and your 'unique' listing would coalesce them (a stand-alone sketch of such an inferior follows below). If you do this, please add it as a separate test (make a new method 'test_unique' and redo the setup) so that the tests remain independent.
There's a "functionalities/thread/backtrace_all" test that looks tempting to modify, but it doesn't actually look that amenable to modification for your purposes. The TestBacktraceAll.py there does show an example of running a command-line command and looking for patterns in the result, which is what you'll need to add.
source/Commands/CommandObjectThread.cpp | ||
---|---|---|
324 | Maybe "to see threads grouped by unique call stacks"? As written it sounds like you're going to show me the really cool call stacks. |
Address Pavel/Jim/Greg's feedback with the addition of a new test.
I followed Jim's advice and extended the existing num_threads suite
so that we start more thread3's. The new test_unique_stacks test case
was added to verify that we correctly bucket the multiple thread3
call stacks executing across threads.
Thanks for the feedback thus far!
Thanks for writing the test. We just need to make sure it runs in a stable manner.
Inline comment on packages/Python/lldbsuite/test/functionalities/thread/num_threads/TestNumThreads.py:
- Line 91: I am afraid this is going to be very flaky. There's no guarantee that the threads will have had enough time to block in the lock call by the time you reach the breakpoint. The simplest solution would be to insert a sufficiently long (a couple of seconds) sleep in the inferior, but smarter solutions are possible as well. For example, we could iterate through all SBThreads in the process and manually issue a Continue on all of them which haven't hit our predefined breakpoints yet.

Inline comment on packages/Python/lldbsuite/test/functionalities/thread/num_threads/main.cpp:
- Lines 57-59: I'd find for (auto &t : thread_3s) t.join() more understandable than this.
Address Pavel's feedback; made the unit test more robust.
This update increases the robustness of the new test I added.
We ensure synchronization, and force the threads into the state
we want them to be in by manually stepping them into place.
Thanks for the effort. I found the logic of the test quite difficult to follow, with multiple breakpoints and notify_calls(). Instead of trying to point out each problem, I figured it would be easier to write my take on how the test could look: https://paste.ubuntu.com/24665770/. Besides making the logic flow nicer (I hope), it also avoids a couple of other potentially problematic spots in your test (reliance on the exact way we number threads, a potentially infinite loop in the thread.StepInstruction calls, etc.).
Let me know what you think.
We should probably store the stacks as lldb::addr_t values that are load addresses, for quicker comparisons and searches. The inline comments detail things more clearly.
Inline comments on source/Commands/CommandObjectThread.cpp:
- Line 76: Might be more efficient to store a std::stack<lldb::addr_t> where you store the load address of each frame. The comparisons would go quicker, not to mention the construction and destruction of the stack itself.
- Line 148: Add an operator function for UniqueStack that only compares the stack frames: inline bool operator<(const UniqueStack &lhs, const UniqueStack &rhs) { return lhs.m_stack_frames < rhs.m_stack_frames; } The comparison will be cheap when m_stack_frames is a std::stack of lldb::addr_t values. Then store the stacks in a std::set for efficient lookup later, instead of a linear search.
- Line 204: std::set<UniqueStack> &unique_stacks;
- Lines 220-221: const lldb::addr_t pc = frame_sp->GetStackID().GetPC(); stack_frames.push(pc);
- Line 226: Use the fast std::set::find() to search for the matching value.
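Put together, a compilable sketch of that suggestion (uint64_t standing in for lldb::addr_t, and the thread ID list made mutable because std::set elements are const):

```cpp
#include <cstdint>
#include <set>
#include <stack>
#include <vector>

class UniqueStack {
public:
  UniqueStack(std::stack<uint64_t> stack_frames, uint32_t thread_index_id)
      : m_stack_frames(std::move(stack_frames)),
        m_thread_index_ids({thread_index_id}) {}

  // Const so it can be called on a set element; only touches the
  // mutable thread list, never the frames the set is ordered on.
  void AddThread(uint32_t thread_index_id) const {
    m_thread_index_ids.push_back(thread_index_id);
  }

  // Order on the frames alone, as suggested above.
  friend bool operator<(const UniqueStack &lhs, const UniqueStack &rhs) {
    return lhs.m_stack_frames < rhs.m_stack_frames;
  }

private:
  std::stack<uint64_t> m_stack_frames;              // frame PC load addresses
  mutable std::vector<uint32_t> m_thread_index_ids; // threads sharing the stack
};

int main() {
  std::set<UniqueStack> unique_stacks;
  std::stack<uint64_t> frames;
  frames.push(0x1000);

  // O(log n) lookup instead of a linear search.
  UniqueStack candidate(frames, /*thread_index_id=*/1);
  auto match = unique_stacks.find(candidate);
  if (match != unique_stacks.end())
    match->AddThread(1);
  else
    unique_stacks.insert(candidate);
}
```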
Address Pavel and Greg's feedback on Diff 100365.
Pavel: I took your suggestions to make the test case more readable;
I really appreciate the guidance. I did have to tweak some of the
functionality to make the test case pass reliably, as there were
still some races possible. I also saw that SBThread.Resume() seems
to occasionally result in a StopReason of eStopReasonNone, so I worked
around that by only including in the expected output those threads
whose Resume resulted in making it to our breakpoint. I have verified
the test consistently passes by executing it on repeat 100 times.
Greg: Thanks for the suggestions for using std::set and lldb::addr_t.
The code is a lot cleaner now and, as you mentioned, should perform faster.
I didn't notice any significant speed-up with my 700-thread dump, however.
Much cleaner. After seeing how we must construct a UniqueStack just so we can search for it, an even cleaner solution would be to make unique_stacks into:
std::map<std::vector<lldb::addr_t>, std::vector<uint32_t>>
Clicked submit too early with the last comments. After seeing how we previously needed to create a temp UniqueStack just to do the lookup, and also how UniqueStack makes its thread index ID list mutable, a better solution is to use a std::map as shown in the inline comments.
One issue that we will need to fix along with this is disabling the display of variable values in the stack backtrace. The default frame format is:
(lldb) settings show frame-format
frame-format (format-string) = "frame #${frame.index}: ${frame.pc}{ ${module.file.basename}{`${function.name-with-args}{${frame.no-debug}${function.pc-offset}}}}{ at ${line.file.basename}:${line.number}}{${function.is-optimized} [opt]}\n"
The main problem is that we have "${function.name-with-args}" in the format string. We must detect that we are trying to unique stacks and not try to print any thread variables; otherwise we will not be showing the correct information to the user. This would mean piping the fact that we are trying to unique stacks into the FormatEntity::Format() function in FormatEntity.h/.cpp and simply not showing the variables in that case.
Inline comments on source/Commands/CommandObjectThread.cpp:
- Line 50: After seeing how we have to create a temp UniqueStack so we can search for it in our set, we can do this a bit cleaner. We can get rid of this class if we use a std::map that maps from the unique stack load addresses to the thread ID list: std::map<std::vector<lldb::addr_t>, std::vector<uint32_t>> unique_stacks; More changes below...
- Line 148: Change to: std::map<std::vector<lldb::addr_t>, std::vector<uint32_t>> unique_stacks;
- Line 159: Change to: for (const auto &stacks_tids : unique_stacks) {
- Line 161: Change to: const std::vector<uint32_t> &thread_index_ids = stacks_tids.second;
- Line 169: Change to: // Since all threads have the same call stack we can just pick the first one. uint32_t representative_thread_id = thread_index_ids.front();
- Line 204: std::map<std::vector<lldb::addr_t>, std::vector<uint32_t>> &unique_stacks
- Line 225: Remove this.
- Line 228: auto matching_stack = unique_stacks.find(stack_frames);
- Lines 229-233: Can probably replace the entire if/else with: unique_stacks[stack_frames].push_back(thread_index_id);
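The whole map-based flow, condensed into a compilable sketch with uint64_t standing in for lldb::addr_t:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
  std::map<std::vector<uint64_t>, std::vector<uint32_t>> unique_stacks;

  // Per thread: collect frame PCs, then bucket in one line. operator[]
  // default-constructs the empty thread list on first sight of a stack.
  std::vector<uint64_t> stack_frames = {0x1000, 0x2000};
  uint32_t thread_index_id = 1;
  unique_stacks[stack_frames].push_back(thread_index_id);

  for (const auto &stacks_tids : unique_stacks) {
    const std::vector<uint32_t> &thread_index_ids = stacks_tids.second;
    // Since all threads share the same call stack, just pick the first one.
    uint32_t representative_thread_id = thread_index_ids.front();
    std::printf("%zu thread(s), shown via thread #%u\n",
                thread_index_ids.size(), representative_thread_id);
  }
}
```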
The other option for fixing the problem of showing the wrong variables in the backtraces would be to make up a new frame-format string that is used for uniqued stack frames, and use that format when showing them:
(lldb) settings show frame-format-unique
frame-format (format-string) = "frame #${frame.index}: ${frame.pc}{ ${module.file.basename}{`${function.name}{${frame.no-debug}${function.pc-offset}}}}{ at ${line.file.basename}:${line.number}}{${function.is-optimized} [opt]}\n"
The only difference in the format string is that we use ${function.name} instead of ${function.name-with-args}.
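Rendered against a made-up frame, the difference would look roughly like this (illustrative, not actual lldb output):

```
name-with-args: frame #2: 0x0000000100000f10 a.out`compute(job=0x0000000100400020) at main.cpp:42
name only:      frame #2: 0x0000000100000f10 a.out`compute at main.cpp:42
```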
Zach made some comments via e-mail:
Couple of things:
- Unless you really really want iteration over this map to be in some deterministic order, better to use unordered_map. And if you do want it to be in some deterministic order, you should provide a comparison function, as the default one is probably not what you want.
True. The previous solution had the same ordering as this map I proposed.
- std::map<std::vector<lldb::addr_t>, std::vector<uint32_t>> is pretty horrible to read, and offers no insight into what the keys and values are. At the very least, add some typedefs like:
typedef std::vector<lldb::addr_t> StackID;
typedef std::vector<tid_t> ThreadIDList;
std::unordered_map<StackID, ThreadIDList> TheMap;
Agreed.
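One hedged caveat about those typedefs: std::unordered_map has no built-in hash for a std::vector key, so a small hash functor (illustrative, not from the patch) would be needed as well:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

typedef std::vector<uint64_t> StackID;  // uint64_t standing in for lldb::addr_t
typedef std::vector<uint32_t> ThreadIDList;

struct StackIDHash {
  size_t operator()(const StackID &stack) const {
    size_t seed = 0;
    for (uint64_t pc : stack) // boost-style hash combine over the frame PCs
      seed ^= std::hash<uint64_t>()(pc) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
  }
};

using UniqueStackMap = std::unordered_map<StackID, ThreadIDList, StackIDHash>;

int main() {
  UniqueStackMap unique_stacks;
  unique_stacks[{0x1000, 0x2000}].push_back(1);
}
```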
- If performance is a concern here (700+ threads seems like it could exhibit slowdown when uniquing stacks), the best performance would be to use DenseSet<std::unique_ptr<StackInfo>>, where StackInfo contains the address list and the thread ID list as well as a pre-computed hash value, and then use a custom implementation of DenseMapInfo that just returns the hash value directly. See llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp for an example.
Not sure I would go that far. When you include both things in the type (StackInfo here), then you must make one up in order to search for it, right? We are mainly trying to create a collection of stacks that we can use to point to the thread IDs. I agree it doesn't need to be sorted, so std::unordered_map would work just fine.
The more I think about it, the previous approach was probably fine with the std::set and the UniqueStack class. Let's just stick with that for now, but we still need to fix the display so it doesn't show variable values and claim that many threads have the same variable values when showing ${function.name-with-args}.
Fix bug in "unique" backtrace output that Greg pointed out.
Introduced a new format for unique frames and plumbed that
through the stacks to be able to toggle between them both
depending on the calling arguments.
Thanks. The fact that we are not able to rely on the operation of Resume in this case sounds like a bug. Obviously we can't condition the acceptance of this patch on fixing that issue, but we should at least track it. Can you create a bug on bugs.llvm.org and reference it in your workaround? BTW, what's the platform you are testing this on?
As Jim pointed out, the Resume command does not do what I thought it does, so having it in the test makes no sense. One option would be to just leave out the calls to Resume (as they don't do anything anyway) and go with your workaround of testing the threads that happened to stop at the breakpoint on the first run. The other option would be to replace the Resume call with something that does what we intended (one such candidate would be thread.StepOut). I prefer the second one as it gives a more deterministic output, but I think both of them are good enough.
Committed as r305197. I needed to tweak the test logic a bit, as not even thread.StepOut() did exactly what we needed there, because it also resumed all the other threads even though that was not necessary. Take a look at the resulting commit if you want to see the differences. (Spoiler: the magic command that worked was thread continue #id.)