Storing timestamps (TSCs) in a more efficient map at the decoded thread level to speed up TSC lookup, as well as reduce the amount of memory taken up by each instruction. Also introduced TSC range which keeps the current timestamp valid for all subsequent instructions until the next timestamp is emitted.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
let's better use the word TSC instead of timestamps, which is more accurate
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
---|---|---|
164 | receive const &pt_insn to avoid copies | |
166–167 | Let's just use an overload void AppendInstruction(const pt_insn &instruction, uint64_t tsc); | |
181–182 | This has to be an optional because it might not be present |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.cpp | ||
---|---|---|
94–96 | doing [] is O(logn), and we want to be faster than this. You can do the following which is O(1) auto it = m_instruction_timestamps.end() if (it != m_instruction_timestamps.begin()) { it--; if (it->second != tsc) { // this tsc is not the same! m_instruction_timestamps.insert(insn_idx, tsc); } else { // this tsc is the same, do nothing } } you can further optimize this by storing the last tsc that has been appended, that way you don't even need to create iterators | |
115–118 | this is wrong, you need to use upper_bound - 1, like this: if (m_instruction_timestamps.empty()) return None; auto it = m_instruction_timestamps.upper_bound(insn_idx); if (it == m_instruction_timestamps.begin()) return None; --it; return it->second; this will allow you to go to the largest index that is <= than insn_idx | |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
181–182 | better use insn_idx instead of index for all these variables | |
184–188 | I actually like this, let's improve it | |
lldb/source/Plugins/Trace/intel-pt/IntelPTDecoder.cpp | ||
153–154 ↗ | (On Diff #418660) | |
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
25–26 | here you need to set the correct value of m_current_tsc | |
44–45 | this will be O(logn). We can do better if m_current_tsc is the following little structure class DecodedThread { struct TscRange { size_t start_index; size_t end_index; size_t tsc; std::map<size_t, uint64_t>::iterator prev; std::map<size_t, uint64_t>::iterator next; }; } Optional<TscRange> m_current_tsc; Then you can ask the new method Optional<TscRange> DecodedThread::GetTSCRange(size_t insn_index) which will give you the entire range of the tsc that covers insn_index. With these numbers, you can do a comparison in this line to very quickly move from TSC to TSC only when needed. You can also have the method DecodedThread::GetNextTscRange(const TscRange& range) that computes in O(1) the next range, and you can similarly have GetPrevTscRange()`. The iterators will help you do that withing using lower_bound, which is O(1) | |
66–84 | you need to calculate the new tsc_range after moving m_pos | |
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.h | ||
45–46 | Optional<uint64_t> m_current_tsc; |
I'm proposing a new interface for the TscRange. Let me know if you have questions
lldb/source/Plugins/Trace/intel-pt/DecodedThread.cpp | ||
---|---|---|
95 | we need to update it because of the optional | |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
166–167 | Let's improve the documentation and let's use the word TSC more ubiquitously | |
184 | ||
185–193 | Let's add more logic to this object so that it handles as much as it can and we reduce the logic that was added to DecodedThread. We also don't need two iterators, just one is enough. Don't forget to add documentation to these new methods. Given this new definition for TscRange. The only method we need to add in DecodedThread is CalculateTscRange(size_t insn_index), and mention the documentation that this operation is O(logn) and should be used sparingly. | |
186–188 | ||
196–203 | Let's improve the documentation and also get rid of the struct keyword in return types. That's old style C | |
219–238 | don't start variable names with __, as may people think that those variables should be discarded. Let's just give it a proper name. Let's also use an Optional and let's add documentation to all the variables. |
Some calculations are wrong, but overall this is good. We are very close!
lldb/source/Plugins/Trace/intel-pt/DecodedThread.cpp | ||
---|---|---|
117–118 | now that I think of this, you can delete this, because if the map is empty, this function will return in line 117 | |
137–140 | undo these two lines | |
159 | decoded_thread instead of ref | |
161–162 | delete | |
164–168 | seeing ++ and -- is very hard to read. I also prefer not to modify the it variable for cleanness. Also doing end->first might crash the program. I'm writing here a correct version | |
173 | ||
178 | The comparison is not right. let's use <= in a specific order to make it easier to read | |
186 | As m_it is valid, doing the comparison m_it == m_decoded_thread.m_instruction_timestamps.end() will always return false. Remember that .end() will return a fake iterator that points to no value. Besides that, don't modify m_it. Let's better create a new iterator | |
190–192 | Similarly, this has to be improved. I also like to put --it statements in their own line to make it easier to read. | |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
182 | ||
187–188 | ||
189 | Move this class to the beginning of the public section of DecodedThread for easier discoverability | |
191 | ||
207–213 | let's better delete this. It adds some maintenance cost with little benefits | |
218 | This comment is hard to follow. Let's just delete it because it's a private constructor | |
223–227 | let's add another piece of information | |
225–228 | Let's just delete this, as we can get them directly from m_it without doing any operations | |
227 | ||
230 | Setting to llvm::None here is equivalent to doing it from all the constructors | |
233 | ||
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
45–52 | No need to do m_current_tsc = m_decoded_thread_sp->CalculateTscRange(m_pos); because its value has already been calculated in the constructor. We can simplify this as well | |
99–101 | are you using git clang-format? I'm curious why this line changed | |
101–107 | ||
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.h | ||
45 | /// Tsc range covering the current instruction. | |
lldb/source/Plugins/Trace/intel-pt/TraceIntelPT.cpp | ||
124 | Instead of doing *raw_size, better remove this number from CalculateApproximateMemoryUsage() |
Included requested changes, removed extra members
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
---|---|---|
45–52 | It is possible that when TraceCursorIntelPT is created the m_current_tsc is None, for example when just started the trace and tried to dump instructions... But then if a tsc is emitted later, this would cause it to remain None since we don't re-calculate it if it was initially None | |
99–101 | Yes I am. I think its because its longer than 80 chars. | |
101–107 | m_current_tsc is already checked at the beginning of this function |
almost there! Mostly cosmetic changes needed
lldb/source/Plugins/Trace/intel-pt/DecodedThread.cpp | ||
---|---|---|
94–98 | We need to handle a special case. It might happen that the first instruction is an error, which won't have a TSC, and the second instruction is an actual instruction, and from that point on you'll always have TSCs. For this case, we can assume that the TSC of the error instruction at the beginning of the trace is the same as the first valid TSC. | |
117–118 | delete these two lines. The rest of the code will work well without it | |
158–160 | ||
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
104 | +1 | |
140 | Let's be more verbose with the names to improve readability | |
142 | ||
148 | let's receive a reference here and then convert it to pointer, so that we minimize the number of places with pointers. Also, if later we decide to use a shared_ptr instead of a pointer, we can do it inside of the constructor without changing this line | |
150 | ||
181–182 | ||
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
65–84 | we can simplify this so that we only invoke CalculateTscRange once | |
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.h | ||
46 | rename it to m_tsc_range. The word current is very redundant in this case | |
lldb/source/Plugins/Trace/intel-pt/TraceIntelPT.cpp | ||
123–124 | Use doubles, as the average might not be a whole number |
Incorporate feedback and update tests
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
---|---|---|
65–84 | This is incorrect The converted code always returns 0. I've refactored it to have CalculateTscRange once but its a side-effect-y function and will need some future attention. |
one last nit and good to go
lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp | ||
---|---|---|
82 | don't use auto for simple types |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
---|---|---|
145 | nit: No need to friend the enclosing class since C++11 - https://en.cppreference.com/w/cpp/language/nested_types |
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
---|---|---|
145 | TIL! |
Dont use auto for simple types
lldb/source/Plugins/Trace/intel-pt/DecodedThread.h | ||
---|---|---|
145 | We need the friend because we are using a private constructor from outside, in DecodedThread::CalculateTscRange and a couple other places. The idea is to let only DecodedThread create TscRange. |
Don't forget to update the description of this diff and of the commit before pushing (you need to do both). Include the avg instruction size for a trace of at least 10k instructions as well :)
The difference in memory usage is appreciable with large number of instructions, as shown below
# Before (with current metrics, total memory does not include raw trace size) Raw trace size: 2048 KiB Total number of instructions: 900004 Total approximate memory usage: 56143.10 KiB Average memory usage per instruction: 63.87 bytes
# After Raw trace size: 2048 KiB Total number of instructions: 900004 Total approximate memory usage: 42187.69 KiB Average memory usage per instruction: 48.00 bytes
+1