This is an archive of the discontinued LLVM Phabricator instance.

[trace] Create a top level ThreadTrace class
AbandonedPublic

Authored by wallace on Jun 2 2021, 9:10 PM.

Details

Reviewers
clayborg
vsk
Summary

We want to start doing some analysis at the instruction level, including stack trace reconstruction, trace summarization, and security analysis. For this we need a basic and efficient way to access instruction information, without the symbolication bits, for quick analysis. This class should live at the Trace.h level and abstract away plug-in-dependent information.

Diff Detail

Event Timeline

wallace created this revision.Jun 2 2021, 9:10 PM
wallace requested review of this revision.Jun 2 2021, 9:10 PM
Herald added a project: Restricted Project.Jun 2 2021, 9:10 PM
wallace retitled this revision from [trace] Create a global instruction class to [trace] Create a top-level instruction class.Jun 2 2021, 9:25 PM
clayborg requested changes to this revision.Jun 7 2021, 2:56 PM

It would be nice to try not to encode errors into the TraceInstruction class and instead deal with any errors at decode time.

lldb/include/lldb/Target/Trace.h
32–33

Can't we just avoid including any errors in this TraceInstruction class, and have the function that generates these return a:

llvm::Expected<TraceInstruction>

?
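For illustration, a minimal sketch of that suggestion, assuming a hypothetical DecodeInstruction entry point and a bare-bones TraceInstruction; neither is the actual code in this patch:

#include "lldb/lldb-types.h"
#include "llvm/Support/Error.h"

// Illustrative stand-in for the class under review; the field is an assumption.
struct TraceInstruction {
  lldb::addr_t load_address;
};

// Hypothetical decoder entry point returning either an instruction or an
// error, as suggested above. The name and signature are not from the patch.
llvm::Expected<TraceInstruction> DecodeInstruction(lldb::addr_t load_address,
                                                   bool memory_available) {
  if (!memory_available)
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "no memory mapped at this address");
  return TraceInstruction{load_address};
}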

76

It would be nice to avoid trying to encode errors into this class.

lldb/include/lldb/lldb-enumerations.h
798
This revision now requires changes to proceed.Jun 7 2021, 2:56 PM
wallace added inline comments.Jun 7 2021, 5:14 PM
lldb/include/lldb/Target/Trace.h
32–33

Not really, because these errors don't mean that the entire decoding failed; they mean that some specific chunks of the trace could not be decoded.
For example, you can have this decoded trace:

a.out`(none)
    [ 4] 0x0000000000400510    pushq  0x200af2(%rip)            ; _GLOBAL_OFFSET_TABLE_ + 8
    [ 5] 0x0000000000400516    jmpq   *0x200af4(%rip)           ; _GLOBAL_OFFSET_TABLE_ + 16
    [ 6] 0x00007ffff7df1950    error: no memory mapped at this address
    ...missing instructions
  a.out`main + 20 at main.cpp:10
    [ 7] 0x0000000000400674    movl   %eax, -0xc(%rbp)

Instruction 6 is an error. We have an associated address, but the decoder couldn't decode the instruction there, so it skipped the trace until the next synchronization point, which is instruction 7 in this case.

Not all decoder errors have an associated address, though, only some. So I prefer to have any errors represented as strings for now.

Another option is to create a new class that contains a sequential list of valid instructions terminated by zero or more contiguous errors. But that seems like too much, as there's not a lot we can do when we see errors: we either ignore them or stop the current iteration.
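For illustration, a minimal sketch of the string-based error representation argued for here; the struct and member names are assumptions, not the exact layout in the diff:

#include <cstdint>

// Illustrative only: one entry of a decoded trace that is either a decoded
// instruction or a decoding error, with an optional associated address.
struct TraceInstructionEntry {
  uint64_t load_address; // may be an invalid-address sentinel when unknown
  const char *error;     // nullptr when the instruction decoded correctly

  bool IsError() const { return error != nullptr; }
};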

wallace updated this revision to Diff 350495.Jun 7 2021, 10:28 PM

address comments

wallace marked an inline comment as done.Jun 7 2021, 10:28 PM
wallace added inline comments.Jun 7 2021, 10:33 PM
lldb/include/lldb/Target/Trace.h
32–33

It should be possible. I wonder about memory consumption.
llvm::Expected has these members:

union {
    AlignedCharArrayUnion<storage_type> TStorage;
    AlignedCharArrayUnion<error_type> ErrorStorage;
  };
  bool HasError : 1;
#if LLVM_ENABLE_ABI_BREAKING_CHECKS
  bool Unchecked : 1;
#endif

That means it would end up using 3 * 8 bytes in total, right?
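A back-of-the-envelope check of that estimate, assuming a 16-byte instruction payload and single-pointer error storage; the real numbers depend on the actual member layout and padding:

// Assumed sizes, not measured against the actual patch:
//   TStorage     (instruction payload)   ~= 16 bytes (e.g. address + metadata)
//   ErrorStorage (one pointer)            =  8 bytes
//   union                                 = max(16, 8) = 16 bytes
//   HasError bit + alignment padding      =  8 bytes
//   total                                 = 24 bytes = 3 * 8
static_assert(sizeof(void *) == 8, "this estimate assumes a 64-bit target");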

32–33

Disregard this comment; I understood your remark incorrectly.

clayborg requested changes to this revision.Jun 7 2021, 10:56 PM

One quick change: either a documentation update for the error string, or a switch to ConstString... Probably best to just do the error string docs, as fancy C++ union tricks might not compile for everyone on all C++ compilers.

lldb/include/lldb/Target/Trace.h
76

It would be great if this could actually be a ConstString object. I haven't played with C++ unions with actual objects in them, but I believe it is possible. If you can't, add a comment stating that this string comes from a ConstString and doesn't need to be freed.

This revision now requires changes to proceed.Jun 7 2021, 10:56 PM
wallace updated this revision to Diff 350652.Jun 8 2021, 10:26 AM

I tried to use a ConstString instead of a const char *, but that's not trivial because
the compiler wants a safe destructor all the way down, and handling that is non-standard
across compilers before C++17, which introduced the std::variant type.

So I'm just adding a comment as suggested.
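For context, a small generic example of the C++ issue described above: a union member with a non-trivial destructor forces hand-written lifetime management, which std::variant (C++17) generates automatically. This is an illustration, not code from the patch:

#include <cstdint>
#include <string>
#include <variant>

// Pre-C++17 style: a union member with a non-trivial destructor (std::string)
// forces hand-written constructors and a destructor.
struct ManualResult {
  bool has_error;
  union {
    uint64_t load_address;
    std::string error;
  };
  ManualResult(uint64_t addr) : has_error(false), load_address(addr) {}
  ManualResult(std::string err) : has_error(true), error(std::move(err)) {}
  ~ManualResult() {
    if (has_error)
      error.~basic_string();
  }
};

// C++17: std::variant generates all of that bookkeeping automatically.
using Result = std::variant<uint64_t, std::string>;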

vsk requested changes to this revision.Jun 8 2021, 11:09 AM
vsk added a subscriber: vsk.

I'm quite concerned about the design decision to represent a trace as a vector of TraceInstruction. This bakes considerations that appear to be specific to Facebook's usage of IntelPT way too deep into lldb's APIs, and isn't workable for our use cases.

At Apple, we use lldb to navigate instruction traces that contain billions of instructions. Allocating 16 bytes per instruction simply will not scale for our workflows. We require the in-memory overhead to be approximately 1-2 bits per instruction. I'm not familiar with how large IntelPT traces can get, but presumably (for long enough traces) you will encounter the same scaling problem.

What alternatives to the vector<TraceInstruction> representation have you considered? One idea might be to implement your program analyses on top of a generic interface for navigating forward/backward through a trace and extracting info about the instruction via a set of API calls; this leaves the actual in-memory representation of "TraceInstruction" unspecified.
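A rough sketch of the kind of cursor-style interface being suggested; TraceNavigator, GetInstTypeAtCursor, and the other names are illustrative, not an existing lldb API:

#include <cstdint>

enum class InstructionType { Other, Call, Return, Branch };

// Hypothetical cursor-based navigation interface: a cursor is just an integer
// position in the trace, and the in-memory representation stays private to
// each plug-in.
class TraceNavigator {
public:
  virtual ~TraceNavigator() = default;

  // Move the cursor; return false when the beginning/end of the trace is hit.
  virtual bool Next(uint64_t &cursor) = 0;
  virtual bool Prev(uint64_t &cursor) = 0;

  // Queries about the instruction the cursor currently points at.
  virtual uint64_t GetLoadAddressAtCursor(uint64_t cursor) = 0;
  virtual InstructionType GetInstTypeAtCursor(uint64_t cursor) = 0;
};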

This revision now requires changes to proceed.Jun 8 2021, 11:09 AM
wallace requested review of this revision.Jun 8 2021, 2:15 PM

At Apple, we use lldb to navigate instruction traces that contain billions of instructions. Allocating 16 bytes per instruction simply will not scale for our workflows. We require the in-memory overhead to be approximately 1-2 bits per instruction. I'm not familiar with how large IntelPT traces can get, but presumably (for long enough traces) you will encounter the same scaling problem.

I have thought about that and I'm taking it into account. We implemented a TraverseInstructions API in Trace.h that receives a callback (see https://lldb.llvm.org/cpp_reference/classlldb__private_1_1Trace.html#a02e346b117c15cef1cb0568c343f7e1c). The idea is that the intel pt plugin can decode the instructions on the fly as part of the iteration.
In terms of memory, there are two kinds of trace objects: one is the raw undecoded instruction trace, which uses 1 bit per instruction on average. That one is decoded in https://github.com/llvm/llvm-project/blob/main/lldb/source/Plugins/Trace/intel-pt/IntelPTDecoder.cpp#L163 into TraceInstructions, which effectively use 16 bytes for the duration of their lives. Right now we haven't implemented lazy decoding; we are decoding the entire trace. But that's just because so far we have been working with small traces. As we progress in this work, we'll start working with larger traces and we'll have to implement lazy decoding, for which the TraverseInstructions API will come in handy.
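A self-contained sketch of what a callback-based traversal can look like; the real Trace.h signature differs, and the vector-of-addresses stand-in below is purely for illustration:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class TraceDirection { Forwards, Backwards };

// Toy stand-in for the plug-in side: here the "trace" is just a vector of load
// addresses so the sketch is self-contained. The callback returns false to
// stop the iteration, mirroring the idea of decoding on the fly.
void TraverseInstructions(
    const std::vector<uint64_t> &trace, size_t start, TraceDirection dir,
    const std::function<bool(size_t index, uint64_t load_address)> &callback) {
  if (start >= trace.size())
    return;
  if (dir == TraceDirection::Forwards) {
    for (size_t i = start; i < trace.size(); ++i)
      if (!callback(i, trace[i]))
        return;
  } else {
    for (size_t i = start + 1; i-- > 0;)
      if (!callback(i, trace[i]))
        return;
  }
}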

What alternatives to the vector<TraceInstruction> representation have you considered? One idea might be to implement your program analyses on top of a generic interface for navigating forward/backward through a trace and extracting info about the instruction via a set of API calls; this leaves the actual in-memory representation of "TraceInstruction" unspecified.

That's exactly what I described as the TraverseInstructions method :) The intel-pt plugin or any other trace plugin could have an internal representation for traces, but we still need a generic TraceInstruction object for consumers that want to be agnostic of the trace technology.

vsk added a comment.Jun 8 2021, 2:52 PM

Right now we haven't implemented lazy decoding; we are decoding the entire trace. But that's just because so far we have been working with small traces. As we progress in this work, we'll start working with larger traces and we'll have to implement lazy decoding, for which the TraverseInstructions API will come in handy.

This approach - of prototyping trace analyses on a vector<TraceInstruction> representation and then rewriting them later when scaling issues arise - doesn't sound good to me. Even for something simple, like finding a backtrace, the window of instructions needed for analysis can easily grow to a point where the full vector<TraceInstruction> can't be maintained in memory.

To clarify: I think that's a perfectly reasonable way to prototype trace analyses, but I don't think that kind of prototyping should be done at the SB API level, as there are extra bincompat & portability concerns here.

What alternatives to the vector<TraceInstruction> representation have you considered? One idea might be to implement your program analyses on top of a generic interface for navigating forward/backward through a trace and extracting info about the instruction via a set of API calls; this leaves the actual in-memory representation of "TraceInstruction" unspecified.

That's exactly what I described as the TraverseInstructions method :) The intel-pt plugin or any other trace plugin could have an internal representation for traces, but we still need a generic TraceInstruction object for consumers that want to be agnostic of the trace technology.

It's not at all clear to me that lldb needs to surface a generic TraceInstruction object, with metadata like m_byte_size, m_inst_type, etc. stored inline. It seems more flexible to instead vend a cursor (just an integer) that indicates a position in the trace; a separate set of APIs could answer queries given a cursor, e.g. 'getInstTypeAtCursor(u64 cursor)'.

This approach - of prototyping trace analyses on a vector<TraceInstruction> representation and then rewriting them later when scaling issues arise - doesn't sound good to me. Even for something simple, like finding a backtrace, the window of instructions needed for analysis can easily grow to a point where the full vector<TraceInstruction> can't be maintained in memory.

It's worth noting that the vector<TraceInstruction> is internal to the Intel PT plugin. It's not exposed at the Trace.h interface level, so consumers can't make assumptions about how the data is stored. We will improve the intel-pt plugin itself as we make progress on this, but we are ensuring that the interfaces are correct for the future.

To clarify: I think that's a perfectly reasonable way to prototype trace analyses, but I don't think that kind of prototyping should be done at the SB API level, as there are extra bincompat & portability concerns here.

I know, and I want to stress that we are not exposing any of this instruction information at the SB API level, and probably won't for a while, i.e. I don't plan on creating an SBTraceInstruction class, as there's too much boilerplate needed to expose an SBInstruction object. For now, we want to have some code that processes instructions and lives at the Trace.h level, outside of the intel-pt plugin, as we want to make it generic enough to be used by different tracing technologies.

It's not at all clear to me that lldb needs to surface a generic TraceInstruction object, with metadata like m_byte_size, m_inst_type, etc. stored inline. It seems more flexible to instead vend a cursor (just an integer) that indicates a position in the trace; a separate set of APIs could answer queries given a cursor, e.g. 'getInstTypeAtCursor(u64 cursor)'.

There's an advantage to the current approach. Decoding intel-pt instructions is very slow, and we can produce correct indices only if we decode the raw trace sequentially, either backwards (for negative indices) or forwards (for positive indices). If we start from the middle, we don't know which instruction we are at. If we allow the user to query properties of arbitrary indices, we might need to decode a big prefix or suffix of the trace until we get to that index. It seems safer to me to expose the properties of these instructions as soon as they are decoded, so that we don't need any further decoding if the user invokes a query method with a possibly wrong index.

vsk added a comment.Jun 8 2021, 5:07 PM

This approach - of prototyping trace analyses on a vector<TraceInstruction> representation and then rewriting them later when scaling issues arise - doesn't sound good to me. Even for something simple, like finding a backtrace, the window of instructions needed for analysis can easily grow to a point where the full vector<TraceInstruction> can't be maintained in memory.

It's worth noting that the vector<TraceInstruction> is internal to the Intel PT plugin. It's not exposed at the Trace.h interface level, so consumers can't make assumptions about how the data is stored. We will improve the intel-pt plugin itself as we make progress on this, but we are ensuring that the interfaces are correct for the future.

This doesn't quite alleviate my concern. We'd like for the generic trace infrastructure (outside of the PT plugin) to support different tracing technologies as well. It's not clear that the TraceInstruction concept can do that.

Two questions that have remained unanswered here: (1) how would TraceInstruction be used by non-PT plugins? (2) how do we guarantee the generic analyses you're planning to write will scale?

To clarify: I think that's a perfectly reasonable way to prototype trace analyses, but I don't think that kind of prototyping should be done at the SB API level, as there are extra bincompat & portability concerns here.

I know, and I want to stress that we are not exposing any of this instruction information at the SB API level, and probably won't for a while, i.e. I don't plan on creating an SBTraceInstruction class, as there's too much boilerplate needed to expose an SBInstruction object. For now, we want to have some code that processes instructions and lives at the Trace.h level, outside of the intel-pt plugin, as we want to make it generic enough to be used by different tracing technologies.

lldb's SB interfaces have always been relatively thin wrappers around classes in include/lldb (SBTarget -> Target, SBProcess -> Process, etc.). So it's really worth spending the extra time to get the generic interfaces right, even if they're not SB API yet, since that's ultimately the direction in which we'll want to go.

It's not at all clear to me that lldb needs to surface a generic TraceInstruction object, with metadata like m_byte_size, m_inst_type, etc. stored inline. It seems more flexible to instead vend a cursor (just an integer) that indicates a position in the trace; a separate set of APIs could answer queries given a cursor, e.g. 'getInstTypeAtCursor(u64 cursor)'.

There's an advantage to the current approach. Decoding intel-pt instructions is very slow, and we can produce correct indices only if we decode the raw trace sequentially, either backwards (for negative indices) or forwards (for positive indices). If we start from the middle, we don't know which instruction we are at. If we allow the user to query properties of arbitrary indices, we might need to decode a big prefix or suffix of the trace until we get to that index. It seems safer to me to expose the properties of these instructions as soon as they are decoded, so that we don't need any further decoding if the user invokes a query method with a possibly wrong index.

I don't see how the fact that PT decoding is slow necessitates the TraceInstruction representation. The issues you're ascribing to a cursor representation don't seem inevitable to me, or all that different from the issues you'd run into when generating a large number of TraceInstructions for a subset of a trace (it's also not clear how this subset would be specified).

I've been thinking about what you said and I'm having second thoughts on my implementation. I'll share more context:

  • In the short term I want to work on reverse debugging and reconstruction of stack traces. For that I'll need to know the instruction type of each instruction in the trace, which will be used as part of some heuristics to identify calls and returns between functions.
  • A future application that we plan to work on is adding profiling information to the instructions.
  • Right now the intel-pt plugin stores the decoded instructions in a vector, which works for small traces but won't for gigantic traces. I imagine that I'll end up removing that vector and making the TraverseInstructions API decode instructions in place, keeping one instruction in memory at a time within the intel-pt plugin for a given traversal. For that I'll need accessors that can provide information about the current instruction. As there could be two or more concurrent traversals happening at the same time (I'm trying to be generic here), it might make sense to create an abstract class TraceInstruction that each plug-in can extend to implement its getters.

I'm thinking about something like this

class TraceInstruction {
public:
  virtual ~TraceInstruction() = default;
  virtual lldb::addr_t GetLoadAddress() = 0;
  virtual TraceInstructionType GetInstructionType() = 0;
  virtual uint64_t GetTimestamp() = 0;
  // ... anything that can appear in the future
};

and have no members, leaving it to each plug-in to decide which of those methods to implement and how.

What do you think of this? I think this incorporates your feedback.

vsk added a comment.Jun 9 2021, 3:34 PM

I've been thinking about what you said and I'm having second thoughts on my implementation. I'll share more context:

  • In the short term I want to work on reverse debugging and reconstruction of stack traces. For that I'll need to know the instruction type of each instruction in the trace, which will be used as part of some heuristics to identify calls and returns between functions.
  • A future application that we plan to work on is adding profiling information to the instructions.
  • Right now the intel-pt plugin stores the decoded instructions in a vector, which works for small traces but won't for gigantic traces. I imagine that I'll end up removing that vector and making the TraverseInstructions API decode instructions in place, keeping one instruction in memory at a time within the intel-pt plugin for a given traversal. For that I'll need accessors that can provide information about the current instruction. As there could be two or more concurrent traversals happening at the same time (I'm trying to be generic here), it might make sense to create an abstract class TraceInstruction that each plug-in can extend to implement its getters.

I'm thinking about something like this

class TraceInstruction {
public:
  virtual ~TraceInstruction() = default;
  virtual lldb::addr_t GetLoadAddress() = 0;
  virtual TraceInstructionType GetInstructionType() = 0;
  virtual uint64_t GetTimestamp() = 0;
  // ... anything that can appear in the future
};

and have no members, leaving it to each plug-in to decide which of those methods to implement and how.

What do you think of this? I think this incorporates your feedback.

Thanks for the context.

I'm not convinced that retaining only 1 decoded instruction in memory at a time (under a TraverseInstructions callback) will be sufficient for the use cases you've outlined. Let's say the user sets a few breakpoints, then repeatedly does "rev-continue" followed by "bt". If lldb only holds a max of 1 decoded instruction in memory, it would need to repeatedly decode parts of the trace to evaluate those user commands, which would be slow as you pointed out earlier. OTOH there's not enough memory to retain all of the decoded instructions.

It seems like we need a way for generic trace analyses to request _batches_ of decoded trace data at a time, lazily demanding more from the trace plugin only if the analysis requires it to proceed, while freeing any stale decoded data it's not using.

Moreover, there doesn't seem to be a hard requirement that each trace plugin specialize a TraceInstruction class. For example, to support that breakpoint/rev-continue/bt workflow, you really only need a list of the places where _control flow changed_. So the generic layer could request that data from a trace plugin ('plugin->decodeControlFlowChanges(startPos, stopPos)'), then query more specific APIs that answer questions about how many control flow changes there were, and what the inst type + load address was at each control flow change point. Each plugin implementation can pick the most efficient way to implement those narrower APIs, instead of trying to cram all potentially useful information into a single TraceInst class.
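A rough sketch of the narrower, batch-oriented plug-in interface described here; DecodeControlFlowChanges and the surrounding types are hypothetical, not an existing lldb API:

#include <cstddef>
#include <cstdint>
#include <vector>

enum class ControlFlowKind { Call, Return, Branch };

// One control flow change event; only what a backtrace reconstruction needs.
struct ControlFlowChange {
  uint64_t load_address;
  ControlFlowKind kind;
};

// Hypothetical plug-in interface: the generic layer asks for a batch of
// control flow changes between two trace positions, and each plug-in decides
// the most efficient way to produce them.
class TracePlugin {
public:
  virtual ~TracePlugin() = default;
  virtual std::vector<ControlFlowChange>
  DecodeControlFlowChanges(size_t start_pos, size_t stop_pos) = 0;
};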

The reason why I believe this to be such a crucial part of the design is because of the scaling problem I mentioned earlier. A single 1.5GHz core running at IPC=2 yields 3*10^9 instructions/second, so we hit "gigantic" all too quickly!

lldb/source/Plugins/Trace/intel-pt/IntelPTDecoder.cpp
104

Are all of these instruction types both necessary and generic? E.g. does having FarCall in addition to Call assist with backtrace reconstruction, and if so, what does FarCall mean precisely?

Btw, thanks for the conversation. This has been really helpful.

I'm not convinced that retaining only 1 decoded instruction in memory at a time (under a TraverseInstructions callback) will be sufficient for the use cases you've outlined. Let's say the user sets a few breakpoints, then repeatedly does "rev-continue" followed by "bt". If lldb only holds a max of 1 decoded instruction in memory, it would need to repeatedly decode parts of the trace to evaluate those user commands, which would be slow as you pointed out earlier. OTOH there's not enough memory to retain all of the decoded instructions.

It seems like we need a way for generic trace analyses to request _batches_ of decoded trace data at a time, lazily demanding more from the trace plugin only if the analysis requires it to proceed, while freeing any stale decoded data it's not using.

Moreover, there doesn't seem to be a hard requirement that each trace plugin specialize a TraceInstruction class. For example, to support that breakpoint/rev-continue/bt workflow, you really only need a list of the places where _control flow changed_. So the generic layer could request that data from a trace plugin ('plugin->decodeControlFlowChanges(startPos, stopPos)'), then query more specific APIs that answer questions about how many control flow changes there were, and what the inst type + load address was at each control flow change point. Each plugin implementation can pick the most efficient way to implement those narrower APIs, instead of trying to cram all potentially useful information into a single TraceInst class.

The reason why I believe this to be such a crucial part of the design is because of the scaling problem I mentioned earlier. A single 1.5GHz core running at IPC=2 yields 3*10^9 instructions/second, so we hit "gigantic" all too quickly!

Yes, I agree with everything you said. It will really make the implementation simpler and more generic if we have these accessors based on indices instead of a struct/class, and we let each plug-in manage its data and its cache in its own way.
decodeControlFlowChanges is definitely a good idea, and other similar interfaces could be created following the same approach.

wallace updated this revision to Diff 351329.EditedJun 10 2021, 6:32 PM

Following @vsk 's feedback, I made the following changes:

  • Create a ThreadTrace class that contains the trace for a specific thread and offers the TraverseInstructions method and other getters like GetInstructionType and GetInstructionSize (see the rough sketch after this list). This effectively moves some code from Trace.h to this new class, which has a few advantages:
    • Trace.h now has the method GetThreadTrace(thread), which can return an Error if the thread couldn't be decoded or wasn't traced. This is a safer way to know whether things are correctly set up for a given thread.
    • It's easier to expose an SB API for instructions and their data storage with a corresponding SBThreadTrace class. Having only SBTrace is not enough for that.
    • A ThreadTrace can potentially outlive a stop id in the case of a live process, which could be useful if the user wants to compare traces at two different points in the execution of a process.
  • I'm keeping the vector<IntelPTInstruction> for now as an internal representation, but there's now more freedom to change it because the top-level API in ThreadTrace makes no assumptions about the data, except that it's indexed.
  • I simplified the TraceInstructionType enum. It was more complex than needed.
  • I removed some deprecated code and improved some comments.
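A minimal sketch of what such a ThreadTrace interface could look like, based only on the description above; the signatures, the callback shape, and the Expected-based GetThreadTrace are assumptions:

#include "lldb/lldb-types.h"
#include <cstddef>
#include <functional>

// Illustrative only; the real signatures in the diff may differ.
class ThreadTrace {
public:
  virtual ~ThreadTrace() = default;

  // Iterate over the decoded instructions; the plug-in decides how the
  // underlying data is stored and decoded.
  virtual void TraverseInstructions(
      size_t start_index,
      std::function<bool(size_t index, lldb::addr_t load_address)> callback) = 0;

  // Per-instruction getters such as GetInstructionSize / GetInstructionType.
  virtual size_t GetInstructionSize(size_t index) = 0;
};

// Assumed accessor on Trace (not shown here): something along the lines of
//   llvm::Expected<ThreadTrace &> Trace::GetThreadTrace(Thread &thread);
// returning an error if the thread wasn't traced or couldn't be decoded.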

As a side note, I was chatting with some teammates and we'll explore storing decoded intel pt traces on disk, which later lldb will mmap and use as its storage. The accessors exposed in this diff are generic enough to handle that case as well.

wallace retitled this revision from [trace] Create a top-level instruction class to [trace] Create a top level ThreadTrace class.Jun 10 2021, 6:32 PM
wallace edited the summary of this revision. (Show Details)
wallace updated this revision to Diff 351344.Jun 10 2021, 11:18 PM
wallace edited the summary of this revision. (Show Details)

nits

vsk added a comment.Jun 11 2021, 2:30 PM

I think this is moving in the right direction! However, the way ThreadTrace is specified still raises the same scalability concerns (see inline). The "instruction index" concept doesn't seem sufficient. Decoding needs to be done in two stages: a separate "trace index" concept would help with that. In the first stage, lldb identifies which subset of the trace it needs decoded by specifying start/stop trace indices. This should be fast, as counting the number of available trace bytes/"chunks" is O(1). Nothing needs to be decoded at this point. The second stage starts after a trace subset is decoded, and is where instruction indices are useful.

(I'll be OOO next week, but hope to pick up review on 6/21.)
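A rough sketch of the two-stage scheme being proposed, with hypothetical names: trace indices address raw, undecoded chunks, while instruction indices only exist inside a decoded subset:

#include <cstddef>
#include <cstdint>
#include <memory>

// Stage 2 object: instruction indices are only meaningful inside this
// already-decoded window of the trace.
class DecodedTraceWindow {
public:
  virtual ~DecodedTraceWindow() = default;
  virtual size_t GetInstructionCount() = 0;
  virtual uint64_t GetLoadAddress(size_t instruction_index) = 0;
};

// Stage 1 interface: picking a [start, stop) range of trace indices is O(1),
// since it only counts raw bytes/chunks; decoding happens on request.
class TwoStageTrace {
public:
  virtual ~TwoStageTrace() = default;
  virtual size_t GetTraceChunkCount() = 0; // cheap, no decoding
  virtual std::unique_ptr<DecodedTraceWindow> Decode(size_t start_trace_index,
                                                     size_t stop_trace_index) = 0;
};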

lldb/include/lldb/Target/ThreadTrace.h
68

GetInstructionCount requires decoding the whole ThreadTrace (if not for PT, then for other tracing technologies). I don't think we can take the O(n) hit up front.

77

How does a ThreadTrace client query what the valid range of indices is?

lldb/include/lldb/lldb-enumerations.h
791

Do we have an example of a generic trace analysis that needs to differentiate between conditional & unconditional branches?

I mostly commented on ThreadTrace.h and we should resolve the questions in there before reviewing the rest of this patch.

lldb/include/lldb/Target/ThreadTrace.h
44–45

Should zero mean oldest when using TraceDirection::Forwards and newest when using TraceDirection::Backwards? Otherwise how would a person easily go backwards from the end of the thread trace?

61

Maybe name this just "Traverse" and add an additional parameter like:

enum class TraceUnit {
  Instruction, ///< Call the callback with the address of every instruction in the program.
  Branch, ///< Call the callback with the address of each branch that causes program flow to change.
  Return, ///< Call the callback only for branches that return from a function and remove a function from the thread's stack.
  Call, ///< Call the callback only for function calls that add a function to the thread's stack.
};

Then this would make this traverse function very versatile, as it could be used for making backtraces or stepping through the trace (step in, out, over, etc.). The enums could alternatively define bits in a bitfield that could be combined.
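For illustration, a hedged usage sketch of such a combined Traverse call; the trace type, the Traverse signature, and the callback shape are assumptions building on the enum above:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical use of the suggested interface; not real lldb API. Collects the
// addresses of all call sites, e.g. to seed backtrace reconstruction.
template <typename TraceT>
std::vector<uint64_t> CollectCallAddresses(TraceT &trace) {
  std::vector<uint64_t> call_addresses;
  trace.Traverse(/*start_index=*/0, TraceDirection::Forwards, TraceUnit::Call,
                 [&](size_t /*index*/, uint64_t load_address) {
                   call_addresses.push_back(load_address);
                   return true; // keep iterating
                 });
  return call_addresses;
}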

68

This could be very expensive to calculate. Do we want to omit this to avoid having to traverse all of the data and also disassemble everything to find out the instruction counts? Our trace might have 1000 trace branches, but millions of instructions traversed between each branch.

77

Why not just put this information in the callback? It seems like a storage or performance issue if you can only find out the address at callback time.

83

Can we put this info in the callback? Ditto from above comment.

I'll redo this in a different patch

wallace abandoned this revision.Jun 16 2021, 12:25 PM
lldb/source/Plugins/Trace/intel-pt/CMakeLists.txt