This is an archive of the discontinued LLVM Phabricator instance.

[trace][intel pt] Introduce wall clock time for each trace item
ClosedPublic

Authored by wallace on Jul 18 2022, 5:29 PM.

Details

Summary
  • Decouple TSCs from trace items
  • Turn TSCs into events just like CPUs. The new name is HW clock tick, wich could be reused by other vendors.
  • Add a GetWallTime that returns the wall time that the trace plug-in can infer for each trace item.
  • For intel pt, we are doing the following interpolation: if an instruction takes less than 1 TSC, we use that duration, otherwise, we assume the instruction took 1 TSC. This helps us avoid having to handle context switches, changes to kernel, idle times, decoding errors, etc. We are just trying to show some approximation and not the real data. For the real data, TSCs are the way to go. Besides that, we are making sure that no two trace items will give the same interpolation value. Finally, we are using as time 0 the time at which tracing started. For simplicity, I'm assuming all errors and events have a duration.

Sample output:

(lldb) r
Process 750047 launched: '/home/wallace/a.out' (x86_64)
Process 750047 stopped
* thread #1, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000402479 a.out`main at main.cpp:29:20
   26   };
   27
   28   int main() {
-> 29     std::vector<int> vvv;
   30     for (int i = 0; i < 100; i++)
   31       vvv.push_back(i);
   32
(lldb) process trace start -s 64kb -t --per-cpu
(lldb) b 60
Breakpoint 2: where = a.out`main + 1689 at main.cpp:60:23, address = 0x0000000000402afe
(lldb) c
Process 750047 resuming
Process 750047 stopped
* thread #1, name = 'a.out', stop reason = breakpoint 2.1
    frame #0: 0x0000000000402afe a.out`main at main.cpp:60:23
   57     map<int, int> m;
   58     m[3] = 4;
   59
-> 60     map<string, string> m2;
   61     m2["5"] = "6";
   62
   63     std::vector<std::string> vs = {"2", "3"};
(lldb) thread trace dump instructions -t -f -e thread #1: tid = 750047
    0: [379567.000 ns] (event) HW clock tick [48599428476224707]
    1: [379569.000 ns] (event) CPU core changed [new CPU=2]
    2: [390487.000 ns] (event) HW clock tick [48599428476246495]
    3: [1602508.000 ns] (event) HW clock tick [48599428478664855]
    4: [1662745.000 ns] (event) HW clock tick [48599428478785046]
  libc.so.6`malloc
    5: [1662746.995 ns] 0x00007ffff7176660    endbr64
    6: [1662748.991 ns] 0x00007ffff7176664    movq   0x32387d(%rip), %rax      ;  + 408
    7: [1662750.986 ns] 0x00007ffff717666b    pushq  %r12
    8: [1662752.981 ns] 0x00007ffff717666d    pushq  %rbp
    9: [1662754.977 ns] 0x00007ffff717666e    pushq  %rbx
    10: [1662756.972 ns] 0x00007ffff717666f    movq   (%rax), %rax
    11: [1662758.967 ns] 0x00007ffff7176672    testq  %rax, %rax
    12: [1662760.963 ns] 0x00007ffff7176675    jne    0x9c7e0                   ; <+384>
    13: [1662762.958 ns] 0x00007ffff717667b    leaq   0x17(%rdi), %rax
    14: [1662764.953 ns] 0x00007ffff717667f    cmpq   $0x1f, %rax
    15: [1662766.949 ns] 0x00007ffff7176683    ja     0x9c730                   ; <+208>
    16: [1662768.944 ns] 0x00007ffff7176730    andq   $-0x10, %rax
    17: [1662770.939 ns] 0x00007ffff7176734    cmpq   $-0x41, %rax
    18: [1662772.935 ns] 0x00007ffff7176738    seta   %dl
    19: [1662774.930 ns] 0x00007ffff717673b    jmp    0x9c690                   ; <+48>
    20: [1662776.925 ns] 0x00007ffff7176690    cmpq   %rdi, %rax
    21: [1662778.921 ns] 0x00007ffff7176693    jb     0x9c7b0                   ; <+336>
    22: [1662780.916 ns] 0x00007ffff7176699    testb  %dl, %dl
    23: [1662782.911 ns] 0x00007ffff717669b    jne    0x9c7b0                   ; <+336>
    24: [1662784.906 ns] 0x00007ffff71766a1    movq   0x3236c0(%rip), %r12      ;  + 24
(lldb) thread trace dump instructions -t -f -e -J -c 4
[
  {
    "id": 0,
    "timestamp_ns": "379567.000000",
    "event": "HW clock tick",
    "hwClock": 48599428476224707
  },
  {
    "id": 1,
    "timestamp_ns": "379569.000000",
    "event": "CPU core changed",
    "cpuId": 2
  },
  {
    "id": 2,
    "timestamp_ns": "390487.000000",
    "event": "HW clock tick",
    "hwClock": 48599428476246495
  },
  {
    "id": 3,
    "timestamp_ns": "1602508.000000",
    "event": "HW clock tick",
    "hwClock": 48599428478664855
  },
  {
    "id": 4,
    "timestamp_ns": "1662745.000000",
    "event": "HW clock tick",
    "hwClock": 48599428478785046
  },
  {
    "id": 5,
    "timestamp_ns": "1662746.995324",
    "loadAddress": "0x7ffff7176660",
    "module": "libc.so.6",
    "symbol": "malloc",
    "mnemonic": "endbr64"
  },
  {
    "id": 6,
    "timestamp_ns": "1662748.990648",
    "loadAddress": "0x7ffff7176664",
    "module": "libc.so.6",
    "symbol": "malloc",
    "mnemonic": "movq"
  },
  {
    "id": 7,
    "timestamp_ns": "1662750.985972",
    "loadAddress": "0x7ffff717666b",
    "module": "libc.so.6",
    "symbol": "malloc",
    "mnemonic": "pushq"
  },
  {
    "id": 8,
    "timestamp_ns": "1662752.981296",
    "loadAddress": "0x7ffff717666d",
    "module": "libc.so.6",
    "symbol": "malloc",
    "mnemonic": "pushq"
  }
]

Diff Detail

Event Timeline

wallace created this revision.Jul 18 2022, 5:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2022, 5:29 PM
wallace requested review of this revision.Jul 18 2022, 5:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2022, 5:29 PM
wallace updated this revision to Diff 445664.Jul 18 2022, 5:36 PM
wallace edited the summary of this revision. (Show Details)

fix nits

jj10306 requested changes to this revision.Jul 19 2022, 8:04 AM
jj10306 added inline comments.
lldb/source/Plugins/Trace/intel-pt/DecodedThread.cpp
243–244

What's purpose does the TscRange class serve? Specifically, why can't we take the same approach we take for CPU events where we just store a map and the chronologically last CPU ID on the DecodedThread:

// The cpu information is stored as a map. It maps `instruction index -> CPU`
// A CPU is associated with the next instructions that follow until the next cpu is seen.
std::map<uint64_t, lldb::cpu_id_t> m_cpus;
/// This is the chronologically last CPU ID.
llvm::Optional<uint64_t> m_last_cpu = llvm::None;

we currently store this same information for TSCs but then we have this whole additional TscRange notion that lives on a TraceCursorIntelPT

This is basically an extension of the conversation we had inline on (https://reviews.llvm.org/D129340) about eventually moving TSCs to be the same as CPUs. Perhaps now that we are making this change to treat TSCs like events (just like CPU changes) it would be a good time to revisit the design and make it simpler/consistent with CPU change events, if possible.

tl;dr iiuc the TSC lookup for an instruction is basically identical to the CPU ID lookup for an instruction, so why should the way we do the lookup be different for TSCs vs CPU IDs

244–250

I found this a little difficult to reason about.
Perhaps consistent use of the different iterator variables, minor renaming and docs may make this a little more understandable. Feel free to take none or any of the above suggestions 🙂

251–258

Doesn't this check essentially make it so m_tsc_duration_per_instruction is either 0 or 1?
Where is the heuristic that every instruction should take <= 1 TSC to execute coming from?

From a visualization perspective, how should we display instructions in a TscRange where m_tsc_duration_per_instruction is 0? Doesn't this force us to stack those instructions since will all have the same interpolated timestamp?

lldb/source/Plugins/Trace/intel-pt/DecodedThread.h
256–260

Now that we are introducing timestamps (ie wall clock from TSCs), should we rename this to make it clear that these are TSC values and not timestamps?

This revision now requires changes to proceed.Jul 19 2022, 8:04 AM

AFAICT this doesn't handle the TSCs we get for CPU change events through the context switch traces, is that correct?
This will be very useful to have for visualization purposes.

wallace planned changes to this revision.Jul 20 2022, 12:20 AM
jj10306 requested changes to this revision.Jul 20 2022, 7:32 AM
jj10306 added inline comments.
lldb/source/Plugins/Trace/intel-pt/LibiptDecoder.cpp
288

if we have both a start and end TSC, shouldn't we emit the start here (like you're doing) and then after decoding all the instructions we should also emit the the end TSC?

lldb/source/Plugins/Trace/intel-pt/TraceCursorIntelPT.cpp
119–122

Is this correct?
I don't see an issue with choosing a 1 TSC per instruction as a heuristic in this case, but we should be cautious not to explain this heuristic as being tied to any "expected" performance of the hardware because there are way too many variables that influence IPC like differences in workload, microarchitecture (speculative execution/branch prediction implementation, vector units, super scalar, etc), etc not to mention the difference b/t TSC and CPU cycles (like you mentioned).

124

Won't our interpolation calculation be messed up since items_since_last_tsc will be counting all trace errors, events, etc and not just the instructions/events that we want to assign a timestamp to and thus should be part of this calculation?

I'm thinking of a case like below :
i - instruction, e - error

tsc, cpu_change, i, e, e, e, tsc
in this case the items_since last_tsc would be 5 even though there is only 1 instruction.

i guess this brings up a more general question around how timestamps should be assigned. For example, in this case I see a couple different options:

  1. What we are doing now where we assign all events timestamps

Pro: easy implementation
Con: visualization may be weird bc timestamps are being influenced by events (ie errors) that users of the visualization don't care or know about

  1. Ignore errors but treat all instructions and events the same (ie instructions and events get timestamps)
  2. Ignore errors and treat instructions and events differently (ie instructions are the only things that contribute to items_since_last_tsc) and then events are simply assigned the same timestamp as an instruction

We can discuss this in more depth offline.

wallace updated this revision to Diff 446303.Jul 20 2022, 5:14 PM
wallace edited the summary of this revision. (Show Details)

final version

wallace added inline comments.Jul 20 2022, 5:32 PM
lldb/source/Plugins/Trace/intel-pt/LibiptDecoder.cpp
288

you are right

wallace edited the summary of this revision. (Show Details)Jul 20 2022, 5:33 PM
wallace updated this revision to Diff 446541.Jul 21 2022, 9:40 AM

fix a bug that happens when the initial trace items don't have timing information

This revision was not accepted when it landed; it landed in state Needs Review.Jul 26 2022, 12:05 PM
This revision was automatically updated to reflect the committed changes.