This is an RFC because of the uint8_t CPU change.
That change needs discussing.
In "basic log mode", we indeed only ever read 8 bits into that field.
But in FDR mode, the CPU field in log is 16 bits.
But if you look in the compiler-rt part, as far as i can tell, the CPU id is always
(in both modes, basic and FDR) received from uint64_t __xray::readTSC(uint8_t &CPU).
So naturally, CPU id is always only 8 bit, and in FDR mode, extra 8 bits is just padding.
Please don't take my word for it, do recheck!
Thus, i do not believe we need to have uint16_t for CPU. With the other current code
we can't ever get more than uint8_t value there, thus we save 1 byte.
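To make that concrete, here is a tiny standalone sketch of the interface shape. Only the signature mirrors the readTSC() quoted above; the body, the CPU id and the TSC value are hypothetical stand-ins, not the compiler-rt implementation:

#include <cstdint>
#include <cstdio>

// Stand-in with the same shape as uint64_t __xray::readTSC(uint8_t &CPU).
// The real runtime queries the hardware; the only point here is that the
// CPU id travels through a uint8_t out-parameter.
static uint64_t readTSC(uint8_t &CPU) {
  CPU = 7;           // hypothetical CPU id; always fits in 8 bits
  return 123456789;  // hypothetical TSC value
}

int main() {
  uint8_t CPU = 0;
  const uint64_t TSC = readTSC(CPU);
  // Even if the trace format reserves 16 bits for the CPU id (as FDR does),
  // the value that reaches it came through a uint8_t, so the high byte is
  // always zero -- hence a uint8_t field in XRayRecord loses nothing.
  const uint16_t FdrCpuField = CPU;
  std::printf("TSC=%llu, CPU field=%u, high byte=%u\n",
              (unsigned long long)TSC, (unsigned)FdrCpuField,
              (unsigned)(FdrCpuField >> 8));
  return 0;
}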
The rest of the patch is trivial.
By specifying the underlying type of the RecordTypes enum we save 3 more bytes.
llvm::SmallVector<>/llvm::SmallString cost only 16 bytes each, as opposed to the 24/32 bytes of std::vector/std::string.
Thus, in total, the old sizeof(XRayRecord) was 88 bytes, and the new one is 56 bytes.
There is no padding between the fields of XRayRecord, and XRayRecord itself is not
padded when stored into a vector, so the footprint of XRayRecord is now optimal.
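To make the byte accounting auditable, here is a minimal standalone sketch for a 64-bit libstdc++ setup (matching the 24/32-byte container sizes quoted above). The field list is an approximation for illustration, not the verbatim upstream llvm::xray::XRayRecord, and both variants keep std::vector/std::string so the sketch builds with the standard library alone:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Without an explicit underlying type the enum is int-sized (4 bytes);
// ": uint8_t" shrinks it to 1 byte -- the "3 bytes saved" mentioned above.
enum class RecordTypesOld { ENTER, EXIT };
enum class RecordTypesNew : uint8_t { ENTER, EXIT };

struct RecordOld {                  // approximate old layout
  uint16_t RecordType;              // offset 0
  uint16_t CPU;                     // offset 2, 16-bit CPU id
  RecordTypesOld Type;              // offset 4, int-sized enum
  int32_t FuncId;                   // offset 8
  uint64_t TSC;                     // offset 16 -- 4 bytes of padding before it
  uint32_t TId;                     // offset 24
  uint32_t PId;                     // offset 28
  std::vector<uint64_t> CallArgs;   // offset 32, 24 bytes
  std::string Data;                 // offset 56, 32 bytes  => sizeof == 88
};

struct RecordNew {                  // same fields, narrowed as described above
  uint16_t RecordType;              // offset 0
  uint8_t CPU;                      // offset 2 -- 8 bits suffice, see readTSC()
  RecordTypesNew Type;              // offset 3, 1-byte enum
  int32_t FuncId;                   // offset 4
  uint64_t TSC;                     // offset 8 -- no padding anywhere now
  uint32_t TId;                     // offset 16
  uint32_t PId;                     // offset 20
  std::vector<uint64_t> CallArgs;   // offset 24
  std::string Data;                 // offset 48  => sizeof == 80 here
};

int main() {
  // Prints 88 and 80 on the setup described above; the remaining
  // 80 - 56 = 24 bytes are exactly what the switch of CallArgs/Data to the
  // 16-byte llvm::SmallVector<>/llvm::SmallString recovers (8 + 16 bytes).
  std::printf("sizeof(RecordOld) = %zu, sizeof(RecordNew) = %zu\n",
              sizeof(RecordOld), sizeof(RecordNew));
  return 0;
}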
This is important because XRayRecord has the biggest memory footprint and contributes
most to the peak heap memory usage, at least for llvm-xray convert.
Some numbers:
xray-log.llvm-exegesis.FswRtO was acquired from llvm-exegesis
(compiled with -fxray-instruction-threshold=128) running in analysis mode
over a -benchmarks-file with 10099 points (one full latency measurement set),
with a normal runtime of 0.387s.
Time old:
$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO' (9 runs):

          7607.69 msec task-clock                #    0.999 CPUs utilized            ( +-  0.48% )
              522      context-switches          #   68.635 M/sec                    ( +- 39.85% )
                1      cpu-migrations            #    0.073 M/sec                    ( +- 60.83% )
            77905      page-faults               # 10241.090 M/sec                   ( +-  3.13% )
      30471867671      cycles                    # 4005708.241 GHz                   ( +-  0.48% )  (83.32%)
       2424264020      stalled-cycles-frontend   #    7.96% frontend cycles idle     ( +-  1.84% )  (83.30%)
      11097550400      stalled-cycles-backend    #   36.42% backend cycles idle      ( +-  0.35% )  (33.38%)
      36899274774      instructions              #    1.21  insn per cycle
                                                 #    0.30  stalled cycles per insn  ( +-  0.07% )  (50.04%)
       6538597488      branches                  # 859537529.125 M/sec               ( +-  0.07% )  (66.70%)
         79769896      branch-misses             #    1.22% of all branches          ( +-  0.67% )  (83.35%)

           7.6143 +- 0.0371 seconds time elapsed  ( +- 0.49% )
Time new:
$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO' (9 runs):

          7207.49 msec task-clock                #    1.000 CPUs utilized            ( +-  0.46% )
              174      context-switches          #   24.159 M/sec                    ( +- 30.10% )
                0      cpu-migrations            #    0.062 M/sec                    ( +- 39.53% )
            52126      page-faults               # 7232.740 M/sec                    ( +-  0.69% )
      28876446408      cycles                    # 4006783.905 GHz                   ( +-  0.46% )  (83.31%)
       2352902586      stalled-cycles-frontend   #    8.15% frontend cycles idle     ( +-  2.08% )  (83.33%)
       8986901047      stalled-cycles-backend    #   31.12% backend cycles idle      ( +-  1.00% )  (33.36%)
      38630170181      instructions              #    1.34  insn per cycle
                                                 #    0.23  stalled cycles per insn  ( +-  0.04% )  (50.02%)
       7016819734      branches                  # 973626739.925 M/sec               ( +-  0.04% )  (66.68%)
         86887572      branch-misses             #    1.24% of all branches          ( +-  0.39% )  (83.33%)

           7.2099 +- 0.0330 seconds time elapsed  ( +- 0.46% )
(Nice, the runtime accidentally improved by ~5% as well.)
Memory old:
$ heaptrack_print heaptrack.llvm-xray.3976.gz | tail -n 7
total runtime: 18.16s.
bytes allocated in total (ignoring deallocations): 5.25GB (289.03MB/s)
calls to allocation functions: 21840309 (1202792/s)
temporary memory allocations: 228301 (12573/s)
peak heap memory consumption: 354.62MB
peak RSS (including heaptrack overhead): 4.30GB
total memory leaked: 87.42KB
Memory new:
$ heaptrack_print heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: 17.93s.
bytes allocated in total (ignoring deallocations): 5.05GB (281.73MB/s)
calls to allocation functions: 21840309 (1217747/s)
temporary memory allocations: 228301 (12729/s)
peak heap memory consumption: 267.77MB
peak RSS (including heaptrack overhead): 2.16GB
total memory leaked: 83.50KB
Memory diff:
$ heaptrack_print -d heaptrack.llvm-xray.3976.gz heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: -0.22s.
bytes allocated in total (ignoring deallocations): -195.36MB (876.07MB/s)
calls to allocation functions: 0 (0/s)
temporary memory allocations: 0 (0/s)
peak heap memory consumption: -86.86MB
peak RSS (including heaptrack overhead): 0B
total memory leaked: -3.92KB
So we have indeed reduced the peak heap memory usage, by ~25%.
Not by a third, because something else is now the top contributor to the peak.