I was trying to analyse the llvm-exegesis analysis mode performance,
and for that i wanted to view the LLVM X-Ray log visualization in Chrome
trace viewer. And the llvm-xray convert is sluggish, and sometimes
even ended up being killed by OOM.
So i have looked at the llvm-xray convert itself in heaptrack,
and this was the obvious problem point. It converts a *lot* of ints
to strings.
llvm::to_string() uses raw_string_ostream as the tmp stream.
It makes sense, you don't really know with what it will be called with,
it's a generic support function. But there is a downside to that.
raw_string_ostream is buffered, and minimal size of that buffer
is BUFSIZ (which is 8192 at least here). That is *huge*.
So if you call llvm::to_string(int), even though the output std::string
will be ~12 chars, you still just wasted 8192 temporary bytes...
But for arithmetic types, we can have an upper estimation of the
output string length, and thus we can use SmallString,
and only pay for the final price, not any intermediate tmp price.
While it won't radically improve the llvm-xray convert perf,
(there are likely other issues elsewhere)
this is quite clearly better than using raw_string_ostream,
in *ALL* cases.
xray-log.llvm-exegesis.lwZ0sT was acquired from llvm-exegesis
(compiled with -fxray-instruction-threshold=128)
analysis mode over -benchmarks-file with 10099 points (one full
latency measurement set), with normal runtime of 0.387s.
Timings:
Old:
$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs): 21346.24 msec task-clock # 1.000 CPUs utilized ( +- 0.28% ) 314 context-switches # 14.701 M/sec ( +- 59.13% ) 1 cpu-migrations # 0.037 M/sec ( +-100.00% ) 2181354 page-faults # 102191.251 M/sec ( +- 0.02% ) 85477442102 cycles # 4004415.019 GHz ( +- 0.28% ) (83.33%) 14526427066 stalled-cycles-frontend # 16.99% frontend cycles idle ( +- 0.70% ) (83.33%) 32371533721 stalled-cycles-backend # 37.87% backend cycles idle ( +- 0.27% ) (33.34%) 67896890228 instructions # 0.79 insn per cycle # 0.48 stalled cycles per insn ( +- 0.03% ) (50.00%) 14592654840 branches # 683631198.653 M/sec ( +- 0.02% ) (66.67%) 212207534 branch-misses # 1.45% of all branches ( +- 0.94% ) (83.34%) 21.3502 +- 0.0585 seconds time elapsed ( +- 0.27% )
New:
$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs): 20445.97 msec task-clock # 1.000 CPUs utilized ( +- 0.16% ) 504 context-switches # 24.651 M/sec ( +- 34.45% ) 0 cpu-migrations # 0.020 M/sec ( +- 61.24% ) 2181595 page-faults # 106702.410 M/sec ( +- 0.01% ) 81871726833 cycles # 4004369.000 GHz ( +- 0.16% ) (83.32%) 14329499463 stalled-cycles-frontend # 17.50% frontend cycles idle ( +- 0.57% ) (83.33%) 31260757379 stalled-cycles-backend # 38.18% backend cycles idle ( +- 0.21% ) (33.35%) 64198631119 instructions # 0.78 insn per cycle # 0.49 stalled cycles per insn ( +- 0.06% ) (50.02%) 13619488662 branches # 666132990.101 M/sec ( +- 0.06% ) (66.68%) 197109413 branch-misses # 1.45% of all branches ( +- 0.27% ) (83.34%) 20.4515 +- 0.0324 seconds time elapsed ( +- 0.16% )
Memory:
Old:
total runtime: 39.33s. bytes allocated in total (ignoring deallocations): 79.07GB (2.01GB/s) calls to allocation functions: 33267816 (845799/s) temporary memory allocations: 5832298 (148280/s) peak heap memory consumption: 9.21GB peak RSS (including heaptrack overhead): 147.98GB total memory leaked: 87.41KB
New:
total runtime: 35.34s. bytes allocated in total (ignoring deallocations): 33.20GB (939.39MB/s) calls to allocation functions: 27668393 (782809/s) temporary memory allocations: 232875 (6588/s) peak heap memory consumption: 9.21GB peak RSS (including heaptrack overhead): 147.97GB total memory leaked: 87.16KB
Diff:
total runtime: -3.99s. bytes allocated in total (ignoring deallocations): -45.87GB (11.50GB/s) calls to allocation functions: -5599423 (1404067/s) temporary memory allocations: -5599423 (1404067/s) peak heap memory consumption: 0B peak RSS (including heaptrack overhead): 0B total memory leaked: -255B