This is an archive of the discontinued LLVM Phabricator instance.

[Support] llvm::to_string(): raw_string_ostream is a memory hog
AbandonedPublic

Authored by lebedev.ri on Feb 23 2019, 7:54 AM.

Details

Summary

I was trying to analyse the llvm-exegesis analysis mode performance,
and for that i wanted to view the LLVM X-Ray log visualization in Chrome
trace viewer. And the llvm-xray convert is sluggish, and sometimes
even ended up being killed by OOM.

So i have looked at the llvm-xray convert itself in heaptrack,
and this was the obvious problem point. It converts a *lot* of ints
to strings.

llvm::to_string() uses raw_string_ostream as the tmp stream.
It makes sense, you don't really know with what it will be called with,
it's a generic support function. But there is a downside to that.
raw_string_ostream is buffered, and minimal size of that buffer
is BUFSIZ (which is 8192 at least here). That is *huge*.
So if you call llvm::to_string(int), even though the output std::string
will be ~12 chars, you still just wasted 8192 temporary bytes...

But for arithmetic types, we can have an upper estimation of the
output string length, and thus we can use SmallString,
and only pay for the final price, not any intermediate tmp price.

While it won't radically improve the llvm-xray convert perf,
(there are likely other issues elsewhere)
this is quite clearly better than using raw_string_ostream,
in *ALL* cases.

xray-log.llvm-exegesis.lwZ0sT was acquired from llvm-exegesis
(compiled with -fxray-instruction-threshold=128)
analysis mode over -benchmarks-file with 10099 points (one full
latency measurement set), with normal runtime of 0.387s.

Timings:
Old:

$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs):

          21346.24 msec task-clock                #    1.000 CPUs utilized            ( +-  0.28% )
               314      context-switches          #   14.701 M/sec                    ( +- 59.13% )
                 1      cpu-migrations            #    0.037 M/sec                    ( +-100.00% )
           2181354      page-faults               # 102191.251 M/sec                  ( +-  0.02% )
       85477442102      cycles                    # 4004415.019 GHz                   ( +-  0.28% )  (83.33%)
       14526427066      stalled-cycles-frontend   #   16.99% frontend cycles idle     ( +-  0.70% )  (83.33%)
       32371533721      stalled-cycles-backend    #   37.87% backend cycles idle      ( +-  0.27% )  (33.34%)
       67896890228      instructions              #    0.79  insn per cycle         
                                                  #    0.48  stalled cycles per insn  ( +-  0.03% )  (50.00%)
       14592654840      branches                  # 683631198.653 M/sec               ( +-  0.02% )  (66.67%)
         212207534      branch-misses             #    1.45% of all branches          ( +-  0.94% )  (83.34%)

           21.3502 +- 0.0585 seconds time elapsed  ( +-  0.27% )

New:

$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs):

          20445.97 msec task-clock                #    1.000 CPUs utilized            ( +-  0.16% )
               504      context-switches          #   24.651 M/sec                    ( +- 34.45% )
                 0      cpu-migrations            #    0.020 M/sec                    ( +- 61.24% )
           2181595      page-faults               # 106702.410 M/sec                  ( +-  0.01% )
       81871726833      cycles                    # 4004369.000 GHz                   ( +-  0.16% )  (83.32%)
       14329499463      stalled-cycles-frontend   #   17.50% frontend cycles idle     ( +-  0.57% )  (83.33%)
       31260757379      stalled-cycles-backend    #   38.18% backend cycles idle      ( +-  0.21% )  (33.35%)
       64198631119      instructions              #    0.78  insn per cycle         
                                                  #    0.49  stalled cycles per insn  ( +-  0.06% )  (50.02%)
       13619488662      branches                  # 666132990.101 M/sec               ( +-  0.06% )  (66.68%)
         197109413      branch-misses             #    1.45% of all branches          ( +-  0.27% )  (83.34%)

           20.4515 +- 0.0324 seconds time elapsed  ( +-  0.16% )

Memory:
Old:

total runtime: 39.33s.
bytes allocated in total (ignoring deallocations): 79.07GB (2.01GB/s)
calls to allocation functions: 33267816 (845799/s)
temporary memory allocations: 5832298 (148280/s)
peak heap memory consumption: 9.21GB
peak RSS (including heaptrack overhead): 147.98GB
total memory leaked: 87.41KB

New:

total runtime: 35.34s.
bytes allocated in total (ignoring deallocations): 33.20GB (939.39MB/s)
calls to allocation functions: 27668393 (782809/s)
temporary memory allocations: 232875 (6588/s)
peak heap memory consumption: 9.21GB
peak RSS (including heaptrack overhead): 147.97GB
total memory leaked: 87.16KB

Diff:

total runtime: -3.99s.
bytes allocated in total (ignoring deallocations): -45.87GB (11.50GB/s)
calls to allocation functions: -5599423 (1404067/s)
temporary memory allocations: -5599423 (1404067/s)
peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: -255B

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Feb 23 2019, 7:54 AM
rnk requested changes to this revision.Feb 25 2019, 10:36 AM

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

Also, how did llvm::to_string end up being so popular. We should move it to STLExtras.h from ScopedPrinter.h, or discourage its use in general.

This revision now requires changes to proceed.Feb 25 2019, 10:36 AM

I don't think we should discourage its use. It turns out to be extremely useful. But we also can't put it in STLExtras.h I think, because it would create a circular dependency. Same goes for StringExtras.h. We could either put it in raw_ostream.h or make a new file called StringConvert.h

rnk added a comment.Feb 25 2019, 10:54 AM

I don't think we should discourage its use. It turns out to be extremely useful. But we also can't put it in STLExtras.h I think, because it would create a circular dependency. Same goes for StringExtras.h. We could either put it in raw_ostream.h or make a new file called StringConvert.h

I think for basic integers, if you want std::strings, you can use std::to_string. But, if you have something with an interesting operator<< overload (llvm/Format.h), then you need llvm::to_string or something like it. What about putting it in Format.h, since based on the other use sites that is the most common?

In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

rnk added a comment.Feb 25 2019, 11:08 AM
In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

I don't think it needs to be benchmarked. Looking at the callers, for most of them it will save one 8K alloc/free. I'm assuming that repeated std::string::append isn't O(n^2) in any STL we care about. At the end of the day, anything we put into a std::string is small enough to put in memory, and shouldn't be performance sensitive.

lebedev.ri abandoned this revision.Feb 25 2019, 11:11 AM
In D58580#1409478, @rnk wrote:
In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

I don't think it needs to be benchmarked. Looking at the callers, for most of them it will save one 8K alloc/free. I'm assuming that repeated std::string::append isn't O(n^2) in any STL we care about. At the end of the day, anything we put into a std::string is small enough to put in memory, and shouldn't be performance sensitive.

Ok.
Will do that in a new review.