This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
include/llvm/Support/
-
llvm/
-
Support/
-
ScopedPrinter.h

Differential D58580

[Support] llvm::to_string(): raw_string_ostream is a memory hog
AbandonedPublic

Authored by lebedev.ri on Feb 23 2019, 7:54 AM.

Download Raw Diff

Details

Reviewers

rnk
zturner
chandlerc

Summary

I was trying to analyse the llvm-exegesis analysis mode performance,
and for that i wanted to view the LLVM X-Ray log visualization in Chrome
trace viewer. And the llvm-xray convert is sluggish, and sometimes
even ended up being killed by OOM.

So i have looked at the llvm-xray convert itself in heaptrack,
and this was the obvious problem point. It converts a *lot* of ints
to strings.

llvm::to_string() uses raw_string_ostream as the tmp stream.
It makes sense, you don't really know with what it will be called with,
it's a generic support function. But there is a downside to that.
raw_string_ostream is buffered, and minimal size of that buffer
is BUFSIZ (which is 8192 at least here). That is *huge*.
So if you call llvm::to_string(int), even though the output std::string
will be ~12 chars, you still just wasted 8192 temporary bytes...

But for arithmetic types, we can have an upper estimation of the
output string length, and thus we can use SmallString,
and only pay for the final price, not any intermediate tmp price.

While it won't radically improve the llvm-xray convert perf,
(there are likely other issues elsewhere)
this is quite clearly better than using raw_string_ostream,
in *ALL* cases.

xray-log.llvm-exegesis.lwZ0sT was acquired from llvm-exegesis
(compiled with -fxray-instruction-threshold=128)
analysis mode over -benchmarks-file with 10099 points (one full
latency measurement set), with normal runtime of 0.387s.

Timings:
Old:

$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs):

          21346.24 msec task-clock                #    1.000 CPUs utilized            ( +-  0.28% )
               314      context-switches          #   14.701 M/sec                    ( +- 59.13% )
                 1      cpu-migrations            #    0.037 M/sec                    ( +-100.00% )
           2181354      page-faults               # 102191.251 M/sec                  ( +-  0.02% )
       85477442102      cycles                    # 4004415.019 GHz                   ( +-  0.28% )  (83.33%)
       14526427066      stalled-cycles-frontend   #   16.99% frontend cycles idle     ( +-  0.70% )  (83.33%)
       32371533721      stalled-cycles-backend    #   37.87% backend cycles idle      ( +-  0.27% )  (33.34%)
       67896890228      instructions              #    0.79  insn per cycle         
                                                  #    0.48  stalled cycles per insn  ( +-  0.03% )  (50.00%)
       14592654840      branches                  # 683631198.653 M/sec               ( +-  0.02% )  (66.67%)
         212207534      branch-misses             #    1.45% of all branches          ( +-  0.94% )  (83.34%)

           21.3502 +- 0.0585 seconds time elapsed  ( +-  0.27% )

New:

$ perf stat -r 5 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace.yml xray-log.llvm-exegesis.lwZ0sT' (5 runs):

          20445.97 msec task-clock                #    1.000 CPUs utilized            ( +-  0.16% )
               504      context-switches          #   24.651 M/sec                    ( +- 34.45% )
                 0      cpu-migrations            #    0.020 M/sec                    ( +- 61.24% )
           2181595      page-faults               # 106702.410 M/sec                  ( +-  0.01% )
       81871726833      cycles                    # 4004369.000 GHz                   ( +-  0.16% )  (83.32%)
       14329499463      stalled-cycles-frontend   #   17.50% frontend cycles idle     ( +-  0.57% )  (83.33%)
       31260757379      stalled-cycles-backend    #   38.18% backend cycles idle      ( +-  0.21% )  (33.35%)
       64198631119      instructions              #    0.78  insn per cycle         
                                                  #    0.49  stalled cycles per insn  ( +-  0.06% )  (50.02%)
       13619488662      branches                  # 666132990.101 M/sec               ( +-  0.06% )  (66.68%)
         197109413      branch-misses             #    1.45% of all branches          ( +-  0.27% )  (83.34%)

           20.4515 +- 0.0324 seconds time elapsed  ( +-  0.16% )

Memory:
Old:

total runtime: 39.33s.
bytes allocated in total (ignoring deallocations): 79.07GB (2.01GB/s)
calls to allocation functions: 33267816 (845799/s)
temporary memory allocations: 5832298 (148280/s)
peak heap memory consumption: 9.21GB
peak RSS (including heaptrack overhead): 147.98GB
total memory leaked: 87.41KB

New:

total runtime: 35.34s.
bytes allocated in total (ignoring deallocations): 33.20GB (939.39MB/s)
calls to allocation functions: 27668393 (782809/s)
temporary memory allocations: 232875 (6588/s)
peak heap memory consumption: 9.21GB
peak RSS (including heaptrack overhead): 147.97GB
total memory leaked: 87.16KB

Diff:

total runtime: -3.99s.
bytes allocated in total (ignoring deallocations): -45.87GB (11.50GB/s)
calls to allocation functions: -5599423 (1404067/s)
temporary memory allocations: -5599423 (1404067/s)
peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: -255B

Diff Detail

Repository: rL LLVM

Event Timeline

lebedev.ri created this revision.Feb 23 2019, 7:54 AM

Herald added subscribers: jdoerfert, kristina, courbet. · View Herald TranscriptFeb 23 2019, 7:54 AM

lebedev.ri mentioned this in D58584: [XRay][tools] Revert "Use Support/JSON.h in llvm-xray convert".Feb 23 2019, 2:17 PM

Diffusion mentioned this in rL354764: [XRay][tools] Revert "Use Support/JSON.h in llvm-xray convert".Feb 24 2019, 11:41 PM

lebedev.ri mentioned this in rG49b6f81a74ae: [XRay][tools] Revert "Use Support/JSON.h in llvm-xray convert".Feb 24 2019, 11:41 PM

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

Also, how did llvm::to_string end up being so popular. We should move it to STLExtras.h from ScopedPrinter.h, or discourage its use in general.

This revision now requires changes to proceed.Feb 25 2019, 10:36 AM

I don't think we should discourage its use. It turns out to be extremely useful. But we also can't put it in STLExtras.h I think, because it would create a circular dependency. Same goes for StringExtras.h. We could either put it in raw_ostream.h or make a new file called StringConvert.h

In D58580#1409433, @zturner wrote:

I don't think we should discourage its use. It turns out to be extremely useful. But we also can't put it in STLExtras.h I think, because it would create a circular dependency. Same goes for StringExtras.h. We could either put it in raw_ostream.h or make a new file called StringConvert.h

I think for basic integers, if you want std::strings, you can use std::to_string. But, if you have something with an interesting operator<< overload (llvm/Format.h), then you need llvm::to_string or something like it. What about putting it in Format.h, since based on the other use sites that is the most common?

In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

In D58580#1409470, @lebedev.ri wrote:

In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

I don't think it needs to be benchmarked. Looking at the callers, for most of them it will save one 8K alloc/free. I'm assuming that repeated std::string::append isn't O(n^2) in any STL we care about. At the end of the day, anything we put into a std::string is small enough to put in memory, and shouldn't be performance sensitive.

In D58580#1409478, @rnk wrote:

In D58580#1409470, @lebedev.ri wrote:

In D58580#1409395, @rnk wrote:

operator<< does this internally, see write_unsigned_impl. Instead of doing this, let's just disable buffering in raw_string_ostream. We already do it in raw_svector_ostream.

I can do that, but i don't think i will be able to do any benchmarks. Will that change be ok as-is?

I don't think it needs to be benchmarked. Looking at the callers, for most of them it will save one 8K alloc/free. I'm assuming that repeated std::string::append isn't O(n^2) in any STL we care about. At the end of the day, anything we put into a std::string is small enough to put in memory, and shouldn't be performance sensitive.

Ok.
Will do that in a new review.

lebedev.ri mentioned this in D58643: [Support] Make raw_string_ostream unbuffered.Feb 25 2019, 12:20 PM

Diffusion mentioned this in rL354819: [Support] Make raw_string_ostream unbuffered.Feb 25 2019, 12:51 PM

lebedev.ri mentioned this in rG0397f495c0c4: [Support] Make raw_string_ostream unbuffered.Feb 25 2019, 12:51 PM

hans mentioned this in rL354856: Merging r354764:.Feb 26 2019, 2:04 AM

hansw mentioned this in rGee57e9e190b8: Merging r354764: --------------------------------------------------------------….Feb 26 2019, 2:04 AM

Revision Contents

Path

Size

include/

llvm/

Support/

ScopedPrinter.h

25 lines

Diff 188050

include/llvm/Support/ScopedPrinter.h

//===-- ScopedPrinter.h ----------------------------------------- C++ ---===//		//===-- ScopedPrinter.h ----------------------------------------- C++ ---===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_SUPPORT_SCOPEDPRINTER_H		#ifndef LLVM_SUPPORT_SCOPEDPRINTER_H
#define LLVM_SUPPORT_SCOPEDPRINTER_H		#define LLVM_SUPPORT_SCOPEDPRINTER_H

#include "llvm/ADT/APSInt.h"		#include "llvm/ADT/APSInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
		#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Support/DataTypes.h"		#include "llvm/Support/DataTypes.h"
#include "llvm/Support/Endian.h"		#include "llvm/Support/Endian.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include <algorithm>		#include <algorithm>
		#include <limits>

namespace llvm {		namespace llvm {

template <typename T> struct EnumEntry {		template <typename T> struct EnumEntry {
StringRef Name;		StringRef Name;
// While Name suffices in most of the cases, in certain cases		// While Name suffices in most of the cases, in certain cases
// GNU style and LLVM style of ELFDumper do not		// GNU style and LLVM style of ELFDumper do not
// display same string for same enum. The AltName if initialized appropriately		// display same string for same enum. The AltName if initialized appropriately
Show All 25 Lines	struct HexNumber {
HexNumber(unsigned long Value) : Value(Value) {}		HexNumber(unsigned long Value) : Value(Value) {}
HexNumber(unsigned long long Value) : Value(Value) {}		HexNumber(unsigned long long Value) : Value(Value) {}
uint64_t Value;		uint64_t Value;
};		};

raw_ostream &operator<<(raw_ostream &OS, const HexNumber &Value);		raw_ostream &operator<<(raw_ostream &OS, const HexNumber &Value);
const std::string to_hexString(uint64_t Value, bool UpperCase = true);		const std::string to_hexString(uint64_t Value, bool UpperCase = true);

template <class T> const std::string to_string(const T &Value) {		template <class T, typename = typename std::enable_if<
		!std::numeric_limits<T>::is_specialized>::type>
		const std::string to_string(const T &Value) {
std::string number;		std::string number;
llvm::raw_string_ostream stream(number);		llvm::raw_string_ostream stream(number);
stream << Value;		stream << Value;
return stream.str();		return stream.str();
}		}

		template <class T,
		typename = typename std::enable_if<
		std::numeric_limits<T>::is_specialized>::type,
		void * = nullptr>
		const std::string to_string(const T &Value) {
		static constexpr auto NumDigits = std::numeric_limits<T>::is_integer
		? std::numeric_limits<T>::digits10
		: std::numeric_limits<T>::max_digits10;
		// Optional char for sign bit, plus the required number of base-10 digits.
		// If integer - then +1 is for rounding up. Else, for the decimal separator.
		static constexpr auto BufLen =
		std::numeric_limits<T>::is_signed + NumDigits + 1;
		llvm::SmallString<BufLen> number;
		llvm::raw_svector_ostream stream(number);
		stream << Value;
		assert(stream.str().size() <= BufLen && "small size prediction failed");
		return stream.str();
		}

class ScopedPrinter {		class ScopedPrinter {
public:		public:
ScopedPrinter(raw_ostream &OS) : OS(OS), IndentLevel(0) {}		ScopedPrinter(raw_ostream &OS) : OS(OS), IndentLevel(0) {}

void flush() { OS.flush(); }		void flush() { OS.flush(); }

void indent(int Levels = 1) { IndentLevel += Levels; }		void indent(int Levels = 1) { IndentLevel += Levels; }

▲ Show 20 Lines • Show All 313 Lines • Show Last 20 Lines