
make ConstString allocate memory in non-tiny chunks
Closed, Public

Authored by llunak on Oct 5 2019, 7:46 PM.

Details

Summary

BumpPtrAllocator allocates in 4KiB chunks, which with any larger project is going to result in a large number of allocations. Increasing allocation size this way can save 10%-20% of symbol load time for a huge C++ project with correctly built debuginfo (specifically, attaching to LibreOffice Calc built with -gpubnames -gdwarf-aranges goes down from ~22s to ~18s).

Diff Detail

Event Timeline

llunak created this revision.Oct 5 2019, 7:46 PM
llunak updated this revision to Diff 223414.Oct 6 2019, 6:39 AM

Increased default number of buckets for StringMap too.

aprantl accepted this revision.Oct 8 2019, 10:11 AM

This makes sense to me. We may need to make this configurable for low-memory hosts at some point.

This revision is now accepted and ready to land.Oct 8 2019, 10:11 AM

I assume you experimented with many different values, and these are the best tradeoff?

This makes sense to me. We may need to make this configurable for low-memory hosts at some point.

Note that the memory usage shouldn't be as bad as it might look. As long as the BumpPtrAllocatorImpl slab size exceeds the threshold for a separate allocation (glibc malloc will use a separate mmap for anything >=128KiB), the virtual memory will be backed by physical memory only as it is actually touched. And even if it weren't, that 1MiB is nowhere near the 15MiB liblldb.so takes here.

I assume you experimented with many different values, and these are the best tradeoff?

I've tried a few and picked what seemed reasonable. 64KiB for the allocator didn't perform as well, but I don't think there's one good number, so I didn't really try to look for one. Also, I tried only my one specific use case, and this can't be benchmarked precisely. The logic was basically that for small projects this doesn't really matter, because they won't take up much memory anyway, and for those that do, this should perform better.

clayborg accepted this revision.Oct 8 2019, 1:32 PM

Nice! If we run into any issues we can make this a global debugger setting.

This revision was automatically updated to reflect the committed changes.

This patch actually made the RSS memory in our benchmarks go up by quite a bit (see lldb-bench). The reason is that while we only create 1MiB chunks in the new allocator, we have 256 different string pools as the backend for ConstString. So we actually allocate in steps of 1MiB * 256 = 256MiB, which is quite a lot. On a side note, we do use ConstString in lldb-debugserver, and allocating an additional 256MiB on some device running lldb-debugserver could end up being a serious problem, if my understanding is correct? (+Alex because IIRC he cares about the debug server memory footprint).

Anyway, I think we either make the slab allocation logic smarter than just creating 1MiB slabs and trust the kernel/libc to not actually allocate that much real memory (maybe something like the std::vector growth model) or we fix the fact that we always have 256 string pools.

joerg added a subscriber: joerg.Oct 15 2019, 5:07 PM

I'm a bit puzzled by the need for this change. The bump allocator already has logic to do power-of-two scaling for the allocation, so I wonder why it doesn't work properly here.

Oh, I wasn't aware we already had that in the allocator. I looked at the source, and it seems that we first need to create about 1024 slabs before we reach the 1MiB limit, which means we need 256 * 1024 = 262144 slabs in LLDB due to the 256 different string pools.

labath added a subscriber: labath.Oct 16 2019, 1:33 AM

Yeah, allocating 256MB sounds a bit too much, particularly for lldb-server (ideally, I'd remove ConstString from lldb-server completely, but that's a different story). The reason for 256 string pools was to avoid/reduce locking contention when accessing the pool, but that does make the power-of-2 scaling in BumpPtrAllocator kick in too slowly.

Maybe we could reduce the initial size by an order of magnitude here? Allocating 25MB (or whatever the closest power of 2 is) does not sound like too much, and it should still be much better than 4KiB. As an alternative, we could make the scaling in BumpPtrAllocator faster/configurable. I also doubt that the number of string pools was arrived at very scientifically, so it may be possible to tune that number too (though, without any data to back it up, it seems to me that reducing the slab size should have less impact than reducing the pool count).

Oh, I somehow completely missed the fact that there are 256 of those :-/. In that case my numbers are way too high. Can you easily benchmark with different numbers? I think something like 131072 for BumpPtrAllocator (still large enough for the mmap threshold) could be reasonable; the size for StringPool can go down to 256, or maybe can simply be dropped.

I'm a bit puzzled by the need for this change.

I could measure a difference in lldb symbol load times with the change (and it was enough to measure just with a stopwatch). And there's no "need" for the change, the logic was that it should improve situations that have many debug symbols and shouldn't matter for situations with few debug symbols. The assumption was that in realistic scenarios nobody would care that a debugging session needs 300KiB of memory instead of 100KiB. If I try to debug either a small test app or the whole clang binary, I can see a difference but nothing I'd care about in practice. What I care about though is that LLDB spends a significant portion of its startup time in ConstString.

The bump allocator already has logic to do power-of-two scaling for the allocation, so I wonder why it doesn't work properly here.

As already said, it does work, eventually, but the doubling kicks in slowly rather than on every slab, so it takes a while before the allocator figures out that it doesn't need to hammer the allocation routines so much.

Anyway, I think we either make the slab allocation logic smarter than just creating 1MiB slabs and trust the kernel/libc to not actually allocate that much real memory (maybe something like the std::vector growth model) or we fix the fact that we always have 256 string pools.

So, in what realistic scenarios is this change a problem? Maybe it'd be enough to just have #if defined(ANDROID) || defined(ARM) || defined(WHATEVER) and not overcomplicate this.

Thinking more about this, maybe this is really not the right place and the change should be done in BumpPtrAllocator? To me it seems rather unreasonable that it would double the allocation size only after 128 allocations. Surely by the time it has already allocated 128*4096=0.5MiB memory it could have decided to raise the allocation size much sooner?

Thinking more about this, maybe this is really not the right place and the change should be done in BumpPtrAllocator? To me it seems rather unreasonable that it would double the allocation size only after 128 allocations. Surely by the time it has already allocated 128*4096=0.5MiB memory it could have decided to raise the allocation size much sooner?

That is my impression as well, but beware that BumpPtrAllocator has many users, and they are generally more performance-sensitive and better tuned than lldb, so it may turn out that this value actually *is* the sweet spot for something. Nevertheless, I think it's worth asking around or trying to send a patch.

However, independently of that, I think it would make sense to raise the initial allocation size in ConstString somewhat (just not straight to 1 (256) meg).

(Seems like my previous comment was cut off in the middle for some reason?)

We could also just let the allocator take a parameter so that it increases the growth size every Nth slab (N=1 for us) and set a slightly larger starting size.

By the way, if I look at the flame graph for lldb starting up and setting a break point in an LLDB debug binary, it seems that on the benchmark server we spend less than 0.5% of the startup time in that allocator function and much more time in the other ConstString overhead (the double-hashing to find the StringPool, the different locking). I'm curious why the allocator logic is so disproportionately slow on your setup (0.5% vs 10-20%). The only real work we do in the allocator is calling malloc, so I assume calling malloc is much more expensive on your system?

Can you easily benchmark with different numbers?

Sadly not, but just running LLDB to attach to an LLDB debug build and running this lldb command list should replicate the most important benchmark.

We could also just let the allocator take a parameter so that is increases the growth size to do it every Nth slab (N=1 for us) and set a slightly larger starting size.

Heh, I wasn't exactly planning on getting into hacking base LLVM classes with this seemingly trivial commit. Oh well, a patch is at https://reviews.llvm.org/D69050 . If it gets accepted, I'll change ConstString to use BumpPtrAllocatorImpl<llvm::MallocAllocator, 16384, 16384, 1> and change the StringPool construction back to not having an explicit default.

By the way, if I look at the flame graph for lldb starting up and setting a break point in an LLDB debug binary, it seems that on the benchmark server we spent less than 0.5% of the startup time in that allocator function and much more time in that other ConstString overhead (the double-hashing to find the StringPool, the different locking). I'm curious why the allocator logic is so disproportionally slow on your setup (0.5% vs 10-20%). The only real work we do in the allocator is calling malloc, so I assume calling malloc is much more expensive on your system?

It is more expensive, relatively speaking. I used -gpubnames, so my profile wastes no time in ManualDWARFIndex. And if your "LLDB debug binary" implies a non-optimized build, note that I tested with an optimized build, which of course changes things. And finally, I tested with a debug LibreOffice build, which means tons of debug data, more than with Clang or LLDB, but processed quickly thanks to -gpubnames. Apparently at that point the costs of malloc, kernel memory handling, etc. become measurable *shrug*. The cost of ConstString hashing etc. is even larger, but there I don't see as simple a solution as tweaking the allocator.