This is an archive of the discontinued LLVM Phabricator instance.

make ConstString allocate memory in non-tiny chunks
ClosedPublic

Authored by llunak on Oct 5 2019, 7:46 PM.

Details

Summary

BumpPtrAllocator allocates in 4KiB chunks, which with any larger project is going to result in a large number of allocations. Increasing allocation size this way can save 10%-20% of symbol load time for a huge C++ project with correctly built debuginfo (specifically, attaching to LibreOffice Calc built with -gpubnames -gdwarf-aranges goes down from ~22s to ~18s).

Diff Detail

Event Timeline

llunak created this revision.Oct 5 2019, 7:46 PM
llunak updated this revision to Diff 223414.Oct 6 2019, 6:39 AM

Increased default number of buckets for StringMap too.

aprantl accepted this revision.Oct 8 2019, 10:11 AM

This makes sense to me. We may need to make this configurable for low-memory hosts at some point.

This revision is now accepted and ready to land.Oct 8 2019, 10:11 AM

I assume you experimented with many different values, and these are the best tradeoff?

This makes sense to me. We may need to make this configurable for low-memory hosts at some point.

Note that the memory usage shouldn't be as bad as it might look. As long as the BumpPtrAllocatorImpl size exceeds the threshold for a separate allocation (glibc malloc will use separate mmap for anything >=128KiB), memory for the virtual memory will be really allocated only if needed. And even if, that 1MiB is nowhere near the 15MiB liblldb.so takes here.

I assume you experimented with many different values, and these are the best tradeoff?

I've tried few and picked what seemed reasonable. 64KiB for the allocator didn't perform as well, but then I don't think there's one good number, so I didn't really try to look for one. Also I tried just with my one specific use case, and this cannot be benchmarked precisely. The logic was basically just that for small projects this doesn't really matter because they won't take up much memory anyway and for those that do this should perform better.

clayborg accepted this revision.Oct 8 2019, 1:32 PM

Nice! If we run into any issues we can make this a global debugger setting.

This revision was automatically updated to reflect the committed changes.

This patch actually made the RSS memory in our benchmarks go up by quite a bit (see lldb-bench ). The reason is that while we only create 1MiB chunks in the new allocator, we do have 256 different string pools we use as the backend for ConstString. So this means that we actually do allocation steps of 1MiB * 256 = 256MiB which is quite a lot. On a side note, we do use ConstString in the lldb-debugserver, and allocating additional 256MiB on some device running lldb-debugserver could end up being a serious problem if my understanding is correct? (+Alex because IIRC he cares about the debug server memory footprint).

Anyway, I think we either make the slab allocation logic smarter than just creating 1MiB slabs and trust the kernel/libc to not actually allocate that much real memory (maybe something like the std::vector growth model) or we fix the fact that we always have 256 string pools.

joerg added a subscriber: joerg.Oct 15 2019, 5:07 PM

I'm a bit puzzled by the need for this change. The bump allocator already has logic to do power-of-two scaling for the allocation, so I wonder why it doesn't work properly here.

Oh, I wasn't aware we already had that in the allocator. I looked at the source and it seems that we need to first create about 1024 slabs before we reach the 1MiB limit, which means we need 256 * 1024 = 262144 slabs in LLDB due to the 256 different string pools.

labath added a subscriber: labath.Oct 16 2019, 1:33 AM

Yeah, allocating 256MB sounds a bit too much, particularly for lldb-server (ideally, I'd remove ConstString from lldb-server completely, but that's a different story). The reason for 256 string pools was to avoid/reduce locking contention when accessing the pool, but that does make the power-of-2 scaling in BumpPtrAllocator kick in too slowly.

Maybe we could reduce the initial size by an order of magnitude here? Allocating 25MB (or whatever is the closest power of 2) does not sound like too much, and it should still be much better than 4kb. As an alternative, we could make the scaling in BumpPtrAllocator faster/configurable. I also doubt that the number of string pools was arrived at very scientifically, so it may be possible to tune that number too (though, without any data back it up, it seems to me that reducing the slab size should have less impact than reducing the pool count).

Oh, I somehow completely missed the fact that there are 256 of those :-/. In that case my numbers are way too much. Can you easily benchmark with different numbers? I think something like 131072 for BumpPtrAllocator (to still be large enough for the mmap threshold) could be reasonable, the size for StringPool can go down to 256 or maybe can be simply dropped.

I'm a bit puzzled by the need for this change.

I could measure a difference in lldb symbol load times with the change (and it was enough to measure just with a stopwatch). And there's no "need" for the change, the logic was that it should improve situations that have many debug symbols and shouldn't matter for situations with few debug symbols. The assumption was that in realistic scenarios nobody would care that a debugging session needs 300KiB of memory instead of 100KiB. If I try to debug either a small test app or the whole clang binary, I can see a difference but nothing I'd care about in practice. What I care about though is that LLDB spends a significant portion of its startup time in ConstString.

The bump allocator already has logic to do power-of-two scaling for the allocation, so I wonder why it doesn't work properly here.

As already said, it does work, eventually, but it's not power-of-two and so it takes its time before it figures out that it doesn't need to hammer on the allocation routines so much.

Anyway, I think we either make the slab allocation logic smarter than just creating 1MiB slabs and trust the kernel/libc to not actually allocate that much real memory (maybe something like the std::vector growth model) or we fix the fact that we always have 256 string pools.

So, in what realistic scenarios is this change a problem? Maybe it'd be enough to just have #if defined(ANDROID) || defined(ARM) || defined(WHATEVER) and not overcomplicate this.

Thinking more about this, maybe this is really not the right place and the change should be done in BumpPtrAllocator? To me it seems rather unreasonable that it would double the allocation size only after 128 allocations. Surely by the time it has already allocated 128*4096=0.5MiB memory it could have decided to raise the allocation size much sooner?

Thinking more about this, maybe this is really not the right place and the change should be done in BumpPtrAllocator? To me it seems rather unreasonable that it would double the allocation size only after 128 allocations. Surely by the time it has already allocated 128*4096=0.5MiB memory it could have decided to raise the allocation size much sooner?

That is my impression as well, but beware that BumpPtrAllocator has many users, and a they are generally more performance-sensitive and better tuned than lldb, so it may turn out that this value actually *is* the sweet spot for something. Nevertheless, I think it's worth asking around or trying to send a patch.

However, independently of that, I think it would make sense to raise the initial allocation size in ConstString somewhat (just not straight to 1 (256) meg).

(Seems like my previous comment was cut off in the middle for some reason?)

We could also just let the allocator take a parameter so that is increases the growth size to do it every Nth slab (N=1 for us) and set a slightly larger starting size.

By the way, if I look at the flame graph for lldb starting up and setting a break point in an LLDB debug binary, it seems that on the benchmark server we spent less than 0.5% of the startup time in that allocator function and much more time in that other ConstString overhead (the double-hashing to find the StringPool, the different locking). I'm curious why the allocator logic is so disproportionally slow on your setup (0.5% vs 10-20%). The only real work we do in the allocator is calling malloc, so I assume calling malloc is much more expensive on your system?

Can you easily benchmark with different numbers?

Sadly not, but just running LLDB to attach to a LLDB debug build and running this lldb command list should replicate the most important benchmark.

We could also just let the allocator take a parameter so that is increases the growth size to do it every Nth slab (N=1 for us) and set a slightly larger starting size.

Heh, I wasn't exactly planning on getting into hacking base LLVM classes with this seemingly trivial commit. Oh well, a patch is at https://reviews.llvm.org/D69050 . If it gets accepted, I'll change ConstString to use BumpPtrAllocatorImpl<llvm::MallocAllocator, 16384, 16384, 1> and change the StringPool construction back to not having an explicit default.

By the way, if I look at the flame graph for lldb starting up and setting a break point in an LLDB debug binary, it seems that on the benchmark server we spent less than 0.5% of the startup time in that allocator function and much more time in that other ConstString overhead (the double-hashing to find the StringPool, the different locking). I'm curious why the allocator logic is so disproportionally slow on your setup (0.5% vs 10-20%). The only real work we do in the allocator is calling malloc, so I assume calling malloc is much more expensive on your system?

It is more expensive, relatively speaking. I used -gpubnames, so my profile wastes no time in ManualDWARFIndex. And if your "LLDB debug binary" implies non-optimized, then I tested with an optimized build, which of course changes things. And finally I tested with debug LibreOffice build, which means tons of debug data, more than with Clang or LLDB, but processed quickly thanks to -gpubnames. Apparently at that point malloc, kernel memory handling, etc. costs can become measurable *shrug*. The cost of ConstString hashing etc. is even larger, but there I don't see as simple solution as tweaking the allocator.