This is an archive of the discontinued LLVM Phabricator instance.

[Symbolize] LRU cache binaries in llvm-symbolizer.
Closed · Public

Authored by mysterymath on Feb 14 2022, 2:17 PM.

Details

Summary

This change adds a simple LRU cache to the Symbolize class to put a cap
on llvm-symbolizer memory usage. Previously, the Symbolizer's virtual
memory footprint would grow without bound as additional binaries were
referenced.

I'm putting this out there early for an informal review, since there may be
a dramatically different/better way to go about this. I still need to
figure out a good default constant for the memory cap and benchmark the
implementation against a large symbolization workload. Right now I've
pegged max memory usage at zero for testing purposes, which evicts the whole
cache every time.

Unfortunately, it looks like StringRefs in the returned DI objects can
directly refer to the contents of binaries. Accordingly, the cache
pruning must be explicitly requested by the caller, as the caller must
guarantee that none of the returned objects will be used afterwards.

For llvm-symbolizer this is a light burden; symbolization occurs
line-by-line, and the returned objects are discarded after each.
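
In sketch form, the usage contract looks something like the following, assuming the pruneCache() entry point this change adds; the driver loop and names here are illustrative, not llvm-symbolizer's actual code:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/DebugInfo/Symbolize/Symbolize.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;
using namespace llvm::symbolize;

// Symbolize a stream of addresses one request at a time, pruning between
// requests. The returned DILineInfo may hold StringRefs into cached binary
// contents, so it must be fully consumed before the prune.
void symbolizeAndPrune(LLVMSymbolizer &Symbolizer, StringRef Binary,
                       ArrayRef<uint64_t> Addresses) {
  for (uint64_t Addr : Addresses) {
    object::SectionedAddress Offset;
    Offset.Address = Addr;
    if (Expected<DILineInfo> Info =
            Symbolizer.symbolizeCode(Binary.str(), Offset))
      outs() << Info->FunctionName << " at " << Info->FileName << ":"
             << Info->Line << "\n";
    else
      consumeError(Info.takeError());
    // Safe point: nothing returned above is referenced past this line.
    Symbolizer.pruneCache();
  }
}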

Implementation-wise, there are a number of nested caches that depend
on one another. I've implemented a simple Evictor callback system to
allow derived caches to register eviction actions to occur when the
underlying binaries are evicted.
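
A minimal sketch of the eviction-callback idea; the Evictor name follows the description above, but the surrounding structure is hypothetical rather than this patch's exact code:

#include <functional>
#include <utility>
#include <vector>

using Evictor = std::function<void()>;

class BinaryCache {
  std::vector<Evictor> EvictionActions;

public:
  // A derived cache (e.g., one holding parsed ObjectFiles or DWARF data)
  // registers cleanup to run when the underlying binary is evicted, so
  // dependent entries die together with the memory they point into.
  void addEvictionAction(Evictor E) {
    EvictionActions.push_back(std::move(E));
  }

  void evictBinary() {
    for (Evictor &E : EvictionActions)
      E(); // Drop derived entries referring to the binary's contents.
    EvictionActions.clear();
    // ...then release the binary's own mapping.
  }
};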

Diff Detail

Event Timeline

mysterymath created this revision. Feb 14 2022, 2:17 PM

Clang format.

Update commit msg.

mysterymath edited the summary of this revision. Feb 14 2022, 2:19 PM
mysterymath edited the summary of this revision. Feb 14 2022, 2:29 PM
mysterymath added reviewers: phosek, dblaikie.
mysterymath edited the summary of this revision. Feb 14 2022, 2:31 PM

Fix unnecessary diff.

mysterymath published this revision for review. Feb 14 2022, 2:34 PM
Herald added a project: Restricted Project. Feb 14 2022, 2:34 PM

I thought @netforce had already prototyped something like this? Could you link to the previous review/discussion on the subject if you can find it? (or maybe it was only internally discussed)

Ah, I think I stumbled across this: https://reviews.llvm.org/D78950

Perhaps you could commandeer/pick up https://reviews.llvm.org/D90006 ? or summarize in this (or the other, D78950) that historical context/what's new/different/justified in this new review?

The context for this change is to try to fix OOMs discovered when symbolizing system logs with llvm-symbolizer.
As is, llvm-symbolizer essentially leaks memory when using its STDIN format. Input is specified a line at a time, but data is never cleaned up from one line to the next. Eventually, you have to either stop feeding it lines or restart the job.

It seems like the Symbolizer is the right place to address this; it *could* mmap in and parse the debug binaries anew for each symbolization request, but it doesn't, because it's more efficient to keep this data around. At least, right up until physical memory is exhausted.
Accordingly, it shouldn't really matter what kind of derived data the Symbolizer keeps; everything should be fair game for eviction; otherwise whatever we don't try to evict will leak from line to line.

Does this solution subsume the previous patches? What does it provide that they didn't/why this rather than those directions? (does it not subsume that functionality? (ie: it addresses some issues, but not some/all of the issues the other patches were interested in) if so, it'd be unfortunate to have to solve related issues in two different places/ways - would be good to understand why we can't find a common solution to these issues)

mysterymath added a comment. Edited Feb 15 2022, 12:35 PM

It doesn't look like the memory usage improvements of either patch are a proper subset of the other's.

This patch operates file-at-a-time; it ejects all of the debug information for a given file at once. This includes data that isn't covered by the other patch: PDB debug info, the lists and ancillary data structures parsed in the ObjectFiles, and the contents of the binary files themselves (which may not have been mmapped if they were small).
It should completely subsume the other patch with regard to long-term leak protection; everything should be discarded between symbolization requests.

However, since the other patch operates at a finer granularity, it can reduce memory usage even with only a single binary present, as in the example provided in https://reviews.llvm.org/D90006, which showed a reduction when symbolizing clang alone.

If we start with coarse-grained caching, we could replace caching for objects like DWARFContext with finer grained caching as it becomes available.
If we instead start with finer-grained caching on a subset of the data (DWARFContext), it would help reduce the leakiness of llvm-symbolizer, but it kicks the can down the road. I'm not sure exactly how far, though; maybe quite far.

Thanks for summarizing the differences. It'd be nice if we could do this once/figure out a general implementation that covers both scenarios, but I'm not sure how practical that is.

Do you have some performance numbers? What if we didn't cache/had a cache size of 0? (does the caching logic buy us enough performance to justify it)

Inline comments were left at:
llvm/include/llvm/DebugInfo/Symbolize/Symbolize.h:228
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp:541

Do you have some performance numbers? What if we didn't cache/had a cache size of 0? (does the caching logic buy us enough performance to justify it)

I did some poking around, and it looks like restarting the job after every request was tried as a workaround.
This choked on large binaries, and they had to change to restarting every N requests.

I took a large debug binary, Clang, and symbolized 100 random symbols from it.

Zero-size LRU:

real    4m44.289s
user    4m27.506s
sys     0m16.720s

No LRU, flush after each request:

real    4m41.313s
user    4m24.575s
sys     0m16.666s

Unbounded LRU:

real    0m4.309s
user    0m3.842s
sys     0m0.471s

HEAD:

real    0m4.505s
user    0m3.982s
sys     0m0.526s

Conversely, we can look at how much overhead the LRU order maintenance adds to 100,000 very easily cacheable requests (looking up main in int main(void) { return 0; }).

Unbounded LRU:

real    0m0.421s
user    0m0.305s
sys     0m0.120s

HEAD:

real    0m0.410s
user    0m0.334s
sys     0m0.079s
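
For reference, the LRU order maintenance being measured here can be kept cheap: with a recency list plus an index into it, each recordAccess is an O(1) splice. A minimal sketch with illustrative names, not the patch's actual code:

#include <list>
#include <string>
#include <unordered_map>

class LRUOrder {
  std::list<std::string> Order; // Front = most recently used binary.
  std::unordered_map<std::string, std::list<std::string>::iterator> Pos;

public:
  void recordAccess(const std::string &Binary) {
    auto It = Pos.find(Binary);
    if (It != Pos.end())
      Order.splice(Order.begin(), Order, It->second); // O(1) move-to-front.
    else
      Pos[Binary] = Order.insert(Order.begin(), Binary);
  }
};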

Looking at the zero-size LRU, I think we'd probably want cache sizes to be soft.
If a large binary is symbolized, and the binary is over the cap, then it's guaranteed to be evicted from the cache after each request, even though this doesn't lower the peak memory usage of the binary.
Instead, we can inflate the cap to the size of the largest individual binary that has been symbolized; this is the high water mark of memory usage overall.
This would let us pick a relatively low default cap, say 500MiB, without harming performance when symbolizing larger binaries.
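
A sketch of that soft-cap behavior; CacheEntry, LRUBinaries, and the size accounting are illustrative names, not this patch's actual identifiers:

#include <algorithm>
#include <cstdint>
#include <list>

struct CacheEntry {
  uint64_t Size; // Memory attributed to this binary and its derived data.
  // ...handles to the mapped binary and its eviction actions...
};

void pruneCache(std::list<CacheEntry> &LRUBinaries, uint64_t MaxSize,
                uint64_t &LargestSeen, uint64_t &TotalSize) {
  for (const CacheEntry &E : LRUBinaries)
    LargestSeen = std::max(LargestSeen, E.Size);
  // The effective cap inflates to the largest single binary ever cached, so
  // one oversized binary is not pointlessly evicted after every request.
  uint64_t Cap = std::max(MaxSize, LargestSeen);
  // Evict from the least recently used end until we fit under the cap.
  while (TotalSize > Cap && !LRUBinaries.empty()) {
    TotalSize -= LRUBinaries.back().Size;
    LRUBinaries.pop_back();
  }
}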

Also, I realize that these examples are synthetic, but they should help characterize the edge case behaviors at least. I'll post back with a real example from system logs soon.

If a large binary is symbolized, and the binary is over the cap, then it's guaranteed to be evicted from the cache after each request, even though this doesn't lower the peak memory usage of the binary.

Perhaps the cache eviction could be implemented differently? Only evict if the cache size is exceeded and a new item is about to be added? (that does mean needing to know the size of the new item before it's manifest, though, which I guess is a bunch of work/more code... )

Or would it be OK to evict the last thing even as the next thing is being added to the cache - before all the derived objects are created from the new object? Would that be adequate? (so we could move eviction to the "add item to cache" piece, which would let an item larger than the cache size persist through multiple queries - since eviction wouldn't happen until a cache insertion was performed when the cache already exceeded the cache size?)

I think it'd be pretty reasonable to load in the second binary before we evict the first; mapping in two large binaries isn't all that much worse than one, especially since there currently isn't any bound at all.
If that's OK, we can get this effect by just always keeping the most recently used binary around. This should be much simpler to code, and it'll only slightly delay evicting the binary, until the next pruneCache() call.

Hmm, not sure I'm following why that'd be easier to code - is the complexity that one symbolizer query (the current cache-invalidation granularity proposed in this patch, if I understand correctly) may need more than one binary loaded, so we can't invalidate at any finer granularity, because an invalidated element might still be referenced when a single query needs more than one thing kept alive?

Sorry, maybe I'm just a bit brain-fried at the moment and having trouble following something. I /think/ I'm imagining two things, perhaps the one you're proposing already - pruneCache stopping at LRUBinaries.size() == 1 instead of empty, and possibly doing pruneCache in the recordAccess call instead of having a separate call outside in llvm-symbolizer.cpp? I guess that comes back to the point you made in the review that the invalidation has to be in the client because otherwise stuff will be invalidated out from underneath the client's use case.

Yeah, maybe just changing the empty check to the <= 1 test.
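
In terms of the soft-cap sketch above, that keep-MRU tweak is just the loop bound (again illustrative, not the patch's actual code):

// Prune down to one remaining (most recently used) binary instead of zero,
// so a binary larger than the cap survives across consecutive requests.
while (TotalSize > Cap && LRUBinaries.size() > 1) { // was: !LRUBinaries.empty()
  TotalSize -= LRUBinaries.back().Size;
  LRUBinaries.pop_back();
}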

>>! In D119784#3339036, @dblaikie wrote:

[...] possibly doing pruneCache in the recordAccess call instead of having a separate call outside in llvm-symbolizer.cpp? I guess that comes back to the point you made in the review that the invalidation has to be in the client because otherwise stuff will be invalidated out from underneath the client's use case.

Yeah, there is actually a safe way to do that; it's a bit hairier. You'd have pruneCache mark the cache as "safe to prune", then have the code that adds a new item to the cache run the prune if that flag were set. If anything was accessed in the meantime, it'd set the safe flag to false, since at that point a memory reference might escape. I was thinking that setting the flag to false would be another thing you'd have to make sure to do right, or else bugs would result; but now that you mention it, that could go in the recordAccess call.

That should be a backwards compatible change though, so I'm inclined to just whip up the size() != 1 version and come back to this later.
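
For the record, a hypothetical sketch of the deferred "safe to prune" scheme described above (it was not implemented in this patch):

class Cache {
  bool SafeToPrune = false;

public:
  // Caller promises no references into the cache are still live; the
  // actual eviction is deferred until the next insertion.
  void pruneCache() { SafeToPrune = true; }

  void recordAccess(/* entry */) {
    SafeToPrune = false; // A returned object may now point into the cache.
    // ...move the entry to the MRU position...
  }

  void insert(/* new entry */) {
    if (SafeToPrune) {
      // ...evict LRU entries down to the cap...
      SafeToPrune = false;
    }
    // ...add the new entry...
  }
};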

dblaikie accepted this revision. Feb 23 2022, 12:05 PM

Fair enough - sounds good.

This revision is now accepted and ready to land. Feb 23 2022, 12:05 PM
mysterymath marked an inline comment as done.
  • Always keep the MRU binary in the cache. This keeps it from thrashing on subsequent requests.
  • Set the default cache size to 512 MiB on 32-bit hosts and 4 GiB on 64-bit hosts. This is intended to be a conservative cap, as the current implementation has no limits. (A sketch of one way to pick such a default follows this list.)
  • Add an llvm-symbolizer flag to change the cache size.
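
A sketch of how such a word-size-dependent default could be chosen; the actual flag name and plumbing in llvm-symbolizer may differ:

#include <cstdint>

constexpr uint64_t MiB = 1024 * 1024;
constexpr uint64_t GiB = 1024 * MiB;

// A conservative cap relative to today's unbounded growth: stay well within
// a 32-bit host's address space, allow more headroom on 64-bit hosts.
constexpr uint64_t DefaultMaxCacheSize =
    sizeof(void *) == 8 ? 4 * GiB : 512 * MiB;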

Here are the results from a real production symbolization workflow, looped a few times:

LRU, Size 0, Evict all:

real    0m3.277s
user    0m3.098s
sys     0m0.183s

LRU, Size 0, Keep MRU:

real    0m0.598s
user    0m0.515s
sys     0m0.087s

LRU, 4 GiB:

real    0m0.208s
user    0m0.123s
sys     0m0.089s

HEAD:

real    0m0.182s
user    0m0.131s
sys     0m0.054s

Remove redundant if(Bin) check.

Use find over operator[]; it measures very slightly faster, and it's more
consistent with the surrounding code.

This revision was landed with ongoing or failed builds. Feb 24 2022, 4:32 PM
This revision was automatically updated to reflect the committed changes.