
[memprof] Replace the block cache with a hashmap.
Needs ReviewPublic

Authored by snehasish on Tue, Oct 12, 1:28 PM.

Details

Summary

The existing implementation uses a cache + eviction scheme to record
heap profile information. This design was adopted to ensure a constant
memory overhead (due to the fixed number of cache entries) along with
incremental write-to-disk on eviction. We find that, since the number
of entries to track is O(unique-allocation-contexts), the overhead of
keeping all contexts in memory is not very high. On a clang workload,
the maximum number of unique allocation contexts was ~35K, the median ~11K.
For each context we (currently) store 64 bytes of data, which amounts
to 5.5MB (max). Given the low overhead for a complex workload, we can
simplify the implementation by using a hashmap without eviction.

Also refactored the MemInfoBlock struct out into a separate file so
that we can refer to it from a different module. Longer term, this
will be replaced by a .inc file shared between compiler-rt and
llvm/ProfileData.
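The eviction-free scheme described above can be sketched as a mutex-protected hash map keyed by allocation context id. The names and types below are illustrative stand-ins, not the actual memprof internals:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustrative stand-in for the real MemInfoBlock; the actual struct
// lives in compiler-rt and stores ~64 bytes of allocation statistics.
struct MemInfoBlockSketch {
  uint64_t AllocCount = 0;
  uint64_t TotalSize = 0;
  void Merge(const MemInfoBlockSketch &Other) {
    AllocCount += Other.AllocCount;
    TotalSize += Other.TotalSize;
  }
};

// Hypothetical map type: every unique allocation context gets one
// entry and nothing is ever evicted, so memory is O(unique contexts).
class MIBMapSketch {
  std::mutex Mu;
  std::unordered_map<uint64_t, MemInfoBlockSketch> Map;

public:
  void InsertOrMerge(uint64_t Id, const MemInfoBlockSketch &Block) {
    std::lock_guard<std::mutex> Lock(Mu);
    auto It = Map.find(Id);
    if (It == Map.end())
      Map.emplace(Id, Block);  // first time we see this context
    else
      It->second.Merge(Block); // subsequent allocations: merge stats
  }

  size_t Size() {
    std::lock_guard<std::mutex> Lock(Mu);
    return Map.size();
  }

  uint64_t CountFor(uint64_t Id) {
    std::lock_guard<std::mutex> Lock(Mu);
    return Map[Id].AllocCount;
  }
};
```

A single global mutex keeps the sketch simple; the review discussion below is largely about how to lock this structure with less contention.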

Diff Detail

Event Timeline

snehasish created this revision.Tue, Oct 12, 1:28 PM
snehasish requested review of this revision.Tue, Oct 12, 1:28 PM
Herald added a project: Restricted Project.Tue, Oct 12, 1:28 PM
Herald added a subscriber: Restricted Project.
snehasish updated this revision to Diff 379177.Tue, Oct 12, 2:01 PM

Clean up some rebase issues.

vitalybuka added inline comments.Tue, Oct 12, 4:56 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
256–262

You can fix the race between Print and Merge by unlocking the allocator later.

However, it does not solve the Merge vs Merge race.

293

Previously we had a lock here.

compiler-rt/lib/memprof/memprof_mibmap.h
6

AddrHashMap is useful if we need to delete items or if the key is a full 64-bit address.
D111608, which landed just today, makes alloc_context_id very sequential [0..2^31), so you can use a simpler structure:

```cpp
using MIBMapTy =
    TwoLevelMap<MemInfoBlock, StackDepot::kNodesSize1, StackDepot::kNodesSize2>;

inline void InsertOrMerge(const uptr Id, const MemInfoBlock &Block,
                          MIBMapTy &Map) {
  Map[Id].Merge(Block);
}

void Print(const MIBMapTy &Map) {
  for (u32 I = 0; Map.contains(I); ++I)
    Map[I].Print();
}
```

After that you have the same behavior without the ForEach implementation.
The Merge vs Merge race is still not solved.

20

Handle does not lock the contents of MemInfoBlock, so Merge is a data race.

snehasish updated this revision to Diff 379547.Wed, Oct 13, 4:11 PM

Address merge data race.

snehasish updated this revision to Diff 379549.Wed, Oct 13, 4:19 PM
snehasish marked an inline comment as done.

Address comments.

compiler-rt/lib/memprof/memprof_allocator.cpp
256–262

Good idea. Thanks!

compiler-rt/lib/memprof/memprof_mibmap.h
6

I'm very hesitant to rely on the sequential ordering of allocation context ids. I think it would be best for memprof to treat the id as an opaque value so that the underlying implementation can evolve. For example, memprof may eventually store additional data which is not indexed by a calling context.

So I would prefer it if we continue with the ForEach implementation in the parent patch unless you have a strong opinion.

20

Added locking in InsertOrMerge based on the context id. PTAL.
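Locking based on the context id, as described here, is a lock-striping pattern: the mutex guarding an entry is picked from a small fixed pool by hashing the id. A minimal sketch (the names and the pool size of 8 are assumptions for illustration, not the patch's actual identifiers):

```cpp
#include <cstdint>
#include <mutex>

// Illustrative lock striping: independent contexts usually take
// different locks, but ids that collide modulo the pool size share a
// mutex, which is the contention concern raised in this thread.
constexpr uint64_t kNumStripes = 8; // assumed pool size
std::mutex Stripes[kNumStripes];

inline uint64_t StripeFor(uint64_t Id) { return Id % kNumStripes; }

// Usage: lock Stripes[StripeFor(Id)] around the insert-or-merge of Id.
```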

snehasish added inline comments.Wed, Oct 13, 5:03 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
256–262

Actually, I think this does not solve the Merge vs Print race. The InsertOrMerge call in the Deallocate hook (L633 in the original source) is not protected by a lock. We could add more synchronization to ensure that we only start "destructing" after all other threads have finished running Deallocate, but I don't think that's necessary. It's ok to be a little lossy in this case. @tejohnson Wdyt?

vitalybuka added inline comments.Wed, Oct 13, 5:53 PM
compiler-rt/lib/memprof/memprof_mibmap.cpp
24

With such hashing (Idx -> 2^8), different cores will very likely compete for the same mutex, making this as bad as a single global mutex.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?

Then you can lock the same mutex in Print, solving all the issues above.
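The per-block mutex suggestion could look roughly like this. This is an illustrative sketch with hypothetical names, not the memprof code; a real sanitizer implementation would use the runtime's own spin mutex rather than std::atomic directly:

```cpp
#include <atomic>
#include <cstdint>

// A 1-byte spin lock embedded in each block, so Merge and Print
// serialize on the same per-entry mutex.
class SpinMutex1B {
  std::atomic<uint8_t> State{0}; // 0 = unlocked, 1 = locked

public:
  void Lock() {
    while (State.exchange(1, std::memory_order_acquire) != 0) {
      // spin; a real implementation would pause/yield here
    }
  }
  void Unlock() { State.store(0, std::memory_order_release); }
};

struct LockedMIB {
  SpinMutex1B Mu; // adds a single byte (plus padding) per entry
  uint64_t AllocCount = 0;

  void Merge(uint64_t Count) {
    Mu.Lock();
    AllocCount += Count; // Merge vs Merge: serialized by Mu
    Mu.Unlock();
  }

  uint64_t Read() { // Print would lock the same mutex
    Mu.Lock();
    uint64_t C = AllocCount;
    Mu.Unlock();
    return C;
  }
};
```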

compiler-rt/lib/memprof/memprof_mibmap.h
6

I would recommend keeping it simple, which is TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

With such hashing (Idx -> 2^8), different cores will very likely compete for the same mutex, making this as bad as a single global mutex.

Yes, the contention here (on 8 locks) is higher than in the prior implementation, which used a lock per cache set (the default number of sets is ~16K). However, I ran this implementation and compared the overheads on a large internal workload and found them to be roughly the same order of magnitude, i.e. both implementations (hashmap and prior cache-based) are 2X-3X faster than PGO.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?
Then you can lock the same mutex in Print, solving all the issues above.

That's a neat idea. For context: we intend to share the MemInfoBlock definition across compiler-rt and llvm/ProfileData using a .inc style header (details in the RFC [1]). I will evaluate the performance benefit of the added byte to see whether it makes sense. Thanks for the suggestion!

I would recommend keeping it simple, which is TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

I agree that the evolution of memprof will likely outgrow the current choice. I am still in favour of the AddrHashMap extension since it allows us to not rely on sequential context ids. I will defer to Teresa for making the decision.

[1] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html


I'm also hesitant to rely on having sequential ids in this code.

compiler-rt/lib/memprof/memprof_allocator.cpp
256–262

Yes, I think we can accept some loss in case of a race here. It isn't a new issue with this patch in any case. If necessary we can add more synchronization later.

compiler-rt/lib/memprof/memprof_mibmap.h
10

I'm skeptical that 199 is large enough to avoid too much performance overhead from collisions and the resulting traversal. Why is it so much smaller than the default MIB cache size?
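To put this concern in numbers, assuming 199 is the bucket count and hashing is roughly uniform (both inferences from the comment, not stated in the patch), the context counts from the summary give average chain lengths of roughly:

```cpp
// Back-of-envelope estimate; context counts come from the summary above.
constexpr unsigned kBuckets = 199;
constexpr unsigned kMedianContexts = 11000; // ~11K median unique contexts
constexpr unsigned kMaxContexts = 35000;    // ~35K max unique contexts
constexpr unsigned kMedianChain = kMedianContexts / kBuckets; // ~55 entries
constexpr unsigned kMaxChain = kMaxContexts / kBuckets;       // ~175 entries
```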

vitalybuka added inline comments.Thu, Oct 14, 10:16 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
256–262

Having a 1-byte mutex in MemInfoBlock solves both the Print/Merge and Merge/Merge issues.