This is an archive of the discontinued LLVM Phabricator instance.

[memprof] Replace the block cache with a hashmap.
ClosedPublic

Authored by snehasish on Oct 12 2021, 1:28 PM.

Details

Summary

The existing implementation uses a cache + eviction based scheme to
record heap profile information. This design was adopted to ensure a
constant memory overhead (due to fixed number of cache entries) along
with incremental write-to-disk for evictions. We find that since the
number of entries to track is O(unique-allocation-contexts), the overhead
of keeping all contexts in memory is not very high. On a clang workload,
the max number of unique allocation contexts was ~35K, median ~11K.
For each context, we (currently) store 64 bytes of data - this amounts
to 5.5MB (max). Given the low overheads for a complex workload, we can
simplify the implementation by using a hashmap without eviction.
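
To make the scheme concrete, below is a minimal sketch of an insert-or-merge map keyed by allocation context id. It uses STL containers and a single lock purely for illustration; the actual runtime uses sanitizer_common data structures and different locking, and all names here are stand-ins.

// Simplified illustration of the no-eviction scheme: every unique
// allocation context id maps to a single MemInfoBlock that is merged
// in place. Types and locking are stand-ins, not the real runtime code.
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct MemInfoBlock {
  uint64_t AllocCount = 0;
  uint64_t TotalSize = 0;
  // ... remaining fields elided ...
  void Merge(const MemInfoBlock &Other) {
    AllocCount += Other.AllocCount;
    TotalSize += Other.TotalSize;
  }
};

class MIBMap {
public:
  // Insert a new block for Id, or merge Block into the existing entry.
  void InsertOrMerge(uint64_t Id, const MemInfoBlock &Block) {
    std::lock_guard<std::mutex> L(Mu);
    auto It = Map.find(Id);
    if (It == Map.end())
      Map.emplace(Id, Block);
    else
      It->second.Merge(Block);
  }

private:
  std::mutex Mu;
  std::unordered_map<uint64_t, MemInfoBlock> Map;
};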

Also refactored the MemInfoBlock struct out into a separate file so that
we can refer to it from a different module. Longer term, this
will be replaced by a .inc file shared between compiler-rt and
llvm/ProfileData.

Diff Detail

Event Timeline

snehasish created this revision.Oct 12 2021, 1:28 PM
snehasish requested review of this revision.Oct 12 2021, 1:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 12 2021, 1:28 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
snehasish updated this revision to Diff 379177.Oct 12 2021, 2:01 PM

Cleanup some rebase issues.

vitalybuka added inline comments.Oct 12 2021, 4:56 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

You can fix the race between print and merge by unlocking the allocator later;

however, it does not solve the merge vs merge race.

293

previously we had a lock here

compiler-rt/lib/memprof/memprof_mibmap.h
6

AddrHashMap is useful if we need to delete items or if the address spans the full 64-bit address space.
D111608 just today made alloc_context_id very sequential [0..2^31), so you can use a simpler structure:

using MIBMapTy = TwoLevelMap<MemInfoBlock, StackDepot::kNodesSize1, StackDepot::kNodesSize2>;

inline void InsertOrMerge(const uptr Id, const MemInfoBlock &Block,
                          MIBMapTy &Map) {
  Map[Id].Merge(Block);
}

void Print(const MIBMapTy &Map) {
  for (u32 i = 0; Map.contains(i); ++i)
    Map[i].Print();
}

After that you have the same thing without the ForEach implementation.
Still, Merge vs Merge is not solved.

20

Handle does not lock the contents of MemInfoBlock, so it's a data race on Merge

snehasish updated this revision to Diff 379547.Oct 13 2021, 4:11 PM

Address merge data race.

snehasish updated this revision to Diff 379549.Oct 13 2021, 4:19 PM
snehasish marked an inline comment as done.

Address comments.

compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Good idea. Thanks!

compiler-rt/lib/memprof/memprof_mibmap.h
6

I'm very hesitant to rely on the sequential ordering of allocation context ids. I think it would be best for memprof to treat the id as an opaque value so that the underlying implementation can evolve. For example, memprof may eventually store additional data which is not indexed by a calling context.

So I would prefer it if we continue with the ForEach implementation in the parent patch unless you have a strong opinion.

20

Added locking in InsertOrMerge based on the context id. PTAL.
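
For reference, "locking based on the context id" could be implemented as lock striping, roughly as in the sketch below. The shard count and names are hypothetical, not taken from the diff.

// Hypothetical lock-striping sketch: the context id selects one of a
// small, fixed pool of mutexes, so unrelated contexts usually take
// different locks. The constant and names are illustrative only.
#include <cstdint>
#include <mutex>

constexpr uint64_t kNumShards = 8; // illustrative shard count

std::mutex ShardMutexes[kNumShards];

std::mutex &MutexForContext(uint64_t Id) {
  return ShardMutexes[Id % kNumShards];
}

// InsertOrMerge would then take the shard lock before touching the entry:
//   std::lock_guard<std::mutex> L(MutexForContext(Id));
//   ... find-or-insert and merge under the lock ...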

snehasish added inline comments.Oct 13 2021, 5:03 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Actually I think this does not solve the Merge-Print race. The InsertOrMerge call in the Deallocate hook (L633 in the original source) is not protected by a lock. We could add more synchronization to ensure that we only start "destructing" after all other threads have finished running Deallocate but I don't think that's necessary. It's ok to be a little lossy in this case. @tejohnson Wdyt?

vitalybuka added inline comments.Oct 13 2021, 5:53 PM
compiler-rt/lib/memprof/memprof_mibmap.cpp
25

With such hashing (Idx -> 2^8) it's very likely that different cores will compete for the same mutex, making this as bad as a single global mutex.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?

Then you can lock the same mutex in Print, solving all of the issues above.
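
A rough sketch of this suggestion, assuming a 1-byte spin lock embedded in each block and taken in both Merge and Print; the fields and hand-rolled spin lock are illustrative only (the real code would use a sanitizer primitive such as StaticSpinMutex).

// Embedding a 1-byte spin mutex in each MemInfoBlock serializes
// concurrent Merge/Merge and Print/Merge on the same block.
#include <atomic>
#include <cstdint>

struct MemInfoBlock {
  uint64_t AllocCount = 0;
  uint64_t TotalSize = 0;
  std::atomic<uint8_t> Locked{0}; // 1-byte spin mutex

  void Lock() {
    while (Locked.exchange(1, std::memory_order_acquire)) {
      // spin until the previous holder releases the lock
    }
  }
  void Unlock() { Locked.store(0, std::memory_order_release); }

  void Merge(const MemInfoBlock &Other) {
    Lock();
    AllocCount += Other.AllocCount;
    TotalSize += Other.TotalSize;
    Unlock();
  }

  void Print() {
    Lock();
    // ... emit the fields while holding the lock so the snapshot is
    // consistent with concurrent merges ...
    Unlock();
  }
};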

compiler-rt/lib/memprof/memprof_mibmap.h
6

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

With such hashing (Idx -> 2^8) it's very likely that different cores will compete for the same mutex, making this as bad as a single global mutex.

Yes, the contention here (on 8 locks) is more than what was present in the prior implementation, which used a lock per cache set (default number of sets ~16K). However, I ran this implementation and compared the overheads on a large internal workload and found them to be roughly of the same order of magnitude, i.e. both implementations (hashmap and prior cache-based) are 2X-3X faster than PGO.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?
Then you can lock the same mutex in Print, solving all of the issues above.

That's a neat idea. For context: we intend to share the MemInfoBlock entry definition across compiler-rt and llvm/ProfileData using a .inc style header (details in the RFC [1]). I will evaluate the performance impact of the added 1 byte to see whether it makes sense, thanks for the suggestion!
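
The ".inc"-style sharing mentioned above typically boils down to an X-macro field list expanded by each consumer. The sketch below keeps the list in the same file to stay self-contained; in a real split the list would live in a shared .inc header included by both compiler-rt and llvm/ProfileData, and all names here are made up for illustration.

#include <cstdint>

// In a shared header this list would be the single source of truth for
// the MemInfoBlock layout. Field names below are illustrative.
#define MIB_FIELD_LIST(X)                                                      \
  X(uint64_t, AllocCount)                                                      \
  X(uint64_t, TotalSize)                                                       \
  X(uint64_t, MaxSize)

struct MemInfoBlock {
#define DECLARE_FIELD(Type, Name) Type Name = 0;
  MIB_FIELD_LIST(DECLARE_FIELD)
#undef DECLARE_FIELD
};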

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

I agree that the evolution of memprof will likely outgrow the current choice. I am still in favour of the AddrHashMap extension since it allows us to not rely on the sequential context id. I will defer to Teresa to make the decision.

[1] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

I agree that the evolution of memprof will likely outgrow the current choice. I am still in favour of the AddrHashMap extension since it allows us to not rely on the sequential context id. I will defer to Teresa to make the decision.

I'm also hesitant to rely on having sequential ids in this code.

compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Yes, I think we can accept some loss in case of a race here. It isn't a new issue with this patch in any case. If necessary we can add more synchronization later.

compiler-rt/lib/memprof/memprof_mibmap.h
11

I'm skeptical that 199 is large enough to avoid too much performance overhead from collisions and the resulting traversal. Why so much smaller than the default MIB Cache size?

vitalybuka added inline comments.Oct 14 2021, 10:16 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Having a 1-byte mutex in MemInfoBlock solves both the Print/Merge and Merge/Merge issues.

snehasish updated this revision to Diff 384814.Nov 4 2021, 10:54 AM

Address comments.

  • Tune the address map size based on an internal workload and double-check on clang.
  • Guard deallocation insertions to ensure the map has been initialized (a rough sketch of this guard follows below).
  • Use a pointer as the value in the map instead of the MIB itself.
  • Store a 1-byte mutex for each MIB to allow fine-grained locking.
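
A rough sketch of the deallocation-path guard mentioned in the second bullet; the flag and function names are illustrative, not the actual runtime symbols.

#include <atomic>

// Hypothetical guard: skip recording frees that race with startup or
// shutdown, before the MIB map exists. Losing a few records here is
// acceptable, matching the discussion above.
std::atomic<bool> ConstructedMIBMap{false}; // set once the map is ready

void InsertOrMergeMIB(/* context id and MemInfoBlock elided */) {}

void MemprofDeallocateHook(/* allocation info elided */) {
  if (!ConstructedMIBMap.load(std::memory_order_acquire))
    return;
  InsertOrMergeMIB();
}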
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2021, 10:54 AM
snehasish updated this revision to Diff 384816.Nov 4 2021, 10:59 AM
snehasish marked 3 inline comments as done.

Remove accidental change to llvm/cmake/modules/HandleLLVMOptions.cmake

PTAL, thanks!

compiler-rt/lib/memprof/memprof_mibmap.h
6

After internal discussion and evaluation on a large workload we decided to keep using the AddrHashMap implementation for now since it gives us more flexibility.

LGTM, but looking through the tests I added, it looks like I was remiss in adding one that will result in a merge. Could you add one with this change?

vitalybuka added inline comments.Nov 4 2021, 2:22 PM
compiler-rt/lib/memprof/memprof_meminfoblock.h
12

Can you extract a patch which moves MemInfoBlock into memprof_meminfoblock.h, keeping it as unchanged as possible?

snehasish updated this revision to Diff 385175.Nov 5 2021, 1:08 PM

Update diff after splitting out MemInfoBlock changes as a separate patch.

vitalybuka accepted this revision.Nov 5 2021, 1:26 PM
vitalybuka added inline comments.
compiler-rt/lib/memprof/CMakeLists.txt
39

this belongs to another patch

This revision is now accepted and ready to land.Nov 5 2021, 1:26 PM
vitalybuka added 1 blocking reviewer(s): tejohnson.Nov 5 2021, 1:28 PM
This revision now requires review to proceed.Nov 5 2021, 1:28 PM
snehasish updated this revision to Diff 385189.Nov 5 2021, 2:10 PM

Update CMakeLists.txt and add a merge test.

snehasish marked an inline comment as done.Nov 5 2021, 2:17 PM

Thanks for the review Vitaly. I've cleaned up the CMakeLists.txt and added a test to check merging of MIBs. @tejohnson PTAL, thanks!

This revision is now accepted and ready to land.Nov 5 2021, 2:17 PM
compiler-rt/lib/memprof/memprof_mibmap.cpp