This is an archive of the discontinued LLVM Phabricator instance.

[memprof] Replace the block cache with a hashmap.
ClosedPublic

Authored by snehasish on Oct 12 2021, 1:28 PM.

Details

Summary

The existing implementation uses a cache + eviction based scheme to
record heap profile information. This design was adopted to ensure a
constant memory overhead (due to fixed number of cache entries) along
with incremental write-to-disk for evictions. We find that since the
number of entries to track is O(unique-allocation-contexts), the overhead
of keeping all contexts in memory is not very high. On a clang workload,
the max number of unique allocation contexts was ~35K, median ~11K.
For each context, we (currently) store 64 bytes of data - this amounts
to 5.5MB (max). Given the low overheads for a complex workload, we can
simplify the implementation by using a hashmap without eviction.
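
To make the scheme concrete, below is a minimal sketch of an insert-or-merge map keyed by allocation context id. It uses STL containers and a single lock purely for illustration; the actual runtime uses sanitizer_common data structures and different locking, and all names here are stand-ins.

// Simplified illustration of the no-eviction scheme: every unique
// allocation context id maps to a single MemInfoBlock that is merged
// in place. Types and locking are stand-ins, not the real runtime code.
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct MemInfoBlock {
  uint64_t AllocCount = 0;
  uint64_t TotalSize = 0;
  // ... remaining fields elided ...
  void Merge(const MemInfoBlock &Other) {
    AllocCount += Other.AllocCount;
    TotalSize += Other.TotalSize;
  }
};

class MIBMap {
public:
  // Insert a new block for Id, or merge Block into the existing entry.
  void InsertOrMerge(uint64_t Id, const MemInfoBlock &Block) {
    std::lock_guard<std::mutex> L(Mu);
    auto It = Map.find(Id);
    if (It == Map.end())
      Map.emplace(Id, Block);
    else
      It->second.Merge(Block);
  }

private:
  std::mutex Mu;
  std::unordered_map<uint64_t, MemInfoBlock> Map;
};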

Also refactored the MemInfoBlock struct out into a separate file so that
we can refer to it from a different module. Longer term, this
will be replaced by a .inc file shared between compiler-rt and
llvm/ProfileData.

Diff Detail

Event Timeline

snehasish created this revision.Oct 12 2021, 1:28 PM
snehasish requested review of this revision.Oct 12 2021, 1:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 12 2021, 1:28 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
snehasish updated this revision to Diff 379177.Oct 12 2021, 2:01 PM

Cleanup some rebase issues.

vitalybuka added inline comments.Oct 12 2021, 4:56 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

You can fix the race between print and merge by unlocking the allocator later;

however, it does not solve the merge vs merge race.

293

previously we had a lock here

compiler-rt/lib/memprof/memprof_mibmap.h
6

AddrHashMap is useful if we need to delete items or if the address spans the full 64-bit address space.
D111608 just today made alloc_context_id very sequential [0..2^31), so you can use a simpler structure:

using MIBMapTy = TwoLevelMap<MemInfoBlock, StackDepot::kNodesSize1, StackDepot::kNodesSize2>;

inline void InsertOrMerge(const uptr Id, const MemInfoBlock &Block,
                          MIBMapTy &Map) {
  Map[Id].Merge(Block);
}

void Print(const MIBMapTy &Map) {
  for (u32 i = 0; Map.contains(i); ++i)
    Map[i].Print();
}

After that you have the same thing without the ForEach implementation.
Still, Merge vs Merge is not solved.

20

Handle does not lock the contents of MemInfoBlock, so it's a data race on Merge

snehasish updated this revision to Diff 379547.Oct 13 2021, 4:11 PM

Address merge data race.

snehasish updated this revision to Diff 379549.Oct 13 2021, 4:19 PM
snehasish marked an inline comment as done.

Address comments.

compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Good idea. Thanks!

compiler-rt/lib/memprof/memprof_mibmap.h
6

I'm very hesitant to rely on the sequential ordering of allocation context ids. I think it would be best for memprof to treat the id as an opaque value so that the underlying implementation can evolve. For example, memprof may eventually store additional data which is not indexed by a calling context.

So I would prefer it if we continue with the ForEach implementation in the parent patch unless you have a strong opinion.

20

Added locking in InsertOrMerge based on the context id. PTAL.
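
For reference, "locking based on the context id" could be implemented as lock striping, roughly as in the sketch below. The shard count and names are hypothetical, not taken from the diff.

// Hypothetical lock-striping sketch: the context id selects one of a
// small, fixed pool of mutexes, so unrelated contexts usually take
// different locks. The constant and names are illustrative only.
#include <cstdint>
#include <mutex>

constexpr uint64_t kNumShards = 8; // illustrative shard count

std::mutex ShardMutexes[kNumShards];

std::mutex &MutexForContext(uint64_t Id) {
  return ShardMutexes[Id % kNumShards];
}

// InsertOrMerge would then take the shard lock before touching the entry:
//   std::lock_guard<std::mutex> L(MutexForContext(Id));
//   ... find-or-insert and merge under the lock ...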

snehasish added inline comments.Oct 13 2021, 5:03 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Actually I think this does not solve the Merge-Print race. The InsertOrMerge call in the Deallocate hook (L633 in the original source) is not protected by a lock. We could add more synchronization to ensure that we only start "destructing" after all other threads have finished running Deallocate but I don't think that's necessary. It's ok to be a little lossy in this case. @tejohnson Wdyt?

vitalybuka added inline comments.Oct 13 2021, 5:53 PM
compiler-rt/lib/memprof/memprof_mibmap.cpp
25

With such hashing (Idx -> 2^8) it's very likely that different cores will compete for the same mutex, making this as bad as a single global mutex.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?

Then you can lock the same mutex in Print, solving all of the issues above.
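
A rough sketch of this suggestion, assuming a 1-byte spin lock embedded in each block and taken in both Merge and Print; the fields and hand-rolled spin lock are illustrative only (the real code would use a sanitizer primitive such as StaticSpinMutex).

// Embedding a 1-byte spin mutex in each MemInfoBlock serializes
// concurrent Merge/Merge and Print/Merge on the same block.
#include <atomic>
#include <cstdint>

struct MemInfoBlock {
  uint64_t AllocCount = 0;
  uint64_t TotalSize = 0;
  std::atomic<uint8_t> Locked{0}; // 1-byte spin mutex

  void Lock() {
    while (Locked.exchange(1, std::memory_order_acquire)) {
      // spin until the previous holder releases the lock
    }
  }
  void Unlock() { Locked.store(0, std::memory_order_release); }

  void Merge(const MemInfoBlock &Other) {
    Lock();
    AllocCount += Other.AllocCount;
    TotalSize += Other.TotalSize;
    Unlock();
  }

  void Print() {
    Lock();
    // ... emit the fields while holding the lock so the snapshot is
    // consistent with concurrent merges ...
    Unlock();
  }
};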

compiler-rt/lib/memprof/memprof_mibmap.h
6

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

With such hashing (Idx -> 2^8) it's very likely that different cores will compete for the same mutex, making this as bad as a single global mutex.

Yes, the contention here (on 8 locks) is more than what was present in the prior implementation, which used a lock per cache set (default number of sets ~16K). However, I ran this implementation and compared the overheads on a large internal workload and found them to be roughly of the same order of magnitude, i.e. both implementations (hashmap and prior cache-based) are 2X-3X faster than PGO.

The size of MemInfoBlock is quite large, and a spin mutex is just 1 byte, so why not put the mutex inside each MemInfoBlock?
Then you can lock the same mutex in Print, solving all of the issues above.

That's a neat idea. For context: we intend to share the MemInfoBlock entry definition across compiler-rt and llvm/ProfileData using a .inc style header (details in the RFC [1]). I will evaluate the performance impact of the added 1 byte to see whether it makes sense, thanks for the suggestion!
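
The ".inc"-style sharing mentioned above typically boils down to an X-macro field list expanded by each consumer. The sketch below keeps the list in the same file to stay self-contained; in a real split the list would live in a shared .inc header included by both compiler-rt and llvm/ProfileData, and all names here are made up for illustration.

#include <cstdint>

// In a shared header this list would be the single source of truth for
// the MemInfoBlock layout. Field names below are illustrative.
#define MIB_FIELD_LIST(X)                                                      \
  X(uint64_t, AllocCount)                                                      \
  X(uint64_t, TotalSize)                                                       \
  X(uint64_t, MaxSize)

struct MemInfoBlock {
#define DECLARE_FIELD(Type, Name) Type Name = 0;
  MIB_FIELD_LIST(DECLARE_FIELD)
#undef DECLARE_FIELD
};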

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

I agree that the evolution of memprof will likely outgrow the current choice. I am still in favour of the AddrHashMap extension since it allows us to not rely on the sequential context id. I will defer to Teresa to make the decision.

[1] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html

I would recommend keeping it simple, which means TwoLevelMap.
It requires a trivial change to this patch, and you can always recover ForEach if needed. Very likely the evolution of memprof will lead you to unexpected places where neither solution is good enough.

I agree that the evolution of memprof will likely outgrow the current choice. I am still in favour of the AddrHashMap extension since it allows us to not rely on the sequential context id. I will defer to Teresa to make the decision.

I'm also hesitant to rely on having sequential ids in this code.

compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Yes, I think we can accept some loss in case of a race here. It isn't a new issue with this patch in any case. If necessary we can add more synchronization later.

compiler-rt/lib/memprof/memprof_mibmap.h
11

I'm skeptical that 199 is large enough to avoid too much performance overhead from collisions and the resulting traversal. Why so much smaller than the default MIB Cache size?

vitalybuka added inline comments.Oct 14 2021, 10:16 PM
compiler-rt/lib/memprof/memprof_allocator.cpp
258–264

Having a 1-byte mutex in MemInfoBlock solves both the Print/Merge and Merge/Merge issues.

snehasish updated this revision to Diff 384814.Nov 4 2021, 10:54 AM

Address comments.

  • Tune the address map size based on an internal workload and double-check on clang.
  • Guard deallocation insertions to ensure the map has been initialized (a rough sketch of this guard follows below).
  • Use a pointer as the value in the map instead of the MIB itself.
  • Store a 1-byte mutex for each MIB to allow fine-grained locking.
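
A rough sketch of the deallocation-path guard mentioned in the second bullet; the flag and function names are illustrative, not the actual runtime symbols.

#include <atomic>

// Hypothetical guard: skip recording frees that race with startup or
// shutdown, before the MIB map exists. Losing a few records here is
// acceptable, matching the discussion above.
std::atomic<bool> ConstructedMIBMap{false}; // set once the map is ready

void InsertOrMergeMIB(/* context id and MemInfoBlock elided */) {}

void MemprofDeallocateHook(/* allocation info elided */) {
  if (!ConstructedMIBMap.load(std::memory_order_acquire))
    return;
  InsertOrMergeMIB();
}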
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2021, 10:54 AM
snehasish updated this revision to Diff 384816.Nov 4 2021, 10:59 AM
snehasish marked 3 inline comments as done.

Remove accidental change to llvm/cmake/modules/HandleLLVMOptions.cmake

PTAL, thanks!

compiler-rt/lib/memprof/memprof_mibmap.h
6

After internal discussion and evaluation on a large workload we decided to keep using the AddrHashMap implementation for now since it gives us more flexibility.

LGTM, but looking through the tests I added, it looks like I was remiss in adding one that will result in a merge. Could you add one with this change?

vitalybuka added inline comments.Nov 4 2021, 2:22 PM
compiler-rt/lib/memprof/memprof_meminfoblock.h
12

Can you extract a patch which moves MemInfoBlock into memprof_meminfoblock.h, keeping it as unchanged as possible?

snehasish updated this revision to Diff 385175.Nov 5 2021, 1:08 PM

Update diff after splitting out MemInfoBlock changes as a separate patch.

vitalybuka accepted this revision.Nov 5 2021, 1:26 PM
vitalybuka added inline comments.
compiler-rt/lib/memprof/CMakeLists.txt
39

this belongs to another patch

This revision is now accepted and ready to land.Nov 5 2021, 1:26 PM
vitalybuka added 1 blocking reviewer(s): tejohnson.Nov 5 2021, 1:28 PM
This revision now requires review to proceed.Nov 5 2021, 1:28 PM
snehasish updated this revision to Diff 385189.Nov 5 2021, 2:10 PM

Update CMakeLists.txt and add a merge test.

snehasish marked an inline comment as done.Nov 5 2021, 2:17 PM

Thanks for the review Vitaly. I've cleaned up the CMakeLists.txt and added a test to check merging of MIBs. @tejohnson PTAL, thanks!

This revision is now accepted and ready to land.Nov 5 2021, 2:17 PM
compiler-rt/lib/memprof/memprof_mibmap.cpp