
Enable task dependencies hashmap resizing

Authored by viroulep on Sep 11 2019, 7:31 AM.



This patch is a follow-up to
In this paper (full text) we studied the impact of the dependency hash table capacity on the performance of one of our applications (see section 3.3; the experiments use the "jacobi" kernel from the KASTORS benchmark suite).
We did a breakdown of the task-management-related time (Figure 3), which showed that a lot of time was spent in the dependency-checking part; we narrowed this down to the dependency lookup itself.
Statically changing the size proved to be an appropriate quick fix, but it would be nicer to have the hash tables resized automatically when some threshold is reached.

While simply doubling the hash table capacity would be quite easy to implement, it leads to disastrous collision statistics (see here and here for experiments on a 24-core Haswell architecture and an ARM architecture).
The plots show how many buckets hold x elements just before the hash table is resized (I removed the 0-element bar for visibility, as most buckets are actually empty); in short, there are a whole lot of collisions.

So instead I went for an arbitrary fixed set of capacities, using prime numbers close to twice the previous capacity.
The bucket distributions for the same application executions (here and here) look far better: the majority of the buckets are used, and collisions don't climb as high as before.

The hashtable resizing is triggered when the total number of conflicts in all buckets exceeds the number of buckets.

By using this resizing mechanism we observed roughly the same performance gains as in the paper, but without having to manually change the hash table size.

Event Timeline

viroulep created this revision. Sep 11 2019, 7:31 AM

I like this. If there are objections, questions, or concerns, please raise them now.

72 ↗(On Diff #219714)

Do we want to give up here? I've seen people with *a lot* of dynamic tasks so we might want to scale somehow.

AndreyChurbanov accepted this revision. Sep 16 2019, 9:10 AM
AndreyChurbanov added inline comments.
72 ↗(On Diff #219714)

I don't think making the hash unlimited is the right way to scale applications with millions of different dependences per single hash. There should be a trivial workaround for users, like replacing

for (i = 0; i < N; i++) {
  #pragma omp task depend(inout: deps[i])
  ...
}

with

for (i = 0; i < N; i++) {
  #pragma omp task depend(inout: deps[i % some_threshold])
  ...
}
This should work fine given that some_threshold is reasonably bigger than the number of threads. Rather than maintaining an unlimited hash, the same dependences can be re-used without losing semantics, e.g. if the previous tasks have already been executed by the team. Even if this is not realistic, e.g. because tasks are generated faster than they are executed, the overhead of the extra dependences should be negligible compared to maintaining an unlimited hash (my view may be questionable, of course...).

I also thought of things like cleaning up a hash entry once its dependence is no longer needed, but this does not look simple to implement, and it may hurt performance because of the extra synchronization required.

This revision is now accepted and ready to land. Sep 16 2019, 9:10 AM

@viroulep do you have commit access or should we land that for you?

Sorry for the delay in getting back to you there.
@protze.joachim I actually don't have commit access, it would be awesome if one of you could land this patch for me!

@AndreyChurbanov, @jdoerfert thanks for the reviews and comments!

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Sep 25 2019, 7:39 AM