This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fix hidden helper + affinity assignment
ClosedPublic

Authored by jlpeyton on May 4 2021, 9:03 PM.

Details

Summary

When KMP_AFFINITY is set, each thread's gtid value is used as an index into the place list to determine the thread's placement. With hidden helpers enabled, this gtid value is shifted down, leading to unexpectedly shifted thread placement. This patch restores the previous behavior by adjusting the mask index to take the number of hidden helper threads into account.

Hidden helper threads are given the full initial mask and do not participate in any of the affinity mechanisms besides the initial affinity assignment.
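
As a rough illustration of the adjustment (a minimal C++ sketch; the function and parameter names are invented and do not match the runtime's actual code):

// Minimal sketch of the mask-index adjustment described in the summary;
// all names here are invented for illustration.
static int affinity_mask_index(int gtid, int num_hidden_helper_threads) {
  // Hidden helpers occupy gtids 1..N, shifting regular workers' gtids up
  // by N. Subtract that shift so workers index the KMP_AFFINITY place
  // list as if the helpers did not exist.
  if (gtid > num_hidden_helper_threads)
    return gtid - num_hidden_helper_threads;
  return gtid; // the initial thread (gtid 0) keeps index 0
}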

Event Timeline

jlpeyton created this revision.May 4 2021, 9:03 PM
jlpeyton requested review of this revision.May 4 2021, 9:03 PM

Do we set any affinity for the hidden helper threads, or are they free floating?

@ronlieb Does this patch fix the performance issue you experienced for the SPEC CPU benchmark?

Testing results: 128 cores used.

In my production tree:
Current helpers=false num_helpers=0
Success 619.lbm_s base refspeed ratio=77.36, runtime=67.711483

Helpers=false num_helpers=8 (more like upstream): slowdown
Success 619.lbm_s base refspeed ratio=63.64, runtime=82.301507

Helpers=false num_helpers=8 patch applied: slowdown slightly improved, but not recovered sufficiently
Success 619.lbm_s base refspeed ratio=66.47, runtime=78.807917

Helpers=false num_helpers=0 patch applied: performance recovered
Success 619.lbm_s base refspeed ratio=77.31, runtime=67.752051

@ronlieb Thanks for testing! When I saw this patch, I thought that broken thread binding might be the cause of the observed performance issue.

@tianshilei1992 this patch is just another case confirming my statement in D77609:

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously.

I wonder whether it would improve the situation if we moved the hidden helper threads to __kmp_threads[-8:-1], i.e., below the initial thread.

Hi Joachim, you're welcome; happy to try again if you come up with another patch.
FYI: our benchmark run depends on setting this env var:
export GOMP_CPU_AFFINITY=0-127

hbae removed a reviewer: jlpeyton.
hbae added a comment.May 5 2021, 6:16 AM

Ignore my last action (my mistake).

Thanks for helping with the issue! And thanks @ronlieb for testing.

@tianshilei1992 this patch is just another case confirming my statement in D77609:

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously.

I wonder whether it would improve the situation if we moved the hidden helper threads to __kmp_threads[-8:-1], i.e., below the initial thread.

Yeah, that would be the ideal way to do it. However, it doesn't work, because negative gtids have special uses, e.g. for locks. We cannot simply move those special gtids to a very "large" negative value (a very small one, mathematically) because we don't know the maximum number of helper threads. Since helper threads are not part of the spec, we could arguably set a maximum number and move those special gtids out of that range, but the drawback is: if helper threads (or a similar concept) are adopted into the spec in the future, and the spec doesn't set a maximum number (which I think is very possible), then we would need to "redesign" again. That's why I chose to use the first few slots for the helper threads, but I didn't expect that to break the affinity configuration. :-)
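
For context, the runtime already reserves a small band of negative gtids as sentinels, roughly as below (names and values reproduced from memory of kmp.h; treat them as illustrative rather than authoritative):

// Reserved negative gtid sentinels, approximately as in kmp.h:
#define KMP_GTID_DNE      (-2) // thread does not exist
#define KMP_GTID_SHUTDOWN (-3) // thread is shutting down
#define KMP_GTID_MONITOR  (-4) // monitor thread
#define KMP_GTID_UNKNOWN  (-5) // gtid not (yet) known
// Parking hidden helpers at __kmp_threads[-8:-1] would collide with this
// sentinel band, and the spec gives no upper bound on how many helper
// threads a future revision might require, so the band cannot simply be
// moved "far enough" down.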

I cannot reproduce the regression with this patch. On a 48-core (96-hyperthread) machine with the settings

GOMP_CPU_AFFINITY=0-95
OMP_NUM_THREADS=96

I got the following results:

Current library default:
Success 619.lbm_s base refspeed ratio=67.88, runtime=77.167402
Patch applied:
Success 619.lbm_s base refspeed ratio=87.54, runtime=59.837863
Helpers=false num_helpers=0:
Success 619.lbm_s base refspeed ratio=87.28, runtime=60.015925
Helpers=false num_helpers=0 patch applied:
Success 619.lbm_s base refspeed ratio=87.88, runtime=59.602862

@ronlieb, could you please provide the error output of the library with the setting

KMP_AFFINITY=verbose

for your runs with/without the patch applied (the slightly improved and the slow ones)?

Since the outputs will likely be big, it might be better to attach them to the corresponding bug, https://bugs.llvm.org/show_bug.cgi?id=49673, in order not to pollute this review with huge testing logs.

Do we set any affinity for the hidden helper threads, or are they free floating?

The hidden helpers do get their affinity set as if they were normal worker threads.

e.g., with this patch and if KMP_AFFINITY=compact (and two hardware threads per core), then (see the model sketch after this list):
regular gtid 0 is pinned to the first core
regular gtid 9 is pinned to the first core
regular gtid 10 is pinned to the second core
regular gtid 11 is pinned to the second core
regular gtid 12 is pinned to the third core
...
hidden helper gtid 1 is pinned to the first core
hidden helper gtid 2 is pinned to the second core
hidden helper gtid 3 is pinned to the second core
hidden helper gtid 4 is pinned to the third core
...
hidden helper gtid 8 is pinned to the fifth core
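
The mapping above can be reproduced with a tiny model of the compact placement (a self-contained C++ sketch, purely illustrative; it is not the runtime's code):

#include <cstdio>

// Model of KMP_AFFINITY=compact with two hardware threads per core:
// mask index i lands on core i/2. Regular workers subtract the helper
// count from their gtid before indexing the place list; hidden helpers
// (gtids 1..8 here) use their raw gtid.
static int core_for_mask_index(int idx) { return idx / 2; }

int main() {
  const int num_helpers = 8;
  std::printf("regular gtid 0 -> core %d\n", core_for_mask_index(0));
  for (int gtid = 9; gtid <= 12; ++gtid) // regular workers
    std::printf("regular gtid %d -> core %d\n", gtid,
                core_for_mask_index(gtid - num_helpers));
  for (int gtid = 1; gtid <= num_helpers; ++gtid) // hidden helpers
    std::printf("hidden helper gtid %d -> core %d\n", gtid,
                core_for_mask_index(gtid));
  return 0;
}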

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

That seems to be a reasonable default for now.

Sorry, I missed the request for the verbose output; is that still needed from me?

jlpeyton updated this revision to Diff 344202.May 10 2021, 2:31 PM
jlpeyton edited the summary of this revision.

With the latest patch:

Added full mask assignment for hidden helper threads.

Took away the check for hidden helpers being enabled in the adjust_gtid function, since the gtid needs to be adjusted regardless of whether hidden helpers are enabled or not.

Stopped hidden helpers from printing their thread affinity, unless it is a debug build, to help avoid confusion when using KMP_AFFINITY=verbose.
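
A minimal sketch of the helper-thread side of this update (all names are invented; KMP_DEBUG is assumed to be the runtime's debug-build macro):

#include <cstdio>

// Hypothetical stand-in for binding a thread to the full initial mask;
// the real runtime manipulates kmp_affin_mask_t objects instead.
static void bind_to_full_initial_mask(int gtid) {
  std::printf("hidden helper gtid %d: full initial mask\n", gtid);
}

static void set_hidden_helper_affinity(int gtid) {
  // Helpers receive the whole initial mask, so they never consume (or
  // shift) entries of the KMP_AFFINITY place list.
  bind_to_full_initial_mask(gtid);
#ifdef KMP_DEBUG // assumption: the runtime's debug-build macro
  // Helper affinity is reported only in debug builds, keeping
  // KMP_AFFINITY=verbose output free of hidden-helper noise.
  std::printf("hidden helper gtid %d: affinity printed\n", gtid);
#endif
}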

Sorry, I missed the request for the verbose output; is that still needed from me?

When @AndreyChurbanov did the testing, this removed the slowdown completely. Assuming this is not the case for you, the output would probably help.

@AndreyChurbanov, this seems to improve the upstream repo, right? If so, we should probably go ahead and land this even while we figure out what @ronlieb is seeing.

I am testing your new patch.
I have no objection if you land what you have.
Up to you.

ronlieb accepted this revision.May 10 2021, 4:41 PM

I can confirm 619.lbm looks better with the patch; performance recovered.
Thanks!

This revision is now accepted and ready to land.May 10 2021, 4:41 PM

Do we set any affinity for the hidden helper threads, or are they free floating?

The hidden helpers do get their affinity set as if they were normal worker threads.

e.g., with this patch and if KMP_AFFINITY=compact (and two hardware threads per core), then
regular gtid 0 is pinned to the first core
regular gtid 9 is pinned to the first core
regular gtid 10 is pinned to the second core
regular gtid 11 is pinned to the second core
regular gtid 12 is pinned to the third core
...
hidden helper gtid 1 is pinned to the first core
hidden helper gtid 2 is pinned to the second core
hidden helper gtid 3 is pinned to the second core
hidden helper gtid 4 is pinned to the third core
...
hidden helper gtid 8 is pinned to the fifth core

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

I don't think that such binding of the hidden threads is optimal. On our dual-socket system with sub-NUMA clustering, this would bind all helper threads to the same sub-NUMA domain.
I'd suggest a second env var to control the binding of the helper threads (with some reasonable default), e.g. along the lines sketched below.

This is nothing that should block this patch; it can be implemented in a follow-up patch.
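
Purely as a hypothetical illustration of such a control (this variable name and value are invented here and did not exist in the runtime at the time of this review):

# Hypothetical env var: give hidden helpers their own binding policy,
# independent of the workers' KMP_AFFINITY/GOMP_CPU_AFFINITY setting.
export KMP_HIDDEN_HELPER_AFFINITY=scatter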

This revision was landed with ongoing or failed builds.May 11 2021, 6:55 AM
This revision was automatically updated to reflect the committed changes.