This is an archive of the discontinued LLVM Phabricator instance.

Make balanced affinity work on AArch64 (and possibly other architectures too)
Closed, Public

Authored by pawosm01 on Jul 14 2016, 9:34 AM.


Details

Reviewers

jlpeyton
AndreyChurbanov

Commits

rGecbe2ea002cd: Make balanced affinity work on AArch64.
rOMP277212: Make balanced affinity work on AArch64.
rL277212: Make balanced affinity work on AArch64.

Summary

This patch enables balanced affinity on machines that do not have
hardware threads and/or have cores clustered into packages. In fact, the
balancing algorithm could be generalized to any arrangement with
at least two levels of hierarchy (depth > 1).

Diff Detail

Repository: rL LLVM

Event Timeline

pawosm01 updated this revision to Diff 63992. Jul 14 2016, 9:34 AM

pawosm01 retitled this revision to Make balanced affinity work on AArch64 (and possibly other architectures too).

pawosm01 updated this object.

pawosm01 added reviewers: jlpeyton, AndreyChurbanov.

pawosm01 set the repository for this revision to rL LLVM.

pawosm01 added a subscriber: openmp-commits.

Herald added subscribers: rengolin, aemerson. Jul 14 2016, 9:34 AM

Can you re-upload the patch with all context?

pawosm01 updated this revision to Diff 63996. Jul 14 2016, 9:47 AM

pawosm01 edited edge metadata.

Hello?

As it stands, the patch does not work for the case of a multi-package non-uniform topology.

So this case should be either fixed or disabled.

runtime/src/kmp_affinity.cpp
3994	This is the number of cores in the last package, as the code is supposed to be run on single-package topology only.
3995	This code does not take the number of packages into account (it is supposed to be run on single-package topology).
4017	Here the memory gets overwritten when the last package has fewer cores than the maximum available, because (core*nth_per_core+thread) can be nproc or larger, so the write lands outside the array bounds.
4655	This is the number of cores in the last package. Other packages are skipped (no threads are bound there).
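
The out-of-bounds write described for line 4017 can be reproduced with a small stand-alone model. The sketch below is not runtime code: `ProcLabel`, `sized_from_last_entry`, `largest_index_written`, and `fill_would_overflow` are hypothetical names, and it assumes (as the comments above suggest) that core labels restart from 0 in every package and that the entries are sorted by package.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for one address2os entry:
// only the core and thread labels matter for this illustration.
struct ProcLabel {
    int core;    // core label (assumed to restart from 0 in each package)
    int thread;  // hardware-thread label within the core
};

// Array size computed the way the patch under review does it:
// from the core label of the LAST entry only.
int sized_from_last_entry(const std::vector<ProcLabel>& procs, int nth_per_core) {
    assert(!procs.empty() && nth_per_core > 0);
    int ncores = procs.back().core + 1;
    return nth_per_core * ncores;
}

// Largest index the fill loop would write: core * nth_per_core + thread.
int largest_index_written(const std::vector<ProcLabel>& procs, int nth_per_core) {
    assert(nth_per_core > 0);
    int max_index = -1;
    for (const ProcLabel& p : procs) {
        int index = p.core * nth_per_core + (p.thread % nth_per_core);
        if (index > max_index) {
            max_index = index;
        }
    }
    return max_index;
}

// True when at least one write would fall outside the allocated array.
bool fill_would_overflow(const std::vector<ProcLabel>& procs, int nth_per_core) {
    return largest_index_written(procs, nth_per_core) >=
           sized_from_last_entry(procs, nth_per_core);
}
```

For a hyperthreaded machine with 4 cores in the first package and 2 cores in the last one (2 threads per core), this model sizes the array for 4 slots while the fill loop reaches index 7.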

This revision now requires changes to proceed. Jul 21 2016, 9:23 AM

Seems like we wrote our comments at the same time. I'll look at the issues raised.

There are ARMv8 boards where the cores of one SoC are clustered into two packages: one for LITTLE cores and one for big cores. With the original limitation, this affinity method refused to work (due to nPackages > 1), which is a bit discriminatory.

Note that I tested both uniform (equal number of cores per clusters) and non-uniform (more LITTLE cores than big) arrangements.

What I couldn't test (but I may gain access to such an x86_64 machine in the future) is a multi-package hyper-threaded machine (more than one CPU on board, each with more than one core, each of which has more than one hardware thread). I need to check again what effect my changes have on such an arrangement (e.g. possible overwrites in the procarr array).

runtime/src/kmp_affinity.cpp
3994	The whole point of the change in line 3967 is to lift the limitation that the code is supposed to be run on single-package topology only.
3995	When nPackages > 1 and nCoresPerPkg > 1, nth_per_core is set to nCoresPerPkg. I did not change the algorithm, and I didn't change the names of the variables used either.
4655	Are you referring to the effect this has on line 4679? This code crashes when __kmp_ncores is used there instead.

Maybe I was not clear enough. The code works fine for multi-package non-uniform topology if no hyperthreads are available, but works incorrectly for multi-package non-uniform topology with hyperthreads. So the simplest fix would be to disable balanced affinity for this case.

I haven't finished debugging the no-HT case yet. I expect there should not be any problems, because the output of verbose affinity looks fine there.
I just wanted to flag the problem I found early...

Still, I need to fix it, as I plan to have a machine arranged like that.

AndreyChurbanov added inline comments. Jul 21 2016, 12:31 PM

runtime/src/kmp_affinity.cpp
4655	My comment was probably irrelevant here. The problem was the incorrect filling of the procarr array, which only contained info on the last package. So if this array is in use (I mostly tested the case on line 4710), all threads were bound to this last package. If you fix the memory overwriting at initialization, ncores will still need to be multiplied by the number of packages here, or an extra loop over packages needs to be added, I think. Otherwise I don't see how the whole machine can be covered, given that the loops for( int i = 0; i < ncores; i++ ) only walk through a subset of all cores now. BTW, I finished debugging the no-HT case. It works fine because you effectively reduced this case to a single package by shifting the topology one level down and making the balanced affinity see packages as cores and cores as threads. This trick does not work for HT-enabled topology.
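
The no-HT trick described in the comment above, treating packages as cores and cores as threads, can be sketched as follows. This is an illustrative model, not the patch itself: `CoreAddress` and `build_procarr` are hypothetical names, and `n_cores_per_pkg` is assumed to be the maximum core count over all packages.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified topology entry for a machine without
// hardware threads: a package label, a core label, and an OS proc id.
struct CoreAddress {
    int package;
    int core;
    int os_id;
};

// Build a procarr-like table with one row per package and one slot per
// core, i.e. packages play the role of "cores" and cores play the role
// of "threads". Slots with no matching processor stay at -1.
std::vector<int> build_procarr(const std::vector<CoreAddress>& addrs,
                               int n_packages, int n_cores_per_pkg) {
    assert(n_packages > 0 && n_cores_per_pkg > 0);
    std::vector<int> procarr(n_packages * n_cores_per_pkg, -1);
    for (const CoreAddress& a : addrs) {
        assert(a.package >= 0 && a.package < n_packages);
        assert(a.core >= 0 && a.core < n_cores_per_pkg);
        procarr[a.package * n_cores_per_pkg + a.core] = a.os_id;
    }
    return procarr;
}
```

With two packages holding 4 and 2 cores, the table has 8 slots and the last two remain -1, which is how a non-uniform arrangement shows up in this layout.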

I prepared and made use of three helper functions to make dealing with any case of non-uniform arrangement easier. I did not touch the uniform topology code, as it does not seem to be affected by the existence of the package layer: for the HT case there are always the two deepest levels in play anyway. I wonder if we need this uniform/non-uniform distinction at all; uniform topology is just a special case of non-uniform topology, and my new approach is capable of dealing with both.

Actually, there are four helper functions; I overlooked the smallest one, which only calls another one.

LGTM

The code may still degrade to affinity none in some exotic cases of non-uniform topology (when the address2os array has info on cores with hyperthreads numbered not from 0), but the value of the code far outweighs the need to support balanced affinity for such cases.

This revision is now accepted and ready to land. Jul 29 2016, 10:32 AM

Closed by commit rL277212: Make balanced affinity work on AArch64. (authored by pawosm01). Jul 29 2016, 2:02 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: runtime/src/kmp_affinity.cpp
Size: 62 lines

Diff 63996

runtime/src/kmp_affinity.cpp

[... 3,958 lines not shown ...]
case affinity_compact:		case affinity_compact:
if (__kmp_affinity_compact >= depth) {		if (__kmp_affinity_compact >= depth) {
__kmp_affinity_compact = depth - 1;		__kmp_affinity_compact = depth - 1;
}		}
goto sortAddresses;		goto sortAddresses;

case affinity_balanced:		case affinity_balanced:
// Balanced works only for the case of a single package		// Balanced works only for the case of a single package
if( nPackages > 1 ) {		if( depth <= 1 ) {
if( __kmp_affinity_verbose || __kmp_affinity_warnings ) {		if( __kmp_affinity_verbose || __kmp_affinity_warnings ) {
KMP_WARNING( AffBalancedNotAvail, "KMP_AFFINITY" );		KMP_WARNING( AffBalancedNotAvail, "KMP_AFFINITY" );
}		}
__kmp_affinity_type = affinity_none;		__kmp_affinity_type = affinity_none;
return;		return;
} else if( __kmp_affinity_uniform_topology() ) {		} else if( __kmp_affinity_uniform_topology() ) {
break;		break;
} else { // Non-uniform topology		} else { // Non-uniform topology

// Save the depth for further usage		// Save the depth for further usage
__kmp_aff_depth = depth;		__kmp_aff_depth = depth;

// Number of hyper threads per core in HT machine		// Number of hyper threads per core in HT machine
int nth_per_core = __kmp_nThreadsPerCore;		int nth_per_core = __kmp_nThreadsPerCore;

int core_level;		int core_level;
if( nth_per_core > 1 ) {		if( nth_per_core > 1 ) {
core_level = depth - 2;		core_level = depth - 2;
} else {		} else {
		if(( nPackages > 1) && ( nCoresPerPkg > 1)) {
		core_level = depth - 2;
		nth_per_core = nCoresPerPkg;
		} else {
core_level = depth - 1;		core_level = depth - 1;
}		}
		}
int ncores = address2os[ __kmp_avail_proc - 1 ].first.labels[ core_level ] + 1;		int ncores = address2os[ __kmp_avail_proc - 1 ].first.labels[ core_level ] + 1;
		AndreyChurbanov: This is the number of cores in the last package, as the code is supposed to be run on single-package topology only.
		pawosm01 (author): The whole point of the change in line 3967 is to lift the limitation that the code is supposed to be run on single-package topology only.
int nproc = nth_per_core * ncores;		int nproc = nth_per_core * ncores;
		AndreyChurbanov: This code does not take the number of packages into account (it is supposed to be run on single-package topology).
		pawosm01 (author): When nPackages > 1 and nCoresPerPkg > 1, nth_per_core is set to nCoresPerPkg. I did not change the algorithm, and I didn't change the names of the variables used either.

procarr = ( int * )__kmp_allocate( sizeof( int ) * nproc );		procarr = ( int * )__kmp_allocate( sizeof( int ) * nproc );
for( int i = 0; i < nproc; i++ ) {		for( int i = 0; i < nproc; i++ ) {
procarr[ i ] = -1;		procarr[ i ] = -1;
}		}

for( int i = 0; i < __kmp_avail_proc; i++ ) {		for( int i = 0; i < __kmp_avail_proc; i++ ) {
int proc = address2os[ i ].second;		int proc = address2os[ i ].second;
// If depth == 3 then level=0 - package, level=1 - core, level=2 - thread.		// If depth == 3 then level=0 - package, level=1 - core, level=2 - thread.
// If there is only one thread per core then depth == 2: level 0 - package,		// If there is only one thread per core then depth == 2: level 0 - package,
// level 1 - core.		// level 1 - core.
int level = depth - 1;		int level = depth - 1;

// __kmp_nth_per_core == 1		// __kmp_nth_per_core == 1
int thread = 0;		int thread = 0;
int core = address2os[ i ].first.labels[ level ];		int core = address2os[ i ].first.labels[ level ];
// If the thread level exists, that is we have more than one thread context per core		// If the thread level exists, that is we have more than one thread context per core
if( nth_per_core > 1 ) {		if( nth_per_core > 1 ) {
thread = address2os[ i ].first.labels[ level ] % nth_per_core;		thread = address2os[ i ].first.labels[ level ] % nth_per_core;
core = address2os[ i ].first.labels[ level - 1 ];		core = address2os[ i ].first.labels[ level - 1 ];
}		}
procarr[ core * nth_per_core + thread ] = proc;		procarr[ core * nth_per_core + thread ] = proc;
		AndreyChurbanov: Here the memory gets overwritten when the last package has fewer cores than the maximum available, because (core*nth_per_core+thread) can be nproc or larger, so the write lands outside the array bounds.
}		}

break;		break;
}		}

sortAddresses:		sortAddresses:
//		//
// Allocate the gtid->affinity mask table.		// Allocate the gtid->affinity mask table.
[... 530 lines not shown ...]

return KMP_CPU_ISSET(proc, (kmp_affin_mask_t *)(*mask));		return KMP_CPU_ISSET(proc, (kmp_affin_mask_t *)(*mask));
}		}


// Dynamic affinity settings - Affinity balanced		// Dynamic affinity settings - Affinity balanced
void __kmp_balanced_affinity( int tid, int nthreads )		void __kmp_balanced_affinity( int tid, int nthreads )
{		{
		bool fine_gran = true;

		switch (__kmp_affinity_gran) {
		case affinity_gran_fine:
		case affinity_gran_thread:
		break;
		case affinity_gran_core:
		if( __kmp_nThreadsPerCore > 1) {
		fine_gran = false;
		}
		break;
		case affinity_gran_package:
		if( nCoresPerPkg > 1) {
		fine_gran = false;
		}
		break;
		default:
		fine_gran = false;
		}

if( __kmp_affinity_uniform_topology() ) {		if( __kmp_affinity_uniform_topology() ) {
int coreID;		int coreID;
int threadID;		int threadID;
// Number of hyper threads per core in HT machine		// Number of hyper threads per core in HT machine
int __kmp_nth_per_core = __kmp_avail_proc / __kmp_ncores;		int __kmp_nth_per_core = __kmp_avail_proc / __kmp_ncores;
// Number of cores		// Number of cores
int ncores = __kmp_ncores;		int ncores = __kmp_ncores;
		if(( nPackages > 1) && ( __kmp_nth_per_core <= 1)) {
		__kmp_nth_per_core = __kmp_avail_proc / nPackages;
		ncores = nPackages;
		}
// How many threads will be bound to each core		// How many threads will be bound to each core
int chunk = nthreads / ncores;		int chunk = nthreads / ncores;
// How many cores will have an additional thread bound to it - "big cores"		// How many cores will have an additional thread bound to it - "big cores"
int big_cores = nthreads % ncores;		int big_cores = nthreads % ncores;
// Number of threads on the big cores		// Number of threads on the big cores
int big_nth = ( chunk + 1 ) * big_cores;		int big_nth = ( chunk + 1 ) * big_cores;
if( tid < big_nth ) {		if( tid < big_nth ) {
coreID = tid / (chunk + 1 );		coreID = tid / (chunk + 1 );
threadID = ( tid % (chunk + 1 ) ) % __kmp_nth_per_core ;		threadID = ( tid % (chunk + 1 ) ) % __kmp_nth_per_core ;
} else { //tid >= big_nth		} else { //tid >= big_nth
coreID = ( tid - big_cores ) / chunk;		coreID = ( tid - big_cores ) / chunk;
threadID = ( ( tid - big_cores ) % chunk ) % __kmp_nth_per_core ;		threadID = ( ( tid - big_cores ) % chunk ) % __kmp_nth_per_core ;
}		}

KMP_DEBUG_ASSERT2(KMP_AFFINITY_CAPABLE(),		KMP_DEBUG_ASSERT2(KMP_AFFINITY_CAPABLE(),
"Illegal set affinity operation when not capable");		"Illegal set affinity operation when not capable");

kmp_affin_mask_t *mask;		kmp_affin_mask_t *mask;
KMP_CPU_ALLOC_ON_STACK(mask);		KMP_CPU_ALLOC_ON_STACK(mask);
KMP_CPU_ZERO(mask);		KMP_CPU_ZERO(mask);

// Granularity == thread		if( fine_gran) {
if( __kmp_affinity_gran == affinity_gran_fine || __kmp_affinity_gran == affinity_gran_thread) {
int osID = address2os[ coreID * __kmp_nth_per_core + threadID ].second;		int osID = address2os[ coreID * __kmp_nth_per_core + threadID ].second;
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
} else if( __kmp_affinity_gran == affinity_gran_core ) { // Granularity == core		} else {
for( int i = 0; i < __kmp_nth_per_core; i++ ) {		for( int i = 0; i < __kmp_nth_per_core; i++ ) {
int osID;		int osID;
osID = address2os[ coreID * __kmp_nth_per_core + i ].second;		osID = address2os[ coreID * __kmp_nth_per_core + i ].second;
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
}		}
}		}
if (__kmp_affinity_verbose) {		if (__kmp_affinity_verbose) {
char buf[KMP_AFFIN_MASK_PRINT_LEN];		char buf[KMP_AFFIN_MASK_PRINT_LEN];
[... 10 lines not shown ...]
KMP_CPU_ZERO(mask);		KMP_CPU_ZERO(mask);

// Number of hyper threads per core in HT machine		// Number of hyper threads per core in HT machine
int nth_per_core = __kmp_nThreadsPerCore;		int nth_per_core = __kmp_nThreadsPerCore;
int core_level;		int core_level;
if( nth_per_core > 1 ) {		if( nth_per_core > 1 ) {
core_level = __kmp_aff_depth - 2;		core_level = __kmp_aff_depth - 2;
} else {		} else {
		if(( nPackages > 1) && ( nCoresPerPkg > 1)) {
		nth_per_core = nCoresPerPkg;
		core_level = __kmp_aff_depth - 2;
		} else {
core_level = __kmp_aff_depth - 1;		core_level = __kmp_aff_depth - 1;
}		}
		}

// Number of cores - maximum value; it does not count trail cores with 0 processors		// Number of cores - maximum value; it does not count trail cores with 0 processors
int ncores = address2os[ __kmp_avail_proc - 1 ].first.labels[ core_level ] + 1;		int ncores = address2os[ __kmp_avail_proc - 1 ].first.labels[ core_level ] + 1;
		AndreyChurbanov: This is the number of cores in the last package. Other packages are skipped (no threads are bound there).
		pawosm01 (author): Are you referring to the effect this has on line 4679? This code crashes when __kmp_ncores is used there instead.
		AndreyChurbanov: My comment was probably irrelevant here. The problem was the incorrect filling of the procarr array, which only contained info on the last package. So if this array is in use (I mostly tested the case on line 4710), all threads were bound to this last package. If you fix the memory overwriting at initialization, ncores will still need to be multiplied by the number of packages here, or an extra loop over packages needs to be added, I think. Otherwise I don't see how the whole machine can be covered, given that the loops for( int i = 0; i < ncores; i++ ) only walk through a subset of all cores now. BTW, I finished debugging the no-HT case. It works fine because you effectively reduced this case to a single package by shifting the topology one level down and making the balanced affinity see packages as cores and cores as threads. This trick does not work for HT-enabled topology.

// For performance gain consider the special case nthreads == __kmp_avail_proc		// For performance gain consider the special case nthreads == __kmp_avail_proc
if( nthreads == __kmp_avail_proc ) {		if( nthreads == __kmp_avail_proc ) {
if( __kmp_affinity_gran == affinity_gran_fine || __kmp_affinity_gran == affinity_gran_thread) {		if( fine_gran) {
int osID = address2os[ tid ].second;		int osID = address2os[ tid ].second;
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
} else if( __kmp_affinity_gran == affinity_gran_core ) { // Granularity == core		} else {
int coreID = address2os[ tid ].first.labels[ core_level ];		int coreID = address2os[ tid ].first.labels[ core_level ];
// We'll count found osIDs for the current core; they can be not more than nth_per_core;		// We'll count found osIDs for the current core; they can be not more than nth_per_core;
// since the address2os is sortied we can break when cnt==nth_per_core		// since the address2os is sortied we can break when cnt==nth_per_core
int cnt = 0;		int cnt = 0;
for( int i = 0; i < __kmp_avail_proc; i++ ) {		for( int i = 0; i < __kmp_avail_proc; i++ ) {
int osID = address2os[ i ].second;		int osID = address2os[ i ].second;
int core = address2os[ i ].first.labels[ core_level ];		int core = address2os[ i ].first.labels[ core_level ];
if( core == coreID ) {		if( core == coreID ) {
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
cnt++;		cnt++;
if( cnt == nth_per_core ) {		if( cnt == nth_per_core ) {
break;		break;
}		}
}		}
}		}
}		}
} else if( nthreads <= __kmp_ncores ) {		} else if( nthreads <= ncores ) {

int core = 0;		int core = 0;
for( int i = 0; i < ncores; i++ ) {		for( int i = 0; i < ncores; i++ ) {
// Check if this core from procarr[] is in the mask		// Check if this core from procarr[] is in the mask
int in_mask = 0;		int in_mask = 0;
for( int j = 0; j < nth_per_core; j++ ) {		for( int j = 0; j < nth_per_core; j++ ) {
if( procarr[ i * nth_per_core + j ] != - 1 ) {		if( procarr[ i * nth_per_core + j ] != - 1 ) {
in_mask = 1;		in_mask = 1;
break;		break;
}		}
}		}
if( in_mask ) {		if( in_mask ) {
if( tid == core ) {		if( tid == core ) {
for( int j = 0; j < nth_per_core; j++ ) {		for( int j = 0; j < nth_per_core; j++ ) {
int osID = procarr[ i * nth_per_core + j ];		int osID = procarr[ i * nth_per_core + j ];
if( osID != -1 ) {		if( osID != -1 ) {
KMP_CPU_SET( osID, mask );		KMP_CPU_SET( osID, mask );
// For granularity=thread it is enough to set the first available osID for this core		// For fine granularity it is enough to set the first available osID for this core
if( __kmp_affinity_gran == affinity_gran_fine || __kmp_affinity_gran == affinity_gran_thread) {		if( fine_gran) {
break;		break;
}		}
}		}
}		}
break;		break;
} else {		} else {
core++;		core++;
}		}
}		}
}		}

} else { // nthreads > __kmp_ncores		} else { // nthreads > ncores

// Array to save the number of processors at each core		// Array to save the number of processors at each core
int* nproc_at_core = (int*)KMP_ALLOCA(sizeof(int)*ncores);		int* nproc_at_core = (int*)KMP_ALLOCA(sizeof(int)*ncores);
// Array to save the number of cores with "x" available processors;		// Array to save the number of cores with "x" available processors;
int* ncores_with_x_procs = (int*)KMP_ALLOCA(sizeof(int)*(nth_per_core+1));		int* ncores_with_x_procs = (int*)KMP_ALLOCA(sizeof(int)*(nth_per_core+1));
// Array to save the number of cores with # procs from x to nth_per_core		// Array to save the number of cores with # procs from x to nth_per_core
int* ncores_with_x_to_max_procs = (int*)KMP_ALLOCA(sizeof(int)*(nth_per_core+1));		int* ncores_with_x_to_max_procs = (int*)KMP_ALLOCA(sizeof(int)*(nth_per_core+1));

[... 63 lines not shown ...]
}		}
}		}
flag = 1;		flag = 1;
}		}
int sum = 0;		int sum = 0;
for( int i = 0; i < nproc; i++ ) {		for( int i = 0; i < nproc; i++ ) {
sum += newarr[ i ];		sum += newarr[ i ];
if( sum > tid ) {		if( sum > tid ) {
// Granularity == thread		if( fine_gran) {
if( __kmp_affinity_gran == affinity_gran_fine || __kmp_affinity_gran == affinity_gran_thread) {
int osID = procarr[ i ];		int osID = procarr[ i ];
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
} else if( __kmp_affinity_gran == affinity_gran_core ) { // Granularity == core		} else {
int coreID = i / nth_per_core;		int coreID = i / nth_per_core;
for( int ii = 0; ii < nth_per_core; ii++ ) {		for( int ii = 0; ii < nth_per_core; ii++ ) {
int osID = procarr[ coreID * nth_per_core + ii ];		int osID = procarr[ coreID * nth_per_core + ii ];
if( osID != -1 ) {		if( osID != -1 ) {
KMP_CPU_SET( osID, mask);		KMP_CPU_SET( osID, mask);
}		}
}		}
}		}
[... 56 lines not shown ...]
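
The uniform-topology branch of `__kmp_balanced_affinity` in the diff distributes threads with a chunk-plus-remainder scheme: every core gets `nthreads / ncores` threads, and the first `nthreads % ncores` cores ("big cores") get one more. The stand-alone sketch below mirrors that arithmetic for illustration only; `Placement` and `place_thread` are hypothetical names, not runtime APIs.

```cpp
#include <cassert>

// Hypothetical result type: which core and which hardware-thread slot
// a given OpenMP thread id would land on.
struct Placement {
    int core_id;
    int thread_id;
};

// Mirrors the chunk / big_cores arithmetic of the uniform branch.
Placement place_thread(int tid, int nthreads, int ncores, int nth_per_core) {
    assert(ncores > 0 && nth_per_core > 0);
    assert(tid >= 0 && tid < nthreads);

    int chunk = nthreads / ncores;          // threads bound to every core
    int big_cores = nthreads % ncores;      // cores that take one extra thread
    int big_nth = (chunk + 1) * big_cores;  // threads living on the big cores

    Placement p;
    if (tid < big_nth) {
        p.core_id = tid / (chunk + 1);
        p.thread_id = (tid % (chunk + 1)) % nth_per_core;
    } else {
        // chunk is nonzero here: if chunk were 0, big_nth would equal
        // nthreads and every tid would have taken the branch above.
        p.core_id = (tid - big_cores) / chunk;
        p.thread_id = ((tid - big_cores) % chunk) % nth_per_core;
    }
    return p;
}
```

For example, with 10 threads on 4 cores and 2 hardware threads per core, threads 0-2 go to core 0, threads 3-5 to core 1, threads 6-7 to core 2, and threads 8-9 to core 3.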